How to detect file type from content in Java

This tutorial shows how to detect file type from content in Java using Apache Tika.

File format detection is usually required in search engines, where crawled resources must be analysed, classified, tagged and indexed. We will use Apache Tika to do the job.

Apache Tika is a content analysis toolkit that detects the file format based on the file contents. Further, it can extract metadata and text content from various documents – from PPT to CSV to PDF – using existing parser libraries. Tika unifies these parsers under a single interface to allow easy parsing of over a thousand different file types. Tika is useful for search engine indexing, content analysis, translation, and much more.
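To make "detects the file format based on the file contents" concrete, here is a minimal sketch of the underlying idea: comparing the first bytes of a file against known signatures ("magic numbers"). This is not Tika's implementation. The class name MagicSniffer and its three-entry signature list are illustrative assumptions, and it uses only the Java standard library (Java 9 or later for readNBytes).

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

// Hypothetical helper, not part of Tika: guesses a MIME type from leading bytes.
final class MagicSniffer {

    private static final byte[] ZIP = {0x50, 0x4B, 0x03, 0x04};         // "PK" 0x03 0x04
    private static final byte[] PDF = {0x25, 0x50, 0x44, 0x46, 0x2D};   // "%PDF-"
    private static final byte[] PNG = {(byte) 0x89, 0x50, 0x4E, 0x47};  // 0x89 "PNG"

    private MagicSniffer() {}

    // Returns a MIME type for the given leading bytes, or a generic fallback.
    static String sniff(byte[] head) {
        if (startsWith(head, ZIP)) return "application/zip";
        if (startsWith(head, PDF)) return "application/pdf";
        if (startsWith(head, PNG)) return "image/png";
        return "application/octet-stream";
    }

    // Reads at most 8 bytes from the file; the file name is never consulted.
    static String sniff(Path file) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            byte[] head = new byte[8];
            int n = in.readNBytes(head, 0, head.length);
            return sniff(Arrays.copyOf(head, n));
        }
    }

    private static boolean startsWith(byte[] data, byte[] magic) {
        return data.length >= magic.length
                && Arrays.equals(Arrays.copyOf(data, magic.length), magic);
    }
}
```

Because only the content is inspected, a ZIP archive is still recognised after its extension is removed. A toolkit such as Tika applies the same principle across a far larger set of formats.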

The file type detector class

Main class
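A minimal sketch of how a detector class and a main class might fit together, assuming Tika (for example the tika-core jar) is on the classpath. The class names FileTypeDetector and DetectFileType are hypothetical. The only Tika call used is the Tika facade's detect(File), which returns the detected media type as a string.

```java
import java.io.File;
import java.io.IOException;

import org.apache.tika.Tika;

// Hypothetical detector class: a thin wrapper around the Tika facade.
final class FileTypeDetector {

    private final Tika tika = new Tika();

    // Returns the media type Tika detects for the file, e.g. "application/pdf".
    String detect(File file) throws IOException {
        return tika.detect(file);
    }
}

// Hypothetical main class: takes the file to analyse as its only argument.
final class DetectFileType {

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("Usage: DetectFileType <file>");
            System.exit(1);
        }
        File file = new File(args[0]);
        System.out.println(new FileTypeDetector().detect(file));
    }
}
```

Tika's detect(File) takes both the file content and a known file extension into account, which is why the test removes the extension to make the content decide.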

Running the example

To test the example, we rename tika-1.6-src.zip to tika-1.6-src, removing its extension so that Tika has to analyze the file contents to detect its type. The command line to launch our main class would be as shown below. Our example accepts the file to be analyzed as a command line parameter and detects its type from the content.

Output

 


 

Comment from Paco (7 years ago):

File file = new File(FILENAME);
String mimeType = new Tika().detect(file);