How to detect file type from content in Java

This tutorial shows how to detect file type from content in Java using Apache Tika.

File format detection is commonly needed in search engines, where crawled resources must be analysed, classified, tagged and indexed. We will use Apache Tika to do the job.

Apache Tika is a content analysis toolkit that detects the file format based on the file contents. Further, it can extract metadata and text content from various documents – from PPT to CSV to PDF – using existing parser libraries. Tika unifies these parsers under a single interface to allow easy parsing of over a thousand different file types. Tika is useful for search engine indexing, content analysis, translation, and much more.
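To make "detecting the format from the contents" concrete, here is a minimal sketch that compares the leading bytes of some data against three well-known signatures (ZIP, PDF, PNG). It is only an illustration of the idea and not Tika's implementation; the class name MagicSniffer and its methods are hypothetical, and a real detector handles far more formats and edge cases.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Simplified magic-byte sniffer; illustrative only, not Tika's implementation. */
final class MagicSniffer {

    // Well-known leading signatures for three formats.
    private static final byte[] ZIP = {0x50, 0x4B, 0x03, 0x04}; // "PK" 03 04
    private static final byte[] PDF = "%PDF-".getBytes(StandardCharsets.US_ASCII);
    private static final byte[] PNG =
            {(byte) 0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A};

    private MagicSniffer() {}

    /** Returns a MIME type guessed from the first bytes of the content. */
    static String detect(byte[] head) {
        if (startsWith(head, ZIP)) return "application/zip";
        if (startsWith(head, PDF)) return "application/pdf";
        if (startsWith(head, PNG)) return "image/png";
        return "application/octet-stream"; // generic fallback for unknown content
    }

    /** True if data begins with exactly the bytes in prefix. */
    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) return false;
        return Arrays.equals(Arrays.copyOf(data, prefix.length), prefix);
    }

    public static void main(String[] args) {
        byte[] sample = "%PDF-1.4\n".getBytes(StandardCharsets.US_ASCII);
        System.out.println(detect(sample)); // prints "application/pdf"
    }
}
```

The file name never enters the decision, which is why this style of detection still works when an extension is missing or wrong.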

The file type detector class

Main class

Running the example

To test the example, we rename the tika-1.6-src.zip file to tika-1.6-src, removing its extension so that Tika is forced to analyze the file contents to detect its type. Our main class is launched from the command line and accepts the file to be analyzed as a command line parameter; it then detects the file type from the content.
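As a side illustration (this is not the tutorial's main class), the sketch below shows why dropping the extension does not hide the format: it writes a small ZIP archive to a file with no extension and then checks that the file still begins with the ZIP local file header signature. It uses only the JDK, not Tika; the class name ExtensionlessZipDemo, its methods, and the file name sample-archive are hypothetical.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

/** Shows that a ZIP file keeps its signature even when its name has no extension. */
final class ExtensionlessZipDemo {

    private ExtensionlessZipDemo() {}

    /** Writes a one-entry ZIP archive to the given path. */
    static void writeZip(Path target) throws IOException {
        try (OutputStream out = Files.newOutputStream(target);
             ZipOutputStream zip = new ZipOutputStream(out)) {
            zip.putNextEntry(new ZipEntry("hello.txt"));
            zip.write("hello".getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        }
    }

    /** True if the file starts with the bytes 50 4B 03 04 ("PK" 03 04). */
    static boolean looksLikeZip(Path file) throws IOException {
        byte[] head = new byte[4];
        int read = 0;
        try (InputStream in = Files.newInputStream(file)) {
            // Loop because a single read() may return fewer bytes than requested.
            while (read < head.length) {
                int n = in.read(head, read, head.length - read);
                if (n < 0) break; // end of file reached early
                read += n;
            }
        }
        return read == 4
                && head[0] == 0x50 && head[1] == 0x4B
                && head[2] == 0x03 && head[3] == 0x04;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("detect-demo");
        Path file = dir.resolve("sample-archive"); // deliberately no ".zip"
        writeZip(file);
        System.out.println(looksLikeZip(file)); // prints "true"
        Files.delete(file);
        Files.delete(dir);
    }
}
```

A content-based detector such as Tika can rely on this kind of evidence inside the file, so renaming the archive only removes the hint carried by its name.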

Output

 


 

Comment from Paco (guest):

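import java.io.File;
import org.apache.tika.Tika;
File file = new File(FILENAME);
String mimeType = new Tika().detect(file); // detect(File) throws IOException if the file cannot be read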