How to detect file type from content in Java
This tutorial shows how to detect file type from content in Java using Apache Tika.
File format detection is commonly required in search engines, where crawled resources have to be analysed, classified, tagged and indexed. We will use Apache Tika to do the job.
Apache Tika is a content analysis toolkit that detects the file format based on the file contents. Further, it can extract metadata and text content from various documents – from PPT to CSV to PDF – using existing parser libraries. Tika unifies these parsers under a single interface to allow easy parsing of over a thousand different file types. Tika is useful for search engine indexing, content analysis, translation, and much more.
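To give a feel for what "detecting the format from the contents" means, here is a minimal sketch that matches the first bytes of a file against three well-known signatures. It is not how Tika is implemented; Tika relies on a far larger set of signatures and additional hints. The class name MagicSniffer and its methods are hypothetical and exist only for this illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Hypothetical helper, not part of Tika: maps a few file signatures to MIME types. */
final class MagicSniffer {

    // Leading bytes ("magic numbers") of three common formats
    private static final byte[] ZIP = {0x50, 0x4B, 0x03, 0x04}; // "PK" 0x03 0x04
    private static final byte[] PDF = "%PDF-".getBytes(StandardCharsets.US_ASCII);
    private static final byte[] PNG = {(byte) 0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A};

    /** Returns a MIME type for the leading bytes of a file, or null if no signature matches. */
    static String sniff(byte[] head) {
        if (startsWith(head, ZIP)) return "application/zip";
        if (startsWith(head, PDF)) return "application/pdf";
        if (startsWith(head, PNG)) return "image/png";
        return null;
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        return data.length >= prefix.length
                && Arrays.equals(Arrays.copyOf(data, prefix.length), prefix);
    }

    public static void main(String[] args) {
        byte[] zipHead = {0x50, 0x4B, 0x03, 0x04, 0x14, 0x00};
        System.out.println(sniff(zipHead)); // prints application/zip
    }
}
```

A hand-written table like this gets out of date quickly, which is the main reason to delegate the job to a library such as Tika.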
The file type detector class
    package com.wilddiary.utils;

    import java.io.IOException;
    import java.nio.file.Path;
    import java.nio.file.spi.FileTypeDetector;

    import org.apache.tika.Tika;
    import org.apache.tika.mime.MimeTypes;

    public class TikaFileTypeDetector extends FileTypeDetector {

        private final Tika tika = new Tika();

        public TikaFileTypeDetector() {
            super();
        }

        @Override
        public String probeContentType(Path path) throws IOException {
            // Try to detect based on the file name only for efficiency
            String fileNameDetect = tika.detect(path.toString());
            if (!fileNameDetect.equals(MimeTypes.OCTET_STREAM)) {
                return fileNameDetect;
            }

            // Then check the file content if necessary
            String fileContentDetect = tika.detect(path.toFile());
            if (!fileContentDetect.equals(MimeTypes.OCTET_STREAM)) {
                return fileContentDetect;
            }

            // Specification says to return null if we could not
            // conclusively determine the file type
            return null;
        }
    }
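Because TikaFileTypeDetector extends java.nio.file.spi.FileTypeDetector, it can also be installed as a service provider, in which case Files.probeContentType consults it before falling back to the system default detector. The sketch below assumes the class is on the class path together with a provider-configuration file named META-INF/services/java.nio.file.spi.FileTypeDetector whose single line is com.wilddiary.utils.TikaFileTypeDetector. The class name ProbeViaFiles and the default file name are hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical caller that never references Tika directly. */
final class ProbeViaFiles {

    /** Asks the installed FileTypeDetector implementations, then the system default. */
    static String probe(Path path) throws IOException {
        // May return null when no detector recognises the file
        return Files.probeContentType(path);
    }

    public static void main(String[] args) throws IOException {
        // Falls back to an assumed file name when no argument is given
        Path path = Paths.get(args.length > 0 ? args[0] : "tika-1.6-src");
        System.out.println("File is of type - " + probe(path));
    }
}
```

Without the provider-configuration file the code still runs, but only the detectors that ship with the JDK are used, and those may return null for a file without an extension.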
Main class
    package com.wilddiary.utils;

    import java.io.IOException;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.nio.file.spi.FileTypeDetector;

    public class TikaExample {

        public static void main(String[] args) throws IOException {

            // Expects the file path as the only program argument
            if (args.length != 1) {
                printUsage();
                return;
            }

            Path path = Paths.get(args[0]);
            FileTypeDetector detector = new TikaFileTypeDetector();

            // Analyse the file, first based on the file name for efficiency.
            // If the name is not conclusive, analyse the content.
            String contentType = detector.probeContentType(path);

            System.out.println("File is of type - " + contentType);
        }

        public static void printUsage() {
            System.out.print("Usage: java -classpath ... "
                    + TikaExample.class.getName() + " <file path>");
        }
    }
Running the example
To test the example, we rename tika-1.6-src.zip to tika-1.6-src, removing its extension so that Tika has to analyze the file contents to detect its type. Our example accepts the file to be analyzed as a command line parameter and detects the file type from its content. The command line to launch our main class is shown below.
    java -classpath .:<parent directory of our example package>:tika-app-1.6.jar com.wilddiary.utils.TikaExample <full path to tika-1.6-src>
Output
    File is of type - application/zip
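If the Tika source archive is not at hand, any ZIP file without an extension serves the same purpose. As a rough sketch, the helper below writes a one-entry archive under an extension-less name using only the JDK; the class name ExtensionlessZip, the entry name and the file name sample-archive are all made up for this illustration.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

/** Hypothetical helper that produces a test file for the detector. */
final class ExtensionlessZip {

    /** Writes a one-entry ZIP archive to the given path and returns that path. */
    static Path write(Path target) throws IOException {
        try (OutputStream out = Files.newOutputStream(target);
             ZipOutputStream zip = new ZipOutputStream(out)) {
            zip.putNextEntry(new ZipEntry("hello.txt"));
            zip.write("hello".getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        }
        return target;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("tika-demo");
        // No ".zip" suffix, so the name alone gives the detector nothing to go on
        Path file = write(dir.resolve("sample-archive"));
        System.out.println(file.toAbsolutePath());
    }
}
```

The printed path can then be passed to TikaExample in place of the renamed source archive.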
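If a FileTypeDetector is not needed, Tika can also be called directly, with FILENAME standing for the path of the file to inspect:
    File file = new File(FILENAME);
    String mimeType = new Tika().detect(file);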