Mime-type detection done right
I'm currently facing a problem I find more than interesting: detecting the mime-type of a given file. By detecting, I mean trying to guess the mime type using only information present in the file. By file, I mean a structure that has a name and a content.
Here are the solutions I know to this problem:
- Trying to guess the file type depending on the file name. For example, if the file name is foo.txt, I can assume that the mime-type is text/plain.
- Trying to determine the type using the content, especially the first bytes that usually contain some sort of magic code. For example, if the file begins with the octets 0xCAFEBABE, I can assume the mime-type is application/x-java-class.
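To make the two approaches concrete, here is a minimal Python sketch of both. The lookup tables and the function names (guess_by_name, guess_by_content) are hypothetical and deliberately tiny; a real detector would need far more entries.

```python
import os

# Hypothetical, deliberately tiny tables used only for illustration.
EXTENSION_TYPES = {
    ".txt": "text/plain",
    ".class": "application/x-java-class",
}

MAGIC_TYPES = [
    # (leading bytes, mime-type)
    (b"\xca\xfe\xba\xbe", "application/x-java-class"),
]


def guess_by_name(filename):
    """Approach 1: look only at the extension; None if there is none or it is unknown."""
    _, ext = os.path.splitext(filename)
    return EXTENSION_TYPES.get(ext.lower())


def guess_by_content(data):
    """Approach 2: compare the first bytes against known magic codes."""
    for magic, mime in MAGIC_TYPES:
        if data.startswith(magic):
            return mime
    return None
```

With these tables, guess_by_name("LICENSE") yields None because there is no extension to look up, and guess_by_content has nothing to say about content that carries no known magic code.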
The two approaches to this problem come with their advantages and drawbacks.
The first solution is very efficient, but it assumes that the file has a correct name and an extension. How would one detect the mime-type of a file named LICENSE or README?
The second technique is a bit more complex, and has to actually read the data. It works very well for all the files containing a magic code, but works poorly for other files. Some problems may arise: how to tell the difference between an MS-DOS EXE file (starting with MZ as magic code) and an actual text/plain file starting with the letters MZ? A lot of similar problems arise when you consider other file types (txt vs csv; html vs xml vs xhtml).
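One possible way to attack the MZ ambiguity is to look past the magic code and ask whether the leading bytes look like text at all. The sketch below is only a heuristic, not a definitive test: the 512-byte sample size, the helper names and the mime-type application/x-dosexec are assumptions made for illustration.

```python
# Control bytes that commonly appear in plain text: tab, line feed, carriage return.
TEXT_CONTROL = {9, 10, 13}


def looks_like_text(data, sample_size=512):
    """Heuristic: the sample holds no control bytes other than tab, LF and CR."""
    sample = data[:sample_size]
    return all(b >= 32 or b in TEXT_CONTROL for b in sample)


def classify_mz(data):
    """Return a guess for data starting with 'MZ', or None when it does not."""
    if not data.startswith(b"MZ"):
        return None
    if looks_like_text(data):
        # The magic code matched, but the content reads like ordinary text.
        return "text/plain"
    # Assumed mime-type for MS-DOS executables.
    return "application/x-dosexec"
```

A real executable header contains NUL and other control bytes within its first few dozen octets, so it fails the text check, while a note that happens to begin with the letters MZ passes it. The heuristic can still be fooled, which is exactly why the question is hard.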
So here comes the real question: how to detect the mime-type of a file efficiently and reliably?
Some side notes:
- I know lots and lots of libraries exist out there that do the job. I'm not interested in the libraries. I'm interested in getting my hands dirty.
- No specific language. I'm interested in the general algorithm(s), not a specific implementation.
The answer to your question is probably just "regular expressions", as you are asking for algorithms, not tools. Looking for patterns in the file's content is surely the best way to decide what the file is. If in doubt, you can look at the file extension (if available) as well, but you shouldn't rely on it. For example, on UNIX systems the OS doesn't care about the file extension when deciding whether it can execute a file or not.
The task itself is trivial from an algorithmic point of view: gather regular expressions that identify different file types. But that's a lot of work. For every file type you'd like to have recognized, you need to get familiar with its design to be able to write an expression that really does recognize it with only a minimum of false positives and false negatives.
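As a rough illustration of that idea, here is a minimal Python sketch of a pattern table that is tried in order, with the file extension used only as a last resort. The three patterns, the 1024-byte window, the fallback table and the function name detect are all assumptions chosen for the example, not a vetted signature set.

```python
import os
import re

# Hypothetical signature table; order matters because the first match wins.
SIGNATURES = [
    (re.compile(rb"\xca\xfe\xba\xbe"), "application/x-java-class"),
    (re.compile(rb"\s*<\?xml\b"), "application/xml"),
    (re.compile(rb"\s*(<!DOCTYPE\s+html|<html\b)", re.IGNORECASE), "text/html"),
]

# Hypothetical fallback, consulted only when no pattern matches.
EXTENSION_FALLBACK = {".txt": "text/plain", ".csv": "text/csv"}


def detect(data, filename=None, default="application/octet-stream"):
    """Guess a mime-type from content first, then from the file extension."""
    head = data[:1024]  # only inspect the beginning of the file
    for pattern, mime in SIGNATURES:
        if pattern.match(head):  # match() anchors at the start of head
            return mime
    if filename is not None:
        _, ext = os.path.splitext(filename)
        return EXTENSION_FALLBACK.get(ext.lower(), default)
    return default
```

Note that an XHTML document starting with an XML declaration would be reported as application/xml here. Telling such close relatives apart is where most of the per-format work goes.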
So why bother trying to solve a problem that other people have already invested heavily in? As you probably know, the most widespread solution is the UNIX tool file and its library libmagic, which can be used in your programs easily. Bindings to the most common scripting languages exist. The file utility's "magic" database is probably the most comprehensive out there, knowing about exotic file types you've never heard of before (since they've been out of widespread use for years or decades) and having been tuned and fixed for a long time now (a whopping 38 years now).