Mime-type detection done right

2023-03-14 18:05

I'm currently facing a problem I find more than interesting: detecting the mime-type of a given file. By detecting, I mean trying to guess the mime-type using only information present in the file. By file, I mean a structure that has a name and some content.

Here are the solutions I know to this problem:

Trying to guess the file type from the file name. For example, if the file name is foo.txt, I can assume that the mime-type is text/plain.
Trying to determine the type from the content, especially the first bytes, which usually contain some sort of magic code. For example, if the file begins with the bytes 0xCAFEBABE, I can assume the mime-type is application/x-java-class.

The two approaches to this problem come with their advantages and drawbacks.

The first solution is very efficient, but it assumes that the file has a correct name and an extension. How do you detect the mime-type of a file named LICENSE or README?
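A minimal sketch of the name-based approach, in Python. The table EXTENSION_TYPES and the function guess_from_name are hypothetical names made up for illustration, and the table is deliberately tiny. It shows both why the approach is cheap (one dictionary lookup, no I/O) and where it gives up:

```python
import os

# Hypothetical, deliberately tiny extension table; a real one would be far larger.
EXTENSION_TYPES = {
    ".txt": "text/plain",
    ".csv": "text/csv",
    ".html": "text/html",
    ".class": "application/x-java-class",
}

def guess_from_name(filename):
    """Return a mime-type guessed from the extension, or None if there is nothing to go on."""
    _, ext = os.path.splitext(filename)
    return EXTENSION_TYPES.get(ext.lower())

print(guess_from_name("foo.txt"))   # text/plain
print(guess_from_name("LICENSE"))   # None: no extension, so the name tells us nothing
```

Note that the function trusts the name completely: a PNG image renamed to foo.txt would still be reported as text/plain.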

The second technique is a bit more complex, and has to actually read the data. It works very well for files that contain a magic code, but poorly for other files. Some problems may arise: how do you tell the difference between an MS-DOS EXE file (starting with MZ as its magic code) and an actual text/plain file that starts with the letters MZ? A lot of similar problems arise when you consider other file types (txt vs csv; html vs xml vs xhtml).
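One possible way to handle the MZ ambiguity is to mark short, printable signatures as "could also be text" and to fall back to text/plain when the rest of the sample looks like text. The sketch below is only an illustration under assumptions: the SIGNATURES table, the "looks like text" heuristic (no NUL bytes, valid UTF-8) and the 512-byte sample size are all choices made up for this example, and application/x-dosexec is just an illustrative label for MZ executables.

```python
import codecs

# Illustrative signature table (an assumption, far from complete):
# (magic prefix, mime-type, can this prefix plausibly open a plain-text file?)
SIGNATURES = [
    (b"\xca\xfe\xba\xbe", "application/x-java-class", False),
    (b"\x89PNG\r\n\x1a\n", "image/png", False),
    (b"MZ", "application/x-dosexec", True),
]

def looks_like_text(sample):
    """Heuristic: no NUL bytes and valid UTF-8 (a cut-off final character is tolerated)."""
    if b"\x00" in sample:
        return False
    try:
        # final=False keeps an incomplete trailing multi-byte sequence from raising.
        codecs.getincrementaldecoder("utf-8")().decode(sample, final=False)
    except UnicodeDecodeError:
        return False
    return True

def guess_from_content(data, sample_size=512):
    """Guess a mime-type from the first bytes of the content."""
    sample = data[:sample_size]
    is_text = looks_like_text(sample)
    for magic, mime, ambiguous in SIGNATURES:
        if sample.startswith(magic):
            # A printable magic such as "MZ" followed by nothing but text is
            # more likely a text file than an executable.
            if ambiguous and is_text:
                return "text/plain"
            return mime
    return "text/plain" if is_text else "application/octet-stream"
```

With this sketch, a real executable header (which contains NUL bytes within its first few bytes) is still reported as an executable, while a note beginning with "MZ is ..." comes out as text/plain. The heuristic can still be fooled; it only narrows the ambiguity.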

So here comes the real question: how do you detect the mime-type of a file efficiently and reliably?

Some side notes:

I know lots and lots of libraries exist out there that do the job. I'm not interested in the libraries. I'm interested in getting my hands dirty.
No specific language. I'm interested in the general algorithm(s), not a specific implementation.

The answer to your question is probably just "regular expressions", since you are asking for algorithms, not tools. Looking for patterns in a file's content is the most reliable way to decide what it is. If in doubt, you can look at the file extension (if available) as well, but you should never rely on it. On UNIX systems, for example, the OS doesn't care about the file extension when deciding whether it can execute a file.
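The "content first, extension only when in doubt" rule can be sketched as follows. This is a minimal illustration, not a complete detector: the MAGIC and TEXT_EXTENSIONS tables and the detect function are hypothetical, and "in doubt" is taken here to mean "the content looks like text and matches no known signature".

```python
import os

# Hypothetical, tiny tables used only for this illustration.
MAGIC = [
    (b"\xca\xfe\xba\xbe", "application/x-java-class"),
    (b"\x89PNG\r\n\x1a\n", "image/png"),
]
TEXT_EXTENSIONS = {
    ".txt": "text/plain",
    ".csv": "text/csv",
    ".html": "text/html",
    ".xml": "application/xml",
}

def detect(filename, data):
    """Trust the content first; consult the extension only to refine a text guess."""
    for magic, mime in MAGIC:
        if data.startswith(magic):
            return mime  # content evidence wins, whatever the name says
    if b"\x00" in data[:512]:
        return "application/octet-stream"  # binary, but no known signature
    # Text-like content without a signature: here the extension may help.
    ext = os.path.splitext(filename)[1].lower()
    return TEXT_EXTENSIONS.get(ext, "text/plain")
```

A PNG image named photo.txt is still reported as image/png, while plain comma-separated text named data.csv is refined from text/plain to text/csv.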

The task itself is trivial from an algorithmic point of view: gather regular expressions that identify different file types. But that's a lot of work: for every file type you'd like to recognize, you need to become familiar with its format well enough to write an expression that recognizes it with a minimum of false positives and false negatives.
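As a rough sketch of what such a collection of expressions might look like, here is an ordered table for the html vs xml vs xhtml case from the question. The patterns, the PATTERNS table and the guess_from_patterns function are assumptions written for this example; they are simplified and will misjudge unusual documents. The important design point is the ordering: the most specific pattern is tried first, so XHTML is not swallowed by the more general HTML or XML rules.

```python
import re

# Ordered from most specific to most general (illustrative patterns only).
PATTERNS = [
    # Optional XML declaration and DOCTYPE, then <html ... xmlns="...xhtml">
    (re.compile(rb"""\A\s*(?:<\?xml[^>]*\?>\s*)?(?:<!DOCTYPE\s+html[^>]*>\s*)?"""
                rb"""<html[^>]*\sxmlns=["']http://www\.w3\.org/1999/xhtml["']""",
                re.IGNORECASE),
     "application/xhtml+xml"),
    # An HTML doctype or a bare <html> root element
    (re.compile(rb"\A\s*(?:<!DOCTYPE\s+html|<html[\s>])", re.IGNORECASE),
     "text/html"),
    # Any other document that opens with an XML declaration
    (re.compile(rb"\A\s*<\?xml\s"), "application/xml"),
]

def guess_from_patterns(data, sample_size=4096):
    """Return the mime-type of the first matching pattern, or None."""
    sample = data[:sample_size]
    for pattern, mime in PATTERNS:
        if pattern.match(sample):
            return mime
    return None
```

Each new file type means another entry in the table and another round of checking it against real files, which is exactly the "lot of work" mentioned above.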

So why bother trying to solve a problem that other people have already invested heavily in? As you probably know, the most widespread solution is the UNIX tool file and its library libmagic, which can easily be used from your own programs. Bindings exist for the most common scripting languages. The file utility's "magic" database is probably the most comprehensive out there: it knows about exotic file types you've never heard of (because they have been out of widespread use for years or decades), and it has been tuned and fixed for a long time (a whopping 38 years now).
