Determine whether a file contains binary or ASCII data

2023-01-23 20:38

I take a file as an input argument and need to determine whether its data is binary or text (well, ASCII or binary, I guess), similar to the 'file' command on *nix, but within my application.

I'm not sure how to do this, as when I'm reading data I'm doing it as such:

fread(&rndByte, sizeof(unsigned int), 1, fp);
// reading one unsigned int at a time from file fp

I was thinking of testing whether the value is < 128 numerous times, but I have no idea how to test this when reading an entire int at a time. I've thought of looping over 1 byte at a time and checking that way, but the system I'm on doesn't like the shifts I'm doing.

Any ideas, suggestions?

"I was thinking of testing if the value is < 128"

It's naïve to think that text, even in English, will never contain characters outside Basic Latin. Microsoft® programs especially are fond of adding dashes — and “smart quotes” to text.

A better approach is to look for ASCII control characters. A text file will tend to have a lot of line breaks (\n and/or \r depending on platform), and perhaps some tabs, but almost never any of the other control characters.
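A rough sketch of that control-character check, in case it helps. The function name count_control_bytes is made up for illustration, and treating DEL (0x7F) as suspicious is my own assumption rather than something stated above. Bytes with the high bit set are deliberately not counted, so dashes and smart quotes pass.

```c
#include <stddef.h>

/* Hypothetical helper: counts bytes that are ASCII control characters
 * other than tab, line feed and carriage return. DEL (0x7F) is also
 * counted, which is an assumption of this sketch. Bytes >= 0x80 are
 * ignored so that non-ASCII text is not penalised. */
size_t count_control_bytes(const unsigned char *buf, size_t len)
{
    size_t count = 0;

    for (size_t i = 0; i < len; i++) {
        unsigned char c = buf[i];

        if (c == '\t' || c == '\n' || c == '\r')
            continue;               /* normal text whitespace */
        if (c < 0x20 || c == 0x7F)
            count++;                /* NUL, BEL, ESC, DEL, ... */
    }
    return count;
}
```

Since text "almost never" contains these bytes, the caller could treat any non-zero count as binary, or allow a small fraction per buffer if stray form feeds or escape sequences are expected.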

As others have said (albeit less bluntly) it's completely backwards to limit text to ASCII in 2010. As the probability of non-text binary data parsing as UTF-8 is extremely low, a much better approach would be to try parsing the whole file as UTF-8, and declaring it binary upon the first failure.

As others have also said, rather than calling fread or fgetc over and over again on tiny units, you should fread large chunks (1-4 KB) at a time into a fixed-size buffer and run your parser over that, reading a new chunk whenever you reach the end. (And if your UTF-8 parser is not easily restartable, it might make sense to memcpy the end of the buffer back to the beginning and refill whenever you have fewer than 4 bytes left in the buffer.)
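One way to put the two suggestions together is a validator that keeps its state between calls, so a multi-byte sequence split across two chunks needs no memcpy juggling. This is only a sketch: utf8_state, utf8_init, utf8_feed and file_is_utf8 are names invented here, and the 4096-byte buffer is an arbitrary choice. The byte ranges follow the usual well-formed UTF-8 table (no overlong forms, no surrogates, nothing above U+10FFFF).

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical restartable validator state. */
typedef struct {
    int need;             /* continuation bytes still expected */
    unsigned char lo, hi; /* allowed range for the next continuation byte */
} utf8_state;

void utf8_init(utf8_state *s)
{
    s->need = 0;
    s->lo = 0x80;
    s->hi = 0xBF;
}

/* Feeds one chunk. Returns 0 at the first invalid byte, 1 otherwise. */
int utf8_feed(utf8_state *s, const unsigned char *buf, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        unsigned char c = buf[i];

        if (s->need == 0) {
            if (c <= 0x7F) {
                continue;                     /* plain ASCII */
            } else if (c >= 0xC2 && c <= 0xDF) {
                s->need = 1;
            } else if (c >= 0xE0 && c <= 0xEF) {
                s->need = 2;
                if (c == 0xE0) s->lo = 0xA0;  /* reject overlong forms */
                if (c == 0xED) s->hi = 0x9F;  /* reject surrogates */
            } else if (c >= 0xF0 && c <= 0xF4) {
                s->need = 3;
                if (c == 0xF0) s->lo = 0x90;  /* reject overlong forms */
                if (c == 0xF4) s->hi = 0x8F;  /* stay at or below U+10FFFF */
            } else {
                return 0;                     /* invalid lead byte */
            }
        } else {
            if (c < s->lo || c > s->hi)
                return 0;                     /* bad continuation byte */
            s->lo = 0x80;
            s->hi = 0xBF;
            s->need--;
        }
    }
    return 1;
}

/* Returns 1 if the whole stream is valid UTF-8, 0 if not, -1 on read error. */
int file_is_utf8(FILE *fp)
{
    unsigned char buf[4096];
    utf8_state s;
    size_t n;

    utf8_init(&s);
    while ((n = fread(buf, 1, sizeof buf, fp)) > 0) {
        if (!utf8_feed(&s, buf, n))
            return 0;
    }
    if (ferror(fp))
        return -1;
    return s.need == 0;   /* a sequence cut off at EOF is invalid */
}
```

Note that NUL and other control bytes are valid UTF-8, so a file full of zeros would pass this test. It probably makes sense to combine it with a control-character check.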

Use fread() to grab a whole 1024-byte (or 512 or whatever works for you) buffer and then scan that buffer byte by byte looking for something with the eighth bit set. That's probably pretty close to what file(1) does, except file(1) has more complex patterns to consider and it probably doesn't bother with such a large buffer.
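Something along these lines, as a minimal sketch. Reading into an unsigned char buffer means each element is already a single byte, so no shifting is needed. The names has_high_bit and file_is_7bit are invented for this example.

```c
#include <stddef.h>
#include <stdio.h>

/* Returns 1 if any byte in the buffer has its eighth bit set. */
int has_high_bit(const unsigned char *buf, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        if (buf[i] & 0x80)
            return 1;
    }
    return 0;
}

/* Returns 1 if every byte in the stream is 7-bit, 0 if not, -1 on read error. */
int file_is_7bit(FILE *fp)
{
    unsigned char buf[1024];
    size_t n;

    /* Element size 1, count sizeof buf: fread returns the byte count read. */
    while ((n = fread(buf, 1, sizeof buf, fp)) > 0) {
        if (has_high_bit(buf, n))
            return 0;
    }
    return ferror(fp) ? -1 : 1;
}
```

Open the file with fopen(path, "rb") so that no newline translation gets in the way on platforms that distinguish text and binary streams.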

You could also grab the source for file and learn how it operates.
