Determine whether a file contains binary or ASCII data
I take a file as an input argument and need to determine whether its data is ASCII or binary, similar to the 'file' command on *nix, but within my application.
I'm not sure how to do this, because I'm currently reading the data like this:
fread(&rndByte, sizeof(unsigned int), 1, fp);
// reading one unsigned int at a time from file fp
I was thinking of testing whether the value is < 128 numerous times, but I have no idea how to test this when reading an entire int at a time. I've thought of looping over 1 byte at a time and checking that way, but the system I'm on doesn't like the shifts I'm doing.
Any ideas, suggestions?
I was thinking of testing if the value is < 128
It's naïve to think that text, even in English, will never contain characters outside Basic Latin. Microsoft® programs especially are fond of adding dashes — and “smart quotes” to text.
A better approach is to look for ASCII control characters. A text file will tend to have a lot of line breaks (\n and/or \r depending on platform), and perhaps some tabs, but almost never any of the other control characters.
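A minimal sketch of that control-character check, assuming the data is already in a memory buffer. The function name looks_like_text is made up for illustration, and rejecting on the very first unexpected control byte is my assumption (the heuristic above only says such bytes are rare, so a tolerance threshold would also be reasonable):

```c
#include <stddef.h>

/* Hypothetical helper: returns 1 if buf looks like text, 0 if it looks binary.
   Assumption: any control byte other than tab, LF, CR or form feed means
   "binary". Bytes >= 0x80 are allowed, so non-ASCII text still passes. */
int looks_like_text(const unsigned char *buf, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        unsigned char c = buf[i];

        /* C0 control characters, minus the ones common in text files */
        if (c < 0x20 && c != '\t' && c != '\n' && c != '\r' && c != '\f')
            return 0;

        /* DEL is also a control character */
        if (c == 0x7F)
            return 0;
    }
    return 1;
}
```

Reading into an unsigned char buffer instead of an unsigned int also means no shifting is needed to get at individual bytes.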
As others have said (albeit less bluntly) it's completely backwards to limit text to ASCII in 2010. As the probability of non-text binary data parsing as UTF-8 is extremely low, a much better approach would be to try parsing the whole file as UTF-8, and declaring it binary upon the first failure.
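For what it's worth, here is a rough sketch of the "try to parse it as UTF-8" idea for data that is already in memory. The name is_valid_utf8 is my own, and this is one possible validator, not taken from any particular library. It rejects bad lead bytes, missing continuation bytes, overlong encodings, surrogates, and code points above U+10FFFF:

```c
#include <stddef.h>

/* Hypothetical helper: returns 1 if buf[0..len) is well-formed UTF-8,
   0 on the first malformed sequence. */
int is_valid_utf8(const unsigned char *buf, size_t len)
{
    size_t i = 0;

    while (i < len) {
        unsigned char c = buf[i];
        size_t need;        /* continuation bytes expected after the lead byte */
        unsigned long min;  /* smallest code point allowed for this length */
        unsigned long cp;   /* decoded code point */

        if (c < 0x80) { i++; continue; }  /* plain ASCII */
        else if ((c & 0xE0) == 0xC0) { need = 1; min = 0x80;    cp = c & 0x1F; }
        else if ((c & 0xF0) == 0xE0) { need = 2; min = 0x800;   cp = c & 0x0F; }
        else if ((c & 0xF8) == 0xF0) { need = 3; min = 0x10000; cp = c & 0x07; }
        else return 0;  /* stray continuation byte or invalid lead byte */

        if (len - i - 1 < need)
            return 0;   /* sequence is cut off at the end of the buffer */

        for (size_t k = 1; k <= need; k++) {
            unsigned char cc = buf[i + k];
            if ((cc & 0xC0) != 0x80)
                return 0;  /* not a continuation byte */
            cp = (cp << 6) | (cc & 0x3F);
        }

        if (cp < min) return 0;                      /* overlong encoding */
        if (cp > 0x10FFFF) return 0;                 /* outside Unicode range */
        if (cp >= 0xD800 && cp <= 0xDFFF) return 0;  /* UTF-16 surrogate */

        i += need + 1;
    }
    return 1;
}
```

Keep in mind that NUL and the other ASCII control bytes are perfectly valid UTF-8, so a file full of zero bytes passes this test. In practice you would probably want to check for control characters as well.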
As others have also said, rather than calling fread or fgetc over and over again on tiny units, you should fread large chunks (1-4 KB) at a time into a fixed-size buffer and run your parser over that, reading a new chunk whenever you reach the end. (And if your UTF-8 parser is not easily restartable, it might make sense to memcpy the end of the buffer back to the beginning and refill whenever you have fewer than 4 bytes left in the buffer.)
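A sketch of that chunked loop, under a few assumptions of mine: the names seq_len and stream_is_utf8 and the 4096-byte CHUNK size are invented for the example, I use memmove rather than memcpy so overlap is never a concern, and the check is structural only (lead byte followed by the right number of continuation bytes, with no overlong or surrogate detection) to keep it short:

```c
#include <stdio.h>
#include <string.h>

#define CHUNK 4096  /* assumed buffer size; anything in the 1-4 KB range works */

/* Length of the UTF-8 sequence that starts with lead byte c, or 0 if c
   cannot start a sequence. */
static size_t seq_len(unsigned char c)
{
    if (c < 0x80) return 1;
    if ((c & 0xE0) == 0xC0) return 2;
    if ((c & 0xF0) == 0xE0) return 3;
    if ((c & 0xF8) == 0xF0) return 4;
    return 0;
}

/* Hypothetical helper: returns 1 if the stream is structurally UTF-8,
   0 if it is not, -1 on a read error. */
int stream_is_utf8(FILE *fp)
{
    unsigned char buf[CHUNK];
    size_t have = 0;  /* bytes in buf: carried-over tail plus fresh data */
    int eof = 0;

    while (!eof) {
        size_t got = fread(buf + have, 1, sizeof buf - have, fp);
        if (got == 0) {
            if (ferror(fp))
                return -1;
            eof = 1;
        }
        have += got;

        size_t i = 0;
        while (i < have) {
            size_t n = seq_len(buf[i]);
            if (n == 0)
                return 0;          /* invalid lead byte */
            if (have - i < n) {
                if (eof)
                    return 0;      /* sequence truncated at end of file */
                break;             /* sequence straddles the chunk boundary */
            }
            for (size_t k = 1; k < n; k++)
                if ((buf[i + k] & 0xC0) != 0x80)
                    return 0;      /* missing continuation byte */
            i += n;
        }

        /* Move the unparsed tail (at most 3 bytes) to the front and refill. */
        memmove(buf, buf + i, have - i);
        have -= i;
    }
    return 1;
}
```

You would call it on a stream opened in binary mode, e.g. fopen(path, "rb"), so that no newline translation happens on platforms that do it.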
Use fread() to grab a whole 1024-byte (or 512 or whatever works for you) buffer and then scan that buffer byte by byte looking for something with the eighth bit set. That's probably pretty close to what file(1) does, except file(1) has more complex patterns to consider and it probably doesn't bother with such a large buffer.
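Something along these lines, as a rough sketch (the function name has_high_bit is made up here). Reading into an unsigned char array means each byte can be tested directly, with no shifting of an unsigned int:

```c
#include <stdio.h>

/* Hypothetical helper: returns 1 if any byte in the stream has the eighth
   bit set (i.e. is >= 128), 0 if every byte is 7-bit, -1 on a read error. */
int has_high_bit(FILE *fp)
{
    unsigned char buf[1024];
    size_t n;

    /* fread returns the number of bytes actually read; 0 means EOF or error */
    while ((n = fread(buf, 1, sizeof buf, fp)) > 0) {
        for (size_t i = 0; i < n; i++)
            if (buf[i] & 0x80)
                return 1;
    }
    return ferror(fp) ? -1 : 0;
}
```

Open the file with fopen(path, "rb") before calling this. A return value of 0 only tells you that every byte is 7-bit; whether that is enough to call the file "text" is up to you.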
You could also grab the source for file and learn how it operates.