
How can I find out which special character is in a file?

My app needs to process text files during a batch process. Occasionally I receive a file with some special character at the end of the file. I am not sure what that special character is. Is there any way I can find out what that character is, so that I can tell the other team that is producing the file?

I have used Mozilla's library to guess the file encoding, and it says UTF-8.


First, whether the character is really "special" depends on what you call a "special character". As a side note, on Unix and OS X you can use commands such as od, file and hexdump to examine files easily:

... $  hexdump -C example.txt 
00000530  6f 77 73 20 61 63 74 69  6f 6e 2e 0a 0a 0a 0a     |ows action.....|

Now, if you know your file is encoded in UTF-8, every byte whose highest bit is zero corresponds to exactly one character (in the example above the last byte is '0a', so that single '0a' byte is one "character", a line feed).

A file in UTF-8 also means that every byte whose highest bit is set to 1 is part of a multi-byte character. For example, in the following byte sequence:

75 20 5b e2 80 a6 5d 20  61 75 74 6f 72 69 73 61

the only three bytes that have their highest bit set are "e2 80 a6" (all the values from 0x80 to 0xFF have their leftmost/highest bit set), and they are part of the same character: a non-ASCII character in UTF-8 is never made of just one byte with its highest bit set, so three adjacent such bytes surrounded by ASCII must belong together. The fact that every UTF-8 byte whose leftmost/highest bit is set is part of a multi-byte character is IMHO a truly beautiful feature of UTF-8.
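As a rough illustration of that rule, here is a minimal Python sketch that groups raw bytes into per-character sequences by looking only at the top bits. The function name split_utf8 is made up for this example, and the code assumes the input is already valid UTF-8 (it does no validation).

```python
def split_utf8(data: bytes) -> list[bytes]:
    """Group raw bytes into per-character UTF-8 sequences (no validation)."""
    groups: list[bytes] = []
    for b in data:
        # Continuation bytes have the bit pattern 10xxxxxx and extend
        # the sequence started by the previous lead byte.
        if b & 0xC0 == 0x80 and groups:
            groups[-1] += bytes([b])
        else:
            # ASCII byte (0xxxxxxx) or lead byte (11xxxxxx): new character.
            groups.append(bytes([b]))
    return groups


# The byte sequence from the example above.
sample = bytes.fromhex("75205be280a65d206175746f72697361")
for group in split_utf8(sample):
    kind = "ASCII" if group[0] < 0x80 else "multi-byte"
    print(group.hex(" "), kind)
```

Every group comes out as a single ASCII byte except for "e2 80 a6", which is reported as one multi-byte group.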

Now you search for "e2 80 a6" and find that it is the Unicode character named "horizontal ellipsis" (code point U+2026, which UTF-8 encodes as the bytes e2 80 a6).
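If Python is available, the same lookup can be done offline with the standard unicodedata module instead of a web search. This is only a small sketch: the helper name describe is invented here, and it assumes the bytes you pass in make up exactly one UTF-8 character.

```python
import unicodedata


def describe(raw: bytes) -> str:
    """Decode one UTF-8 character and return its code point and Unicode name."""
    ch = raw.decode("utf-8")
    # Control characters have no name, so fall back to a placeholder.
    name = unicodedata.name(ch, "<no name>")
    return f"U+{ord(ch):04X} {name}"


print(describe(bytes.fromhex("e280a6")))  # U+2026 HORIZONTAL ELLIPSIS
```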

So basically you have to do two things:

  • find which bytes make up that last "special" character (is it just one byte or several bytes?)

  • find which "special character" this byte (or these bytes) corresponds to
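Both steps can be sketched in Python for the situation in the question, where the odd character sits at the end of the file. The names read_tail and describe_tail, the 64-byte window and the default of five characters are all arbitrary choices for this example. Bytes that are not valid UTF-8 are decoded as U+FFFD, so for those the hex column shows the replacement character's bytes, not the original ones.

```python
import unicodedata


def read_tail(path: str, window: int = 64) -> bytes:
    """Return up to `window` trailing bytes, aligned to a character start."""
    with open(path, "rb") as f:
        f.seek(0, 2)                      # jump to the end of the file
        size = f.tell()
        offset = max(0, size - window)
        f.seek(offset)
        chunk = f.read()
    if offset > 0:
        # The window may start mid-character: skip continuation bytes (10xxxxxx).
        while chunk and chunk[0] & 0xC0 == 0x80:
            chunk = chunk[1:]
    return chunk


def describe_tail(chunk: bytes, count: int = 5) -> list[tuple[str, str, str]]:
    """Return (hex bytes, code point, name) for the last `count` characters."""
    text = chunk.decode("utf-8", errors="replace")
    return [
        (ch.encode("utf-8").hex(" "), f"U+{ord(ch):04X}",
         unicodedata.name(ch, "<no name>"))
        for ch in text[-count:]
    ]


# Usage (example.txt is a placeholder file name):
#     for row in describe_tail(read_tail("example.txt")):
#         print(*row)
```

The last row printed is the final character of the file, which is the one to report to the team producing it.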


Any hex editor ought to allow you to see each individual byte in a file. This ought to allow you to tell them what character it is.

Here's one I've used in the past: http://www.hexworkshop.com/


On Unix, you can use the od utility to output several representations of byte data in a file or stream.
