Checking all files are encoded as UTF-8

Does anyone know of a Windows app that can scan through a directory and check which scripts are/aren't encoded as a specified charset (UTF-8 in this case)? I could do it manually, but that could take a while and is quite error prone!


UTF-8 isn't a character set, it's an encoding for Unicode characters. And, since this is not programming related, I'm nudging it over to superuser.

If you do want to write a program for detecting those sequences, it's pretty easy:

Illegal UTF-8 initial sequences

UTF-8 Sequence       Reason for Illegality 
10xxxxxx             illegal as initial byte of character (80..BF) 
1100000x             illegal, overlong (C0 80..BF) 
11100000  100xxxxx   illegal, overlong (E0 80..9F) 
11110000  1000xxxx   illegal, overlong (F0 80..8F) 
11111000  10000xxx   illegal, overlong (F8 80..87) 
11111100  100000xx   illegal, overlong (FC 80..83) 
1111111x             illegal; prohibited by spec 

Then, provided the first octet is legal, remember that for a multi-octet sequence the total number of octets forming the code point equals the number of 1 bits before the first 0 bit in that first octet; a leading 0 bit means a single-octet (ASCII) character.

For example, 11110xxx is the start of a 4-octet sequence so you should skip ahead 4 octets once you've established its legality.

The other thing to do is ensure that all continuation octets start with 10.
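If you'd rather script it than eyeball files, here is a minimal sketch in Python 3 of that byte-level check (the helper names, the directory-walking wrapper, and the output format are my own; it applies the lead-byte rule above plus the overlong cases from the table, and per RFC 3629 it rejects 5- and 6-octet lead bytes outright):

    import os
    import sys

    def leading_ones(b):
        """Count the 1 bits before the first 0 bit, starting at the MSB."""
        n = 0
        for i in range(7, -1, -1):
            if b & (1 << i):
                n += 1
            else:
                break
        return n

    def is_valid_utf8(data):
        i = 0
        while i < len(data):
            lead = data[i]
            ones = leading_ones(lead)
            if ones == 0:                       # 0xxxxxxx: single-octet ASCII
                i += 1
                continue
            if ones == 1:                       # 10xxxxxx can't start a character
                return False
            if ones > 4:                        # F8..FF: 5/6-octet forms and FE/FF
                return False                    # (RFC 3629 limits UTF-8 to 4 octets)
            if lead in (0xC0, 0xC1):            # 1100000x: always overlong
                return False
            if i + ones > len(data):            # truncated sequence at end of file
                return False
            cont = data[i + 1:i + ones]
            if any(leading_ones(c) != 1 for c in cont):   # continuations must be 10xxxxxx
                return False
            if lead == 0xE0 and cont[0] < 0xA0:           # E0 80..9F: overlong
                return False
            if lead == 0xF0 and cont[0] < 0x90:           # F0 80..8F: overlong
                return False
            i += ones
        return True

    if __name__ == '__main__':
        root = sys.argv[1] if len(sys.argv) > 1 else '.'
        for dirpath, _, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                with open(path, 'rb') as f:
                    ok = is_valid_utf8(f.read())
                print(('OK       ' if ok else 'NOT-UTF8 ') + path)

Note that a fully strict validator would also reject the octet sequences that encode UTF-16 surrogates (lead byte ED followed by A0..BF), which the table above doesn't cover.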


Not sure if this is what you're looking for, but I use a command-shell for-loop to dump the first few bytes of each file with my hdump utility, which displays a file's bytes in hexadecimal. I then look for the leading 3-byte UTF-8 signature (Byte Order Mark) at the start of each file.

My hdump utility is available at: http://david.tribble.com/programs.html
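If you don't have hdump to hand, a rough Python 3 equivalent of that signature check (again a sketch of my own, not part of hdump) is:

    import os
    import sys

    BOM = b'\xef\xbb\xbf'   # the 3-byte UTF-8 signature / byte order mark

    root = sys.argv[1] if len(sys.argv) > 1 else '.'
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            with open(path, 'rb') as f:
                has_bom = f.read(3) == BOM
            print(('BOM    ' if has_bom else 'no-BOM ') + path)

Keep in mind that the signature is optional: plenty of valid UTF-8 files (especially ones produced by Unix tools) start without it, so the absence of a BOM doesn't prove a file isn't UTF-8.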
