Checking all files are encoded as UTF-8

Does anyone know of a Windows app that can scan through a directory and check which scripts are/aren't encoded as a specified charset (UTF-8 in this case)? I could do it manually, but that could take a while and is quite error prone!


UTF-8 isn't a character set, it's an encoding for Unicode characters. And, since this is not programming related, I'm nudging it over to superuser.

If you do want to write a program for detecting those sequences, it's pretty easy:

Illegal UTF-8 initial sequences

UTF-8 Sequence       Reason for Illegality 
10xxxxxx             illegal as initial byte of character (80..BF) 
1100000x             illegal, overlong (C0 80..BF) 
11100000  100xxxxx   illegal, overlong (E0 80..9F) 
11110000  1000xxxx   illegal, overlong (F0 80..8F) 
11111000  10000xxx   illegal, overlong (F8 80..87) 
11111100  100000xx   illegal, overlong (FC 80..83) 
1111111x             illegal; prohibited by spec 

Then, provided the first octet is legal, remember that for a multi-octet sequence the total number of octets forming the code point equals the number of 1 bits before the first 0 bit in that first octet; a leading 0 bit means a single-octet (ASCII) character.

For example, 11110xxx is the start of a 4-octet sequence so you should skip ahead 4 octets once you've established its legality.

The other thing to do is ensure that all continuation octets start with 10.
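If you'd rather script it than eyeball files, here is a minimal sketch in Python 3 of that byte-level check (the helper names, the directory-walking wrapper, and the output format are my own; it applies the lead-byte rule above plus the overlong cases from the table, and per RFC 3629 it rejects 5- and 6-octet lead bytes outright):

    import os
    import sys

    def leading_ones(b):
        """Count the 1 bits before the first 0 bit, starting at the MSB."""
        n = 0
        for i in range(7, -1, -1):
            if b & (1 << i):
                n += 1
            else:
                break
        return n

    def is_valid_utf8(data):
        i = 0
        while i < len(data):
            lead = data[i]
            ones = leading_ones(lead)
            if ones == 0:                       # 0xxxxxxx: single-octet ASCII
                i += 1
                continue
            if ones == 1:                       # 10xxxxxx can't start a character
                return False
            if ones > 4:                        # F8..FF: 5/6-octet forms and FE/FF
                return False                    # (RFC 3629 limits UTF-8 to 4 octets)
            if lead in (0xC0, 0xC1):            # 1100000x: always overlong
                return False
            if i + ones > len(data):            # truncated sequence at end of file
                return False
            cont = data[i + 1:i + ones]
            if any(leading_ones(c) != 1 for c in cont):   # continuations must be 10xxxxxx
                return False
            if lead == 0xE0 and cont[0] < 0xA0:           # E0 80..9F: overlong
                return False
            if lead == 0xF0 and cont[0] < 0x90:           # F0 80..8F: overlong
                return False
            i += ones
        return True

    if __name__ == '__main__':
        root = sys.argv[1] if len(sys.argv) > 1 else '.'
        for dirpath, _, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                with open(path, 'rb') as f:
                    ok = is_valid_utf8(f.read())
                print(('OK       ' if ok else 'NOT-UTF8 ') + path)

Note that a fully strict validator would also reject the octet sequences that encode UTF-16 surrogates (lead byte ED followed by A0..BF), which the table above doesn't cover.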


Not sure if this is what you're looking for, but I use a command-shell for-loop to dump the first few bytes of each file with my hdump utility, which displays a file's bytes in hexadecimal. I then look for the leading 3-byte UTF-8 signature (Byte Order Mark) at the start of each file.

My hdump utility is available at: http://david.tribble.com/programs.html
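If you don't have hdump to hand, a rough Python 3 equivalent of that signature check (again a sketch of my own, not part of hdump) is:

    import os
    import sys

    BOM = b'\xef\xbb\xbf'   # the 3-byte UTF-8 signature / byte order mark

    root = sys.argv[1] if len(sys.argv) > 1 else '.'
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            with open(path, 'rb') as f:
                has_bom = f.read(3) == BOM
            print(('BOM    ' if has_bom else 'no-BOM ') + path)

Keep in mind that the signature is optional: plenty of valid UTF-8 files (especially ones produced by Unix tools) start without it, so the absence of a BOM doesn't prove a file isn't UTF-8.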
