开发者

Determine if a text file without BOM is UTF8 or ASCII

Long story short:

+ I'm using ffmpeg to check the artist name of a MP3 file.

+ If the artist has asian characters in its name the output is UTF8.

+ If it just has ASCII characters the output is ASCII.

The output does not use any BOM indication at the beginning.

The problem is if the artist has for example a "ä" in the name it is ASCII, just not US-ASCII so "ä" is not valid UTF8 and is skipped.

How can I tell whether or not the output text file from ffmpeg is UTF8 or not? T开发者_运维技巧he application does not have any switches and I just think it's plain dumb not to always go with UTF8. :/

Something like this would be perfect:

http://linux.die.net/man/1/isutf8

If anyone knows of a Windows version?

Thanks a lot in before hand guys!


This program/source might help you:

  • Detect Encoding for In- and Outgoing

Detect the encoding of a text without BOM (Byte Order Mask) and choose the best Encoding ...


You say, "ä" is not valid UTF-8 ... This is not correct...
It seems you don't have a clear understanding of what UTF-8 is. UTF-8 is a system of how to encode Unicode Codepoints. The issue of validity is notin the character itself, it is a question of how has it been encoded...
There are many systems which can encode Unicode Codepoints; UTF-8 is one and UTF16 is another... "ä" is quite legal in the UTF-8 system.. Actually all characters are valid, so long as that character has a Unicode Codepoint.

However, ASCII has only 128 valid values, which equate identically to the first 128 characters in the Unicode Codepoint system. Unicode itself is nothing more that a big look-up table. What does the work is teh encoding system; eg. UTF-8.

Because the 128 ASCII characters are identical to the first 128 Unicode characters, and because UTF-8 can represent these 128 values is a single byte, just as ASCII does, this means that the data in an ASCII file is identical to a file with the same date but which you call a UTF-8 file. Simply put: ASCII is a subset of UTF-8... they are indistinguishable for data in the ASCII range (ie, 128 characters).

You can check a file for 7-bit ASCII compliance..

# If nothing is output to stdout, the file is 7-bit ASCII compliant 
# Output lines containing ERROR chars -- to stdout

  perl -l -ne '/^[\x00-\x7F]*$/ or print' "$1"

Here is a similar check for UTF-8 compliance..

perl -l -ne '/
   ^( ([\x00-\x7F])              # 1-byte pattern
     |([\xC2-\xDF][\x80-\xBF])   # 2-byte pattern
     |((([\xE0][\xA0-\xBF])|([\xED][\x80-\x9F])|([\xE1-\xEC\xEE-\xEF][\x80-\xBF]))([\x80-\xBF])) # 3-byte pattern
     |((([\xF0][\x90-\xBF])|([\xF1-\xF3][\x80-\xBF])|([\xF4][\x80-\x8F]))([\x80-\xBF]{2}))       # 4-byte pattern
    )*$ /x or print' "$1"
0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜