
Windows Text Encoding Question

I'm trying to read metadata from a music (m4a) file. I have successfully figured out how to navigate around the file to get to the metadata. Documentation on the file format is hard to come by, but what I have found claims that the metadata is encoded as UTF-8.

Here's the problem I have been pulling my hair out over. I'm using Visual Basic 2008 to access and read data from the file. I access the file using the BinaryStreamReader methods, but I cannot find an encoding setting that handles both the metadata tags AND the metadata itself. The following is a hex string of a sample of the data I'm working with.

00 00 00 21 A9 6E 61 6D 00 00 00 19 64 61 74 61 00 00 00 01 00 00 00 47 6C C3 B3 73 C3 B3 6C 69

The last 9 bytes are the name of a track called Glósóli, so definitely UTF-8. If I set the encoding to UTF-8 I can retrieve and display this value correctly. However, the 4-character metadata tag name A9 6E 61 6D is retrieved as "square box"nam instead of ©nam. If I change the encoding to Windows-1252 I get ©nam correctly, but the track name is gibberish! Can you please explain why the UTF-8 encoding is not recognizing the 0xA9 byte correctly?

I have also noticed that looking at the same two strings, ©nam and Glósóli, in Notepad++ produces similar results. If the Format is set to Encode in UTF-8, the © character is not displayed. If the Format is set to ANSI, it is displayed but the track name is incorrect. I cannot find any setting that displays the desired result.

I'm sure the answer is obvious but I'm not seeing it. Any help or explanation would be greatly appreciated.

I'm running Windows XP with all the latest patches

Mike


The problem is that A9 doesn't encode a UTF-8 character. Unicode codepoints are not the same as the encoded values; U+00A9 is encoded in UTF-8 as C2 A9. (UTF-8 uses the high bit of bytes to indicate multibyte characters, with additional bits indicating the number of following bytes within the character; this allows a program to always be able to find the start of a valid character even if it's given a pointer into the middle of a multibyte character, which is part of how UTF-8 retains compatibility with older programs that don't understand Unicode.)
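A small sketch of the point above, written in Python purely for illustration (the question uses Visual Basic, where System.Text.Encoding would play the same role). It shows that U+00A9 encodes to the two bytes C2 A9, and that a lone A9 is rejected by a strict UTF-8 decoder or turned into the replacement character U+FFFD by a lenient one, which is the likely source of the "square box".

```python
# U+00A9 (the copyright sign) as a Unicode code point.
copyright_sign = "\u00a9"

# Its UTF-8 encoding is two bytes, C2 A9, not the single byte A9.
encoded = copyright_sign.encode("utf-8")
print(encoded.hex(" "))

# A lone A9 byte is not valid UTF-8, so a strict decode raises an error.
lone = bytes([0xA9])
try:
    lone.decode("utf-8")
    lone_is_valid = True
except UnicodeDecodeError:
    lone_is_valid = False
print(lone_is_valid)

# A lenient decode substitutes U+FFFD for the bad byte.
tag_bytes = bytes.fromhex("A9 6E 61 6D")
replaced = tag_bytes.decode("utf-8", errors="replace")
print(repr(replaced))
```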

Decoding the .m4a file will require decoding each field independently; you will need to use an ISO 8859-1 codec on the tag names and the appropriate codec (which for strings will often, but not always, be UTF-8) for tag values.
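As a rough sketch of decoding each field with its own codec, again in Python for illustration only: the byte strings are copied from the hex sample in the question, and the variable names are made up for this example. It does not parse the atom structure itself, it only decodes the two fields once they have been located.

```python
# Bytes taken from the sample in the question.
tag_bytes = bytes.fromhex("A9 6E 61 6D")                    # tag name
value_bytes = bytes.fromhex("47 6C C3 B3 73 C3 B3 6C 69")   # track name

# Tag names: one byte per character, so ISO 8859-1 maps A9 to U+00A9.
tag = tag_bytes.decode("iso-8859-1")

# Tag value: this one is UTF-8 (assumed here; other value types may differ).
title = value_bytes.decode("utf-8")

# Decoding the value with Windows-1252 instead yields the "gibberish"
# described in the question, because C3 and B3 become separate characters.
wrong_title = value_bytes.decode("cp1252")

print(tag, title, wrong_title)
```

In .NET the equivalent would be to read the raw bytes and pass each slice to a different Encoding instance, rather than giving one encoding to the reader for the whole file.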

(By the way, the fact that U+00A9 encodes to UTF-8 with its second byte as A9 is more or less accidental. The first two bits of that second byte are part of the UTF-8 encoding: 10 marks a continuation byte inside a multibyte sequence, and the remaining six bits hold the low six bits of the code point. The 2 in C2 actually carries the top two bits of the original A9.)
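The bit arithmetic can be checked by hand. The following Python sketch builds the two-byte UTF-8 form of a code point in the range U+0080 to U+07FF (the only range this particular formula covers) and compares it with the built-in encoder.

```python
code_point = 0xA9  # binary 10 101001

# Lead byte: the prefix 110 followed by the top bits of the code point.
lead = 0b11000000 | (code_point >> 6)           # 110 00010 -> C2

# Continuation byte: the prefix 10 followed by the low six bits.
cont = 0b10000000 | (code_point & 0b00111111)   # 10 101001 -> A9

manual = bytes([lead, cont])
print(manual.hex(" "))

# The built-in encoder produces the same two bytes.
builtin = chr(code_point).encode("utf-8")
```

The low six bits of A9 are 101001, and prefixing them with 10 happens to give A9 back, which is why the second byte looks unchanged.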

BTW, see the .NET documentation for System.Text.UTF8Encoding; by following the class hierarchy diagram there you can get to the other .NET codecs.


A9 on its own, or as in this case surrounded by low bytes (i.e. in the range 00-7F), cannot be part of a UTF-8 sequence. Take a look at the Wikipedia entry for UTF-8, for example, and you'll see that high bytes (80-FF) only ever occur as part of a multi-byte UTF-8 sequence.
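To make that concrete, here is a small Python sketch that labels each byte by the role it can play in UTF-8. The function name classify_utf8_byte is invented for this illustration; the byte ranges follow the standard UTF-8 layout.

```python
def classify_utf8_byte(b: int) -> str:
    """Return the role a single byte value can play in a UTF-8 stream."""
    if b <= 0x7F:
        return "ascii"          # 0xxxxxxx: a complete one-byte character
    if b <= 0xBF:
        return "continuation"   # 10xxxxxx: only valid after a lead byte
    if b in (0xC0, 0xC1) or b >= 0xF5:
        return "invalid"        # never appear in well-formed UTF-8
    if b <= 0xDF:
        return "lead-2"         # 110xxxxx: starts a two-byte sequence
    if b <= 0xEF:
        return "lead-3"         # 1110xxxx: starts a three-byte sequence
    return "lead-4"             # 11110xxx: starts a four-byte sequence

tag_roles = [classify_utf8_byte(b) for b in bytes.fromhex("A9 6E 61 6D")]
name_roles = [classify_utf8_byte(b)
              for b in bytes.fromhex("47 6C C3 B3 73 C3 B3 6C 69")]
print(tag_roles)
print(name_roles)
```

In the tag name, A9 is a continuation byte with no lead byte before it, so the sequence is malformed. In the track name, every continuation byte (B3) directly follows a two-byte lead (C3), so it decodes cleanly.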

So - some of the data in your file is other non-UTF-8 stuff - possibly meta-data.
