Windows Text Encoding Question

2023-03-13 14:17 Asked by:

I'm trying to read meta data from a music (m4a) file. I have successfully figured out how to navigate around the file to get to the meta data. Documentation on the file format is hard to come by but what I have found claims that the encoding of the meta data is UTF-8.

Here’s the problem I have been pulling my hair out over. I’m using Visual Basic 2008 to access and read data from the file. I access the file using the BinaryStreamReader methods, but I cannot find an encoding setting that handles both the meta data tags AND the meta data itself. Below is a hex dump of a sample of the data I’m working with.

00 00 00 21 A9 6E 61 6D 00 00 00 19 64 61 74 61 00 00 00 01 00 00 00 47 6C C3 B3 73 C3 B3 6C 69

The last 9 bytes are the name of a track called Glósóli, so that part is definitely UTF-8. If I set the encoding to UTF-8 I can retrieve and display this value correctly. However, the 4 character meta tag name A9 6E 61 6D is retrieved as “square box”nam instead of ©nam. If I change the encoding to Windows-1252 I get ©nam correctly, but the track name is gibberish!

Can you please explain to me why the UTF-8 encoding is not recognizing the 0xA9 byte correctly? I have also noticed that looking at the same two strings, ©nam and Glósóli, in Notepad++ produces similar results. If the Format is set to Encode in UTF-8, the © character is not displayed. If the Format is set to ANSI it is, but the track name is incorrect. I cannot find any setting that displays the desired result. I’m sure the answer is obvious but I’m not seeing it. Any help or explanation would be greatly appreciated.

I'm running Windows XP with all the latest patches

Mike

The problem is that a lone A9 byte is not a valid UTF-8 character. Unicode code points are not the same as their encoded values; U+00A9 is encoded in UTF-8 as C2 A9. (UTF-8 uses the high bit of a byte to mark it as part of a multibyte character, with additional bits in the first byte indicating how many bytes follow within the character; this lets a program always find the start of a valid character even if it's given a pointer into the middle of a multibyte character, which is part of how UTF-8 retains compatibility with older programs that don't understand Unicode.)
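To make the difference between the code point and its encoded bytes concrete, here is a minimal sketch in Python (not the asker's Visual Basic; the helper name decodes_as_utf8 is made up for illustration). It only uses the built-in str.encode and bytes.decode:

```python
# U+00A9 (the copyright sign) as a code point versus its encoded bytes.
copyright_sign = "\u00a9"
utf8_bytes = copyright_sign.encode("utf-8")      # two bytes: C2 A9
latin1_bytes = copyright_sign.encode("latin-1")  # one byte: A9


def decodes_as_utf8(data: bytes) -> bool:
    """Hypothetical helper: True if data is a valid UTF-8 byte sequence."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# The four tag-name bytes from the question.
tag_name = bytes.fromhex("A9 6E 61 6D")

print(utf8_bytes.hex())           # hex of the UTF-8 form
print(decodes_as_utf8(tag_name))  # a lone A9 is rejected
# With errors="replace" the bad byte becomes U+FFFD, which many fonts
# draw as a box or question mark: the "square box" in front of "nam".
print(tag_name.decode("utf-8", errors="replace"))
```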

Decoding the .m4a file will require decoding each field independently: you will need to use an ISO 8859-1 codec on the tag names and the appropriate codec (which for strings will often, but not always, be UTF-8) for the tag values.
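As a rough sketch of per-field decoding, again in Python rather than Visual Basic, the bytes below are copied from the hex sample in the question. It assumes the tag name and the value have already been located and sliced out of the file; the variable names are illustrative only:

```python
# Bytes taken from the sample in the question.
tag_name_bytes = bytes.fromhex("A9 6E 61 6D")
value_bytes = bytes.fromhex("47 6C C3 B3 73 C3 B3 6C 69")

# Decode each field with its own codec instead of one codec for the file.
tag_name = tag_name_bytes.decode("latin-1")  # ISO 8859-1 for the tag name
value = value_bytes.decode("utf-8")          # UTF-8 for this string value

print(tag_name, value)

# For comparison: one single-byte codec for everything turns each
# two-byte UTF-8 sequence in the value into two separate characters.
mangled = value_bytes.decode("cp1252")
print(mangled)
```

In .NET the same idea would presumably be two separate System.Text.Encoding instances applied to two separate byte arrays, rather than a single encoding set on the reader.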

(By the way, the fact that U+00A9 encodes to UTF-8 with its second byte as A9 is more or less accidental. The first two bits of that byte are part of the UTF-8 encoding: 10 marks a continuation byte inside a multibyte sequence, and only the low six bits come from the code point. The 2 in C2 actually represents the top two bits of the original A9.)
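The bit shuffling can be checked by hand. This is a small Python sketch of the two-byte UTF-8 form only (code points U+0080 to U+07FF); encode_two_byte is a made-up name for illustration, not a library function:

```python
def encode_two_byte(code_point: int) -> bytes:
    """Hand-rolled two-byte UTF-8 encoding: 110xxxxx 10xxxxxx."""
    if not 0x80 <= code_point <= 0x7FF:
        raise ValueError("only code points U+0080..U+07FF use two bytes")
    lead = 0b11000000 | (code_point >> 6)            # 110 + top five bits
    continuation = 0b10000000 | (code_point & 0x3F)  # 10 + low six bits
    return bytes([lead, continuation])


# U+00A9 is 1010 1001 in binary: the top two bits (10) end up in the lead
# byte as the "2" of C2, and the low six bits (101001) go after the 10
# marker, which happens to rebuild A9.
copyright_bytes = encode_two_byte(0xA9)

# U+00F3 (the o with acute accent in the track name) follows the same
# pattern and gives the C3 B3 pair seen in the hex sample.
o_acute_bytes = encode_two_byte(0xF3)

print(copyright_bytes.hex(), o_acute_bytes.hex())
```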

BTW, here is the .NET documentation for System.Text.UTF8Encoding; by following the class hierarchy diagram you can get to other .NET codecs.

A9 on its own, or as in this case surrounded by low bytes (i.e. in the range 00-7F), cannot be part of a valid UTF-8 sequence. Take a look at the Wikipedia entry on UTF-8, for example, and you'll see that high bytes (80-FF) only ever occur as part of a multi-byte UTF-8 sequence.

So - some of the data in your file is other non-UTF-8 stuff - possibly meta-data.
