开发者

What happens to a null byte when converting bytes to ISO 8859-1 encoding?

I'm not entirely sure if the question even makes sense. I'm converting a byte array taken from an ID3 tag and converting it to a string. Most text frames in an ID3 tag use ISO 8859-1 encoding but it depends on the frame. In any case, if you look up what 0x00 is in the ISO 8859-1 codes it is invalid.

To further complicate, either due programmer error or just poor formatting, some of the strings end in 0x00 and some do not.

When converting a series of bytes into a string using ISO 8859-1 encoding do you have manually check the end of the string to see if it is a null? Or will the encoding object through whatever method it uses to convert in the first place deal with the null properly? Furthermore, is there some sort of function that could normalize or "fix" the null terminated string?

When you try to display these strings they do not display properly.

I am using C# for this particular project. Some extra info here about ID3 Tags: ID3 Specs

O开发者_如何学Gor am I completely misunderstanding the whole thing? Is a null terminator simply a way a particular language handles strings and it has nothing to do with encoding?

  • Edit: I used System.Text.Encoding.GetEncoding("iso-8859-1") followed by a GetString call


If you use Encoding.GetEncoding(28591), it just converts a byte 0 to the Unicode U+0000. Encodings generally assume that they have to convert all the bytes - they don't look for terminators.

This treatment of 0 as Unicode 0 is inline with the Wikipedia description:

In 1992, the IANA registered the character map ISO_8859-1:1987, more commonly known by its preferred MIME name of ISO-8859-1 (note the extra hyphen over ISO 8859-1), a superset of ISO 8859-1, for use on the Internet. This map assigns the C0 and C1 control characters to the unassigned code values thus provides for 256 characters via every possible 8-bit value.

The C0 and C1 control characters page includes:

0: Originally used to allow gaps to be left on paper tape for edits. Later used for padding after a code that might take a terminal some time to process (e.g. a carriage return or line feed on a printing terminal). Now often used as a string terminator, especially in the C programming language.

Sample code:

using System;
using System.Text;

class Program
{
    static void Main(string[] args)
    {
        byte[] data = { 0, 0 };
        Encoding latin1 = Encoding.GetEncoding(28591);

        string text = latin1.GetString(data);
        Console.WriteLine(text.Length); // 2
        Console.WriteLine((int) text[0]); // 0
        Console.WriteLine((int) text[1]); // 0
    }
}


Happily, ASCII, ISO-8859-1 and Unicode all agree on codepoints in the range 0..127. Thus your character '\0' will be encoded identically in ASCII, ISO-8859-1 and UTF-8.

If your program assigns special semantics to the zero byte, you have to take care of that appropriately.

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜