Convert ISO/Windows charsets to UTF-8 in JavaScript

I'm developing a Firefox plugin that fetches web pages to do some analysis for the user. The problem is that when I fetch (with XMLHttpRequest) pages that are not UTF-8 encoded, the string I get is garbled. For example, Hebrew pages in windows-1255 or Chinese pages in gb2312.

I already tried the following:

var uDecoder=Components.classes["@mozilla.org/intl/scriptableunicodeconverter"].getService(Components.interfaces.nsIScriptableUnicodeConverter);
uDecoder.charset="windows-1255";
alert( xhr.responseText );

var decoder=Components.classes["@mozilla.org/intl/utf8converterservice;1"].getService(Components.interfaces.nsIUTF8ConverterService);

alert(decoder.convertStringToUTF8(xhr.responseText,"WINDOWS-1255",true)); 

I also tried escape/unescape/encodeURIComponent.

Any ideas?


Once XMLHttpRequest has tried to decode a non-UTF-8 string using UTF-8, you've already lost. The byte sequences in the page that weren't valid UTF-8 sequences will have been mangled (typically converted to �, the U+FFFD replacement character). No amount of re-encoding/decoding will get them back.
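
The loss is easy to reproduce outside XMLHttpRequest. The sketch below is only an illustration, assuming a runtime with a global TextDecoder (modern browsers, Node.js). The byte values are made-up samples and decodeAsUtf8 is a hypothetical helper name. Two different non-UTF-8 byte sequences decode to the same run of replacement characters, so nothing can tell them apart afterwards.

```javascript
// Hypothetical helper: decode raw bytes as UTF-8, as happens to a text
// response when no other charset is known.
function decodeAsUtf8(bytes) {
  return new TextDecoder('utf-8').decode(new Uint8Array(bytes));
}

// Two different byte pairs, neither of which is valid UTF-8.
var first = decodeAsUtf8([0xF9, 0xEC]);
var second = decodeAsUtf8([0xFA, 0xEB]);

// Both decode to U+FFFD U+FFFD, so the original bytes cannot be recovered.
var sameResult = (first === second);
var allReplaced = (first === '\uFFFD\uFFFD');
```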

Pages that specify a Content-Type: text/html;charset=something HTTP header should be OK. Pages that don't have a real HTTP header but do have a <meta> version of it won't be, because XMLHttpRequest doesn't know about parsing HTML so it won't see the meta. If you know in advance the charset you want, you can tell XMLHttpRequest and it'll use it:

xhr.open(...);
xhr.overrideMimeType('text/html;charset=gb2312');
xhr.send();

(This was a non-standardised Mozilla extension when this answer was written; overrideMimeType has since become part of the XMLHttpRequest specification.)

If you don't know the charset in advance, you can request the page once, search the start of the response for a <meta> charset declaration, parse the charset name out of it and request the page again with that charset.
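
The two-request approach could look roughly like the sketch below. It is a minimal illustration, not a tested extension: sniffMetaCharset and fetchWithSniffedCharset are hypothetical names, the regular expression is a rough stand-in for real HTML parsing, and it assumes the page's encoding is ASCII-compatible so that the <meta> tag itself survives the first, wrongly decoded request.

```javascript
// Hypothetical helper: find a charset declared in a <meta> tag.
// Handles both <meta charset="..."> and the http-equiv Content-Type form.
function sniffMetaCharset(html) {
  // Only scan the start of the document, where <head> should be.
  var head = html.slice(0, 4096);
  var metaTags = head.match(/<meta\b[^>]*>/gi) || [];
  for (var i = 0; i < metaTags.length; i++) {
    var m = /charset\s*=\s*["']?\s*([\w.:-]+)/i.exec(metaTags[i]);
    if (m) return m[1].toLowerCase();
  }
  return null;
}

// Hypothetical helper: fetch once, sniff the charset, then fetch again
// with overrideMimeType so the second response is decoded correctly.
function fetchWithSniffedCharset(url, callback) {
  var first = new XMLHttpRequest();
  first.open('GET', url, true);
  first.onload = function () {
    var charset = sniffMetaCharset(first.responseText);
    if (!charset) {
      // No <meta> charset found: hand back the first response unchanged.
      callback(first.responseText);
      return;
    }
    var second = new XMLHttpRequest();
    second.open('GET', url, true);
    second.overrideMimeType('text/html;charset=' + charset);
    second.onload = function () { callback(second.responseText); };
    second.send();
  };
  first.send();
}
```

Note that this sketch overrides the charset even when the server sent a real Content-Type charset header; a more careful version would check that header first.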

In theory you could get a binary response in a single request:

xhr.overrideMimeType('text/html;charset=iso-8859-1');

and then convert that from bytes-as-chars to UTF-8. However, iso-8859-1 wouldn't work for this because the browser interprets that charset as really being Windows code page 1252.

You could maybe use another codepage that maps every byte to a character, and do a load of tedious character replacements to map every character in that codepage to the character it would have been in real-ISO-8859-1, then do the conversion. Most encodings don't map every byte, but Arabic (cp1256) might be a candidate for this?
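
A different route from the cp1256 remapping suggested above is the widely used x-user-defined trick. The sketch assumes that the browser decodes charset=x-user-defined by leaving ASCII alone and mapping bytes 0x80-0xFF to U+F780-U+F7FF, so the low 8 bits of each character are the original byte. bytesFromUserDefined and decodeBytes are hypothetical names, and the decoding step assumes a global TextDecoder that knows the target charset, which an older extension environment may not have.

```javascript
// Turn a response read with charset=x-user-defined back into raw bytes.
// Assumption: bytes 0x80-0xFF arrive as U+F780-U+F7FF and ASCII is
// unchanged, so masking with 0xFF yields the original byte.
function bytesFromUserDefined(text) {
  var bytes = new Uint8Array(text.length);
  for (var i = 0; i < text.length; i++) {
    bytes[i] = text.charCodeAt(i) & 0xFF;
  }
  return bytes;
}

// Decode the recovered bytes with the page's real charset.
// Assumption: TextDecoder exists and supports the requested label.
function decodeBytes(bytes, charset) {
  return new TextDecoder(charset).decode(bytes);
}

// Intended use in the browser (not executed here):
//   xhr.open('GET', url, true);
//   xhr.overrideMimeType('text/plain; charset=x-user-defined');
//   xhr.onload = function () {
//     var raw = bytesFromUserDefined(xhr.responseText);
//     var text = decodeBytes(raw, 'windows-1255');
//   };
//   xhr.send();
```

Where TextDecoder is not available, the nsIScriptableUnicodeConverter from the question could presumably do the final decoding instead, since the recovered bytes are no longer damaged.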
