
HTML DOM parsing and character encoding with XMLHttpRequest in a Firefox extension

I am currently writing a Firefox 4 bootstrapped extension.


Here is my situation:

When I use @mozilla.org/xmlextras/xmlhttprequest;1 (nsIXMLHttpRequest), the content of the target URL can be successfully loaded via req.responseText.

I then parse the responseText into a DOM by creating a BODY element with createElement and assigning the text to its innerHTML property.

Everything seems to work.

However, there is a problem with character encoding (charset).

Since I need the extension to detect the charset of the target documents itself, overriding the MIME type of the request with text/html; charset=blahblah... does not meet my need.

I've tried @mozilla.org/intl/utf8converterservice;1 (nsIUTF8ConverterService), but XMLHttpRequest does not seem to expose a ScriptableInputStream, or indeed any InputStream or readable stream at all.

I have no idea how to read a target document's content in the correct, automatically detected charset, whether by something like the browser GUI's Auto-Detect Character Encoding function or by the charset read from the head meta tag of the document.


EDIT: Would it be practical to parse the whole document, including the HTML, HEAD and BODY tags, into a DOM object, but without loading external resources such as JS, CSS and ICO files?

EDIT: The method in the MDC article titled "HTML to DOM", which uses @mozilla.org/feed-unescapehtml;1 (nsIScriptableUnescapeHTML), is inappropriate: it parses with many errors and mistakes, the baseURI cannot be set for type text/html, and all HREF attributes on A elements are lost when they contain relative paths.

EDIT#2: It would still be nice if there were any method that could convert the incoming responseText into readable UTF-8 strings. :-)


Any ideas or existing work that solve this encoding problem are appreciated. :-)

PS. The target documents are arbitrary, so there is no specific (or pre-known) charset, and certainly not only the UTF-8 that is already defined as the default.


SUPP:

So far, I have two main ideas for solving this problem.

Could anybody help me work out the names of the relevant XPCOM modules and methods?


Specify the charset while parsing the content into a DOM.

We first need to know the charset of the document (by extracting the head meta tag, or the HTTP header). Then:

  • find a method that can specify the charset when parsing the body content.
  • find a method that can parse both head and body.

Convert the incoming responseText into UTF-8, so that parsing into a DOM element with the default UTF-8 charset still works.

X — not practical or sensible: overriding the MIME type with a charset is an implementation of this idea, but we cannot know the charset before initiating the request.


It seems that there are no other answers.

After a day of testing, I've found a way (although a clumsy one) to solve my problem:

xhr.overrideMimeType('text/plain; charset=x-user-defined');, where xhr stands for the XMLHttpRequest handle.

This forces Firefox to treat the response as plain text in a user-defined character set: Firefox does not parse it, and the bytes pass through unprocessed.

See the MDC document: Using_XMLHttpRequest#Receiving_binary_data
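To make the byte-preserving effect of charset=x-user-defined concrete: each character of responseText then carries one raw byte in its low 8 bits, which can be decoded with the real charset once it is known. The sketch below illustrates only the principle; it uses the modern TextDecoder API, which did not exist in the Firefox 4 era (where nsIScriptableUnicodeConverter, shown further down, fills this role).

```javascript
// With overrideMimeType('text/plain; charset=x-user-defined'),
// each char of responseText holds one raw byte in its low 8 bits.
// Once the real charset is detected, the original text can be recovered.
function decodeBinaryString(binaryString, charset) {
  const bytes = new Uint8Array(binaryString.length);
  for (let i = 0; i < binaryString.length; i++) {
    bytes[i] = binaryString.charCodeAt(i) & 0xff; // keep the low byte only
  }
  return new TextDecoder(charset).decode(bytes);
}

// "é" encoded as UTF-8 arrives as the two bytes 0xC3 0xA9:
// decodeBinaryString("\xC3\xA9", "utf-8") === "é"
```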

Then use the scriptable Unicode converter: @mozilla.org/intl/scriptableunicodeconverter (nsIScriptableUnicodeConverter).

The charset variable can be extracted from the head meta tags, for instance with a regexp over req.responseText (even though it was decoded with an unknown charset), or by some other method.
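A rough sketch of that extraction step (the regular expressions here are simplifications I am assuming, not an exhaustive parser — real documents can declare the charset via the HTTP Content-Type header, an http-equiv meta tag, or the HTML5 meta charset form):

```javascript
// Guess the document charset: prefer the HTTP header, then a <meta>
// tag found in the (possibly mis-decoded) markup, then a default.
function sniffCharset(contentTypeHeader, markup) {
  const fromHeader = /charset=["']?([\w-]+)/i.exec(contentTypeHeader || "");
  if (fromHeader) return fromHeader[1];
  const fromMeta = /<meta[^>]+charset=["']?([\w-]+)/i.exec(markup || "");
  if (fromMeta) return fromMeta[1];
  return "UTF-8"; // fall back to the assumed default
}
```

Note this handles both `<meta http-equiv="Content-Type" content="...; charset=...">` and `<meta charset="...">` forms, but only approximately.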

var unicodeConverter = Components.classes["@mozilla.org/intl/scriptableunicodeconverter"]
    .createInstance(Components.interfaces.nsIScriptableUnicodeConverter);
unicodeConverter.charset = charset; // the charset detected above
var str = unicodeConverter.ConvertToUnicode(xhr.responseText); // raw bytes in, Unicode string out

A proper Unicode string is finally produced. :-)

Then I simply parse it into a body element, and my need is met.

Other brilliant ideas are still welcome. Feel free to object to my answer with sufficient reason. :-)
