
HTML DOM parsing and character encoding with XMLHttpRequest in a Firefox extension

I am currently writing a Firefox 4 bootstrapped extension.


Here is my situation:

When I use @mozilla.org/xmlextras/xmlhttprequest;1 (nsIXMLHttpRequest), the content of the target URL can be successfully loaded via req.responseText.

I then parse the responseText into a DOM by creating a BODY element with createElement and assigning the text to its innerHTML property.

Everything seems to work.

However, there is a problem with character encoding (charset).

Since I need the extension to detect the charset of the target documents itself, overriding the MIME type of the request with text/html; charset=blahblah... does not meet my need.

I've tried @mozilla.org/intl/utf8converterservice;1 (nsIUTF8ConverterService), but XMLHttpRequest does not seem to expose a ScriptableInputStream, or indeed any InputStream or readable stream at all.

I have no idea how to read a target document's content in the correct, automatically detected charset, whether by something like the browser GUI's Auto-Detect Character Encoding function or by the charset read from the head meta tag of the document.


EDIT: Would it be practical to parse the whole document, including the HTML, HEAD and BODY tags, into a DOM object, but without loading external resources such as JS, CSS and ICO files?

EDIT: The method in the MDC article titled "HTML to DOM", which uses @mozilla.org/feed-unescapehtml;1 (nsIScriptableUnescapeHTML), is inappropriate: it parses with many errors and mistakes, the baseURI cannot be set for type text/html, and all HREF attributes on A elements are lost when they contain relative paths.

EDIT#2: It would still be nice if there were any method that could convert the incoming responseText into readable UTF-8 strings. :-)


Any ideas or existing work that solve this encoding problem are appreciated. :-)

PS. The target documents are arbitrary, so there is no specific (or pre-known) charset, and certainly not only the UTF-8 that is already defined as the default.


SUPP:

So far, I have two main ideas for solving this problem.

Could anybody help me work out the names of the relevant XPCOM modules and methods?


Specify the charset while parsing the content into a DOM.

We first need to know the charset of the document (by extracting the head meta tag, or the HTTP header). Then:

  • find a method that can specify the charset when parsing the body content.
  • find a method that can parse both head and body.

Convert the incoming responseText into UTF-8, so that parsing into a DOM element with the default UTF-8 charset still works.

X — not practical or sensible: overriding the MIME type with a charset is an implementation of this idea, but we cannot know the charset before initiating the request.


It seems that there are no other answers.

After a day of testing, I've found a way (although a clumsy one) to solve my problem:

xhr.overrideMimeType('text/plain; charset=x-user-defined');, where xhr stands for the XMLHttpRequest handle.

This forces Firefox to treat the response as plain text in a user-defined character set: Firefox does not parse it, and the bytes pass through unprocessed.

See the MDC document: Using_XMLHttpRequest#Receiving_binary_data
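To make the byte-preserving effect of charset=x-user-defined concrete: each character of responseText then carries one raw byte in its low 8 bits, which can be decoded with the real charset once it is known. The sketch below illustrates only the principle; it uses the modern TextDecoder API, which did not exist in the Firefox 4 era (where nsIScriptableUnicodeConverter, shown further down, fills this role).

```javascript
// With overrideMimeType('text/plain; charset=x-user-defined'),
// each char of responseText holds one raw byte in its low 8 bits.
// Once the real charset is detected, the original text can be recovered.
function decodeBinaryString(binaryString, charset) {
  const bytes = new Uint8Array(binaryString.length);
  for (let i = 0; i < binaryString.length; i++) {
    bytes[i] = binaryString.charCodeAt(i) & 0xff; // keep the low byte only
  }
  return new TextDecoder(charset).decode(bytes);
}

// "é" encoded as UTF-8 arrives as the two bytes 0xC3 0xA9:
// decodeBinaryString("\xC3\xA9", "utf-8") === "é"
```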

Then use the scriptable Unicode converter: @mozilla.org/intl/scriptableunicodeconverter (nsIScriptableUnicodeConverter).

The charset variable can be extracted from the head meta tags, for instance with a regexp over req.responseText (even though it was decoded with an unknown charset), or by some other method.
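A rough sketch of that extraction step (the regular expressions here are simplifications I am assuming, not an exhaustive parser — real documents can declare the charset via the HTTP Content-Type header, an http-equiv meta tag, or the HTML5 meta charset form):

```javascript
// Guess the document charset: prefer the HTTP header, then a <meta>
// tag found in the (possibly mis-decoded) markup, then a default.
function sniffCharset(contentTypeHeader, markup) {
  const fromHeader = /charset=["']?([\w-]+)/i.exec(contentTypeHeader || "");
  if (fromHeader) return fromHeader[1];
  const fromMeta = /<meta[^>]+charset=["']?([\w-]+)/i.exec(markup || "");
  if (fromMeta) return fromMeta[1];
  return "UTF-8"; // fall back to the assumed default
}
```

Note this handles both `<meta http-equiv="Content-Type" content="...; charset=...">` and `<meta charset="...">` forms, but only approximately.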

var unicodeConverter = Components.classes["@mozilla.org/intl/scriptableunicodeconverter"]
    .createInstance(Components.interfaces.nsIScriptableUnicodeConverter);
unicodeConverter.charset = charset; // the charset detected above
var str = unicodeConverter.ConvertToUnicode(xhr.responseText); // raw bytes in, Unicode string out

A proper Unicode string is finally produced. :-)

Then I simply parse it into a body element, and my need is met.

Other brilliant ideas are still welcome. Feel free to object to my answer with sufficient reason. :-)
