Create XML object from poorly formatted HTML

2023-02-02 05:21 问答作者：

I want to make an XML document from an HTML one so I can use the XML parsing tools. My problem is that my HTML is not guaranteed to be XHTML nor valid. How can I bypass the exceptions? In this string <p> is 开发者_StackOverflow社区not terminated, nor is <br> nor <meta>.

var poorHtml:String = "<html><meta content=\"stuff\" name=\"description\"><p>Hello<br></html>";
var html:XML = new XML(poorHtml);

TypeError: Error #1085: The element type "meta" must be terminated by the matching end-tag "</meta>".

I did some searching and couldn't come up with anything except this doesn't really seem possible, the major issue is how should it correct when the format is not valid.

In the case of browsers, every browser does this based upon it's own rules of what should happen in the case that the closing tag isn't found (put it in wherever it would cause the code to produce a valid XML and subsequently DOM tree, or self terminate the tag, or remove the tag, or for the case that a closing tag was found with no opening how should this be handled, what about unclosed attributes etc.).

Unfortunately I don't know of anything in the specification that explains what should be done in this case, with XHTML just like how flex treats it these are fatal errors and result in no functionality rather than how HTML4 treated it with the quirky and transitional DTD options.

To avoid the error or give better error messaging you can use this:

var poorHtml:String = "<html><meta content=\"stuff\" name=\"description\"><p>Hello<br></html>";

try
{
    var html:XML = new XML(poorHtml);
}
catch(e:TypeError)
{
    trace("error caught")
}

but it's likely you'll be best off using some sort of server side script to validate the XML or correct the XML before passing it over to the client.

There is probably an implementation of HTML Tidy in just about any language you might happen to be working with. This looks promising for your sitation: http://code.google.com/p/as3htmltidylib/

If you don't want to drag in a whole library (I wouldn't), you could just write your own XML parser that handles errors in whatever way suits you (I'd suggest auto-closing tags until the document makes sense again, ignoring end tags with no start tags, maybe un-closing certain special tags such as "body" and "html"). This has the added advantage that you can optimize it for whatever jobs you need it for, i.e. by storing a list of all elements with the attribute "href" as you come to them.

You could try to pass your HTML through HTML Tidy on the server before loading it. I believe that HTML Tidy does a good job at cleaning up broken HTML.

继续阅读：actionscript apache-flex

Create XML object from poorly formatted HTML

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？