Create XML object from poorly formatted HTML
I want to make an XML document from an HTML one so I can use the XML parsing tools. My problem is that my HTML is not guaranteed to be XHTML nor valid. How can I bypass the exceptions? In this string <p>
is 开发者_StackOverflow社区not terminated, nor is <br>
nor <meta>
.
var poorHtml:String = "<html><meta content=\"stuff\" name=\"description\"><p>Hello<br></html>";
var html:XML = new XML(poorHtml);
TypeError: Error #1085: The element type "meta" must be terminated by the matching end-tag "</meta>".
I did some searching and couldn't come up with anything except this doesn't really seem possible, the major issue is how should it correct when the format is not valid.
In the case of browsers, every browser does this based upon it's own rules of what should happen in the case that the closing tag isn't found (put it in wherever it would cause the code to produce a valid XML and subsequently DOM tree, or self terminate the tag, or remove the tag, or for the case that a closing tag was found with no opening how should this be handled, what about unclosed attributes etc.).
Unfortunately I don't know of anything in the specification that explains what should be done in this case, with XHTML just like how flex treats it these are fatal errors and result in no functionality rather than how HTML4 treated it with the quirky and transitional DTD options.
To avoid the error or give better error messaging you can use this:
var poorHtml:String = "<html><meta content=\"stuff\" name=\"description\"><p>Hello<br></html>";
try
{
var html:XML = new XML(poorHtml);
}
catch(e:TypeError)
{
trace("error caught")
}
but it's likely you'll be best off using some sort of server side script to validate the XML or correct the XML before passing it over to the client.
There is probably an implementation of HTML Tidy in just about any language you might happen to be working with. This looks promising for your sitation: http://code.google.com/p/as3htmltidylib/
If you don't want to drag in a whole library (I wouldn't), you could just write your own XML parser that handles errors in whatever way suits you (I'd suggest auto-closing tags until the document makes sense again, ignoring end tags with no start tags, maybe un-closing certain special tags such as "body" and "html"). This has the added advantage that you can optimize it for whatever jobs you need it for, i.e. by storing a list of all elements with the attribute "href" as you come to them.
You could try to pass your HTML through HTML Tidy on the server before loading it. I believe that HTML Tidy does a good job at cleaning up broken HTML.
精彩评论