Is it possible to fix HTML that has unescaped < and > characters?
For example, if I have this HTML:
<div>this is a test < text</div>
The < after "test" is an error, and the correct HTML should be:
<div>this is a test &lt; text</div>
But I have a lot of HTML files that were not encoded by mistake, and I need to fix this error so I can parse them later. The original source of the data is not available, so the only option is to fix the HTML I have.
The same applies to the > character and to text that has both < and > characters, like "<2000> - <2004>". I would like to hear ideas for algorithms or libraries that can help me. Thanks.
Note: the HTML above is just a sample; the work has to be done on big HTML files.
I'd suggest this:
1) Identify and map the locations of all known tags, like <div> and </a>.
2) Replace < with &lt; and > with &gt; everywhere outside the map you built in step 1.
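The two steps above could be sketched roughly like this. It is in Python rather than .NET, and the tag list, `KNOWN_TAGS`, `escape_text` and `escape_outside_tags` are names made up for the illustration, so treat it as a starting point under those assumptions rather than a finished tool. The "map" is simply the sequence of regex matches for known tags; everything between two matches is treated as text and escaped.

```python
import re

# Hypothetical whitelist; extend it with every tag that occurs in your files.
KNOWN_TAGS = ("div", "p", "a", "b", "i", "span", "table", "tr", "td")

# A known tag: "<", optional "/", tag name, then attributes up to the next ">".
# "<" is excluded from the attribute part so a stray "<b" cannot swallow a real tag.
TAG_RE = re.compile(r"</?(?:%s)\b[^<>]*>" % "|".join(KNOWN_TAGS), re.IGNORECASE)


def escape_text(text):
    """Escape angle brackets in a piece of text that is not a tag."""
    return text.replace("<", "&lt;").replace(">", "&gt;")


def escape_outside_tags(source):
    """Keep known tags as they are and escape < and > everywhere else."""
    parts = []
    last = 0
    for match in TAG_RE.finditer(source):            # step 1: locate known tags
        parts.append(escape_text(source[last:match.start()]))  # step 2: fix the gap
        parts.append(match.group(0))                 # the tag itself is untouched
        last = match.end()
    parts.append(escape_text(source[last:]))         # text after the last tag
    return "".join(parts)


print(escape_outside_tags("<div>this is a test < text</div>"))
```

Anything that looks like a known tag is kept, so text such as "<b" immediately followed by attribute-like content and a ">" would still be mistaken for markup. Tags missing from the whitelist get escaped as well.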
1) For all known HTML tags, replace the enclosing < and > with some other characters, like {{{ and }}}. You can use a regex more or less like this:
// Keeps the optional "/" and any attributes inside group 1.
Regex.Replace(source, @"<(/?(?:b|a|i|table|td|all|other|known|html|tags)\b[^<>]*)>", "{{{$1}}}");
2) Replace < with &lt; and > with &gt;
3) Replace {{{ with < and }}} with >
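For comparison, here is a rough Python version of the same three steps. The tag list and the function name `fix_with_placeholders` are made up for the example. One deliberate change from the steps above: instead of {{{ and }}} it uses two control characters as placeholders, on the assumption that they never appear in real HTML text, because a literal { sitting right next to a tag would make the {{{ marker ambiguous when it is turned back.

```python
import re

# Hypothetical whitelist; extend it with every tag that occurs in your files.
KNOWN_TAGS = ("div", "p", "a", "b", "i", "span", "table", "tr", "td")

# Group 1 holds everything between the angle brackets of a known tag.
TAG_RE = re.compile(r"<(/?(?:%s)\b[^<>]*)>" % "|".join(KNOWN_TAGS), re.IGNORECASE)

# Assumed-safe placeholders (swapped in for "{{{" and "}}}").
OPEN_MARK, CLOSE_MARK = "\x01", "\x02"


def fix_with_placeholders(source):
    if OPEN_MARK in source or CLOSE_MARK in source:
        raise ValueError("placeholder characters already occur in the input")
    # 1) Hide the brackets of known tags behind placeholders.
    protected = TAG_RE.sub(lambda m: OPEN_MARK + m.group(1) + CLOSE_MARK, source)
    # 2) Every bracket that is still visible is plain text, so escape it.
    escaped = protected.replace("<", "&lt;").replace(">", "&gt;")
    # 3) Turn the placeholders back into real brackets.
    return escaped.replace(OPEN_MARK, "<").replace(CLOSE_MARK, ">")


print(fix_with_placeholders("<p><2000> - <2004></p>"))
```

Entities that are already correct, such as &lt;, pass through unchanged because the ampersand is never touched.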
Using a "relaxed" HTML parser like the HTML Agility Pack for .NET would be a nice fit. You grab the tree as interpreted by the library and then, in each node value, replace < and > with their proper counterparts (&lt; and &gt;).
See here for an example: Iron python, beautiful soup, win32 app
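As a rough illustration of the lenient-parser idea, the sketch below uses Python's standard-library `html.parser` instead of the HTML Agility Pack or Beautiful Soup, and rebuilds the document from parser events instead of editing a tree. The class name `TextEscaper` and the function `escape_text_nodes` are invented for the example. It relies on that parser reporting a "<" that does not start a tag as ordinary text, which is worth checking against your own files before trusting it on a large batch.

```python
from html import escape
from html.parser import HTMLParser

RAW_TEXT_TAGS = {"script", "style"}  # their content must not be escaped


class TextEscaper(HTMLParser):
    """Re-emits the document, escaping <, > and bare & in text only."""

    def __init__(self):
        # convert_charrefs=False keeps existing entities such as &lt; intact.
        super().__init__(convert_charrefs=False)
        self.out = []
        self.in_raw_text = False

    def handle_starttag(self, tag, attrs):
        self.out.append(self.get_starttag_text())  # original tag text
        if tag in RAW_TEXT_TAGS:
            self.in_raw_text = True

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())  # e.g. <br/>

    def handle_endtag(self, tag):
        self.out.append("</%s>" % tag)  # note: tag name comes back lower-cased
        if tag in RAW_TEXT_TAGS:
            self.in_raw_text = False

    def handle_data(self, data):
        self.out.append(data if self.in_raw_text else escape(data, quote=False))

    def handle_entityref(self, name):
        self.out.append("&%s;" % name)

    def handle_charref(self, name):
        self.out.append("&#%s;" % name)

    def handle_comment(self, data):
        self.out.append("<!--%s-->" % data)

    def handle_decl(self, decl):
        self.out.append("<!%s>" % decl)

    def handle_pi(self, data):
        self.out.append("<?%s>" % data)


def escape_text_nodes(source):
    parser = TextEscaper()
    parser.feed(source)
    parser.close()  # flush any text still buffered
    return "".join(parser.out)


print(escape_text_nodes("<div>this is a test < text</div>"))
```

No tag whitelist is needed here, but the parser decides what counts as a tag: a "<" directly followed by a letter is still read as the start of one, so text like "x<y" can be misinterpreted.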
A slow way to do it would be to treat each HTML file as an XML file, then walk through each node of that XML file and do a Server.HTMLEncode on the contents of the node. This only works if the files can actually be read as XML (as XHTML can); HTML in general is not XML, and a strict XML parser will reject a stray < in text.