
Is it possible to fix HTML that has unescaped < and > characters?

For example, if I have this HTML:

<div>this is a test < text</div>

The < after "test" is an error, and the correct HTML should be:

<div>this is a test &lt; text</div>

But I have a lot of HTML files that were not encoded by mistake, and I need to fix this error so I can parse them later. The original source of the data is not available, so the only option is to fix the HTML I have.

The same applies to the > character and to text that has both < and > characters, like "<2000> - <2004>". I would like to hear ideas for algorithms or libraries that can help me. Thanks.

Note: the HTML above is only a sample; the work has to be done on big HTML files.


I'd suggest this:

1) Identify and map the locations of all known tags, like <div> and </a>.

2) Replace < and > with &lt; and &gt; everywhere outside the map you built in step 1.
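The two steps above can be sketched in Python roughly as follows. This is only an illustration: the `KNOWN_TAGS` whitelist and the function names are assumptions made for the example, and text that happens to look like a known tag (for instance "a <b and c> d") would still be treated as markup.

```python
import re

# Hypothetical whitelist; extend it with every tag that occurs in your files.
KNOWN_TAGS = ("div", "span", "p", "a", "b", "i", "table", "tr", "td")

# An opening or closing known tag, with optional attributes.
TAG_RE = re.compile(
    r"</?(?:%s)\b[^<>]*>" % "|".join(KNOWN_TAGS),
    re.IGNORECASE,
)


def escape_brackets(text):
    """Escape stray angle brackets in a piece of plain text."""
    return text.replace("<", "&lt;").replace(">", "&gt;")


def escape_outside_tags(source):
    """Keep recognised tags as they are and escape < and > everywhere else."""
    out = []
    pos = 0
    for match in TAG_RE.finditer(source):
        # Everything between the previous tag and this one is text.
        out.append(escape_brackets(source[pos:match.start()]))
        out.append(match.group(0))
        pos = match.end()
    # Text after the last recognised tag.
    out.append(escape_brackets(source[pos:]))
    return "".join(out)


if __name__ == "__main__":
    print(escape_outside_tags("<div>this is a test < text</div>"))
```

Because it only touches < and >, entities that are already correct (such as an existing &lt;) are left alone.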


1) For all known HTML tags, replace < and > with some other characters, like {{{ and }}}. You can use a regex more or less like this (the attribute part is optional, and the closing slash is kept inside the captured group):

Regex.Replace(source, "<(/?(?:b|a|i|table|td|all|other|known|html|tags)(?: [^>]*)?)>", "{{{$1}}}");

2) Replace every remaining < with &lt; and every remaining > with &gt;

3) Replace {{{ with < and }}} with >
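For comparison, here is a rough Python sketch of the same three steps. The tag list and the function name are assumptions made for the example. It also uses two private-use code points as markers instead of {{{ and }}}, so that a tag ending in a brace cannot be confused with a marker; that is a choice of this sketch, not part of the steps above.

```python
import re

# Hypothetical whitelist; list every tag that really occurs in your files.
KNOWN_TAGS = ("div", "b", "a", "i", "table", "td")

# Single-character markers from the Unicode private-use area.
OPEN, CLOSE = "\ue000", "\ue001"

# Captures the inside of a known tag, including a leading "/" if present.
PROTECT_RE = re.compile(
    r"<(/?(?:%s)\b[^<>]*)>" % "|".join(KNOWN_TAGS),
    re.IGNORECASE,
)


def fix_with_placeholders(source):
    if OPEN in source or CLOSE in source:
        raise ValueError("marker character already occurs in the input")
    # Step 1: protect known tags by swapping their brackets for markers.
    protected = PROTECT_RE.sub(lambda m: OPEN + m.group(1) + CLOSE, source)
    # Step 2: every bracket still present is stray text, so escape it.
    escaped = protected.replace("<", "&lt;").replace(">", "&gt;")
    # Step 3: turn the markers back into real brackets.
    return escaped.replace(OPEN, "<").replace(CLOSE, ">")


if __name__ == "__main__":
    print(fix_with_placeholders('<td class="x">a > b</td>'))
```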


Using a "relaxed" HTML parser like the HTML Agility Pack for .NET would be a nice fit. You grab the tree as interpreted by the library and then, in each node's text value, replace < and > with their entity counterparts (&lt; and &gt;).

For an example, see the related question "IronPython, Beautiful Soup, win32 app".
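The same idea can be tried without third-party libraries. The sketch below swaps in Python's standard `html.parser` module in place of the HTML Agility Pack; it is not the library named above, and the class and function names are made up for the example. It relies on the parser reporting a < that does not start a tag as ordinary text, and it shares the limitation that text which looks like a tag (such as `<text>`) is still treated as one.

```python
from html import escape
from html.parser import HTMLParser


class ReEscaper(HTMLParser):
    """Re-emit a document, escaping < and > found in text nodes."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.in_raw_text = False  # True while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        # Emit the tag exactly as it appeared in the input.
        self.parts.append(self.get_starttag_text())
        if tag in ("script", "style"):
            self.in_raw_text = True

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are emitted once, unchanged.
        self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.parts.append("</%s>" % tag)
        if tag in ("script", "style"):
            self.in_raw_text = False

    def handle_data(self, data):
        # Script and style bodies are copied; other text is escaped.
        self.parts.append(data if self.in_raw_text else escape(data, quote=False))

    def handle_comment(self, data):
        self.parts.append("<!--%s-->" % data)

    def handle_decl(self, decl):
        self.parts.append("<!%s>" % decl)


def fix_html(source):
    parser = ReEscaper()
    parser.feed(source)
    parser.close()
    return "".join(parser.parts)


if __name__ == "__main__":
    print(fix_html("<div>this is a test < text</div>"))
```

Since the parser decodes entities in text before the handler re-escapes them, an existing &lt; or &amp; comes out in escaped form again, while other named entities may be emitted as the literal character.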


A slow way to do it would be to treat each HTML file as an XML file, walk through each node of that XML document, and call Server.HtmlEncode on the contents of the node. Note that this only works if the files are well-formed XML (XHTML, for instance); HTML in general is not XML, and a strict XML parser will usually reject a stray < in text.
