
Algorithms to fix broken HTML

I am looking for algorithms & data structures one would use to fix broken HTML. I know lots of built-in tools exist in every language to do this, but I want to learn how it is done. Some approaches I can think of are:

  1. Using Regular Expressions - seems like a naive approach
  2. Create a DOM - but how would a DOM tree get created from broken HTML?

UPDATE: I am expecting more of a general discussion, but referring to tools in C, C++, Python or Java is fine by me.

thanks


Parse the markup using the HTML 5 parsing algorithm (which is designed to handle brokenness), and build a DOM from it. You can then serialize back to HTML.
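To give a feel for the data structure involved, here is a minimal sketch of the "stack of open elements" idea that tree-building parsers rely on. It is not the HTML5 parsing algorithm itself, only a much simplified illustration built on Python's standard `html.parser`; the names `TagBalancer`, `repair` and `VOID_ELEMENTS` are made up for this example. It assumes that stray closing tags can be dropped and that any element still open when an ancestor closes (or at end of input) should be closed implicitly. Comments, doctypes and the special rules for `<script>`, tables, and so on are ignored.

```python
from html import escape
from html.parser import HTMLParser

# Elements that never take a closing tag (subset of the HTML void elements).
VOID_ELEMENTS = {"area", "base", "br", "col", "embed", "hr", "img",
                 "input", "link", "meta", "source", "track", "wbr"}


class TagBalancer(HTMLParser):
    """Re-serializes markup while keeping a stack of open elements."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.open_tags = []   # stack of currently open element names
        self.out = []         # pieces of the repaired output

    def handle_starttag(self, tag, attrs):
        rendered = "".join(
            f" {name}" if value is None
            else f' {name}="{escape(value, quote=True)}"'
            for name, value in attrs
        )
        self.out.append(f"<{tag}{rendered}>")
        if tag not in VOID_ELEMENTS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.open_tags:
            return  # stray closing tag with no matching open element: drop it
        # Close everything opened after the matching element, then the element.
        while self.open_tags:
            top = self.open_tags.pop()
            self.out.append(f"</{top}>")
            if top == tag:
                break

    def handle_data(self, data):
        # Text arrives unescaped, so escape it again when serializing.
        self.out.append(escape(data, quote=False))

    def finish(self):
        self.close()  # flush any buffered text
        while self.open_tags:  # close whatever is still open at end of input
            self.out.append(f"</{self.open_tags.pop()}>")
        return "".join(self.out)


def repair(html):
    balancer = TagBalancer()
    balancer.feed(html)
    return balancer.finish()


if __name__ == "__main__":
    print(repair("<div><p>Hello <b>world</div>"))
```

A real HTML5 parser layers many more rules on top of this stack (insertion modes, implied end tags, the adoption agency algorithm for misnested formatting elements), but the underlying structure is the same: a tokenizer feeding a tree builder that tracks which elements are currently open.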


RegEx + HTML = disaster.

There are just too many ways for HTML to be valid SGML yet break RegEx rules.

Really you need stateful SGML parsers. You don't mention what languages you're willing to work in, but there are many stateful SGML parsers out there.
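As a small illustration of why a stateless pattern falls over, the sketch below compares a naive tag regex with Python's standard `html.parser` on an attribute value that contains `>`. The sample string and the helper names (`regex_tags`, `parser_start_tags`, `StartTagCollector`) are invented for this example, and the regex is only one of many patterns people try.

```python
import re
from html.parser import HTMLParser

SAMPLE = '<a title="1 > 0">link</a>'

# Naive pattern: "a tag is everything between < and the next >".
NAIVE_TAG = re.compile(r"<[^>]+>")


class StartTagCollector(HTMLParser):
    """Records each start tag together with its parsed attributes."""

    def __init__(self):
        super().__init__()
        self.start_tags = []

    def handle_starttag(self, tag, attrs):
        self.start_tags.append((tag, dict(attrs)))


def regex_tags(html):
    return NAIVE_TAG.findall(html)


def parser_start_tags(html):
    collector = StartTagCollector()
    collector.feed(html)
    collector.close()
    return collector.start_tags


if __name__ == "__main__":
    # The regex stops at the ">" inside the quoted attribute value.
    print(regex_tags(SAMPLE))
    # The parser tracks quoting state and keeps the attribute intact.
    print(parser_start_tags(SAMPLE))
```

The regex cuts the tag off at the `>` inside the quotes, while the parser knows it is inside a quoted attribute value and carries on. The pattern can be patched for this one case, but comments, CDATA sections, unquoted attributes and nesting each need yet another patch, which is the state a real parser already keeps.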

In .NET we regularly use SGMLReader - a stateful parser that returns a well-formed DOM and/or XML DOM.

In C, W3C has a reasonable C SGML parser.

In Java there is a SAX-style SGML parser.


I agree that the regular expressions road is long and tortuous: it is much more robust and easier to use existing code designed specifically for reading broken HTML.

Since you mention Python, the Beautiful Soup parser reputedly handles broken HTML quite nicely.
