Is there a class I can use to extract elements from messy HTML?

I've got a requirement to grab text out of some pretty messy HTML. Let's say I need the 3rd list item from the first list in the page. The li elements may or may not have closing tags, the tag names may be in mixed case, and they may carry classes and other attributes.

I was wondering whether, in a console application, it is possible to use a class (DOMDocument???) to load the HTML into a DOM, which would at least sanitize it somewhat, and then parse the text out of there.

This seems like something that should be solved already, but I've not found anything very relevant except this vintage regex solution: http://www.vsj.co.uk/articles/display.asp?id=389

Any thoughts on whether this is a good approach, and on the correct classes to investigate, would be appreciated.


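The Html Agility Pack can be used to work with 'messy' HTML in a DOM fashion. It is a .NET library that tolerates malformed markup, so you can load the page and query the resulting nodes instead of matching the raw text with regular expressions.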
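To illustrate the general idea of using a tolerant parser rather than regular expressions, here is a minimal sketch in Python using only the standard library's `html.parser`. This is not the Html Agility Pack and not its API; the class name `FirstListItems` and the helper `nth_item_of_first_list` are made up for this example. It assumes "list" means a `ul` or `ol` element and that a new `li` implicitly ends the previous one.

```python
from html.parser import HTMLParser


class FirstListItems(HTMLParser):
    """Collect the text of each top-level <li> in the first <ul>/<ol>.

    Hypothetical helper class for illustration only.
    """

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.items = []         # text of each item in the first list
        self._depth = 0         # nesting depth of <ul>/<ol> elements
        self._done = False      # True once the first list has closed
        self._in_item = False   # True while inside a top-level <li>

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag names, so <LI> and <li> look the same.
        if self._done:
            return
        if tag in ("ul", "ol"):
            self._depth += 1
        elif tag == "li" and self._depth == 1:
            # A new <li> implicitly ends the previous one (missing </li>).
            self.items.append("")
            self._in_item = True

    def handle_endtag(self, tag):
        if self._done:
            return
        if tag in ("ul", "ol") and self._depth > 0:
            self._depth -= 1
            if self._depth == 0:
                self._done = True
                self._in_item = False
        elif tag == "li" and self._depth == 1:
            self._in_item = False

    def handle_data(self, data):
        # Text inside nested tags (e.g. <b>) still belongs to the item.
        if self._in_item and not self._done:
            self.items[-1] += data


def nth_item_of_first_list(html, n):
    """Return the text of the n-th (1-based) item of the first list, or None."""
    parser = FirstListItems()
    parser.feed(html)
    parser.close()
    items = [" ".join(text.split()) for text in parser.items]
    return items[n - 1] if 0 < n <= len(items) else None


messy = (
    '<p>intro</p><UL><li class="a">one<LI>two</li>'
    '<Li id=x> three <b>bold</b>\n<li>four</UL>'
    '<ul><li>other</li></ul>'
)
third = nth_item_of_first_list(messy, 3)
```

The sample input mixes tag case, omits some closing tags, and uses an unquoted attribute, which are the kinds of problems described in the question. In .NET, the Html Agility Pack takes the DOM-style route instead: you load the markup into its document object and select nodes from the resulting tree.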
