Is there a class I can use to extract elements from messy HTML?
I've got a requirement to grab text out of some pretty messy HTML. Let's say I need the 3rd list item from the first list in the page. The li elements may or may not have closing tags, may be in mixed case, may have classes, etc.
I was wondering whether, in a console application, it is possible to use a class (DOMDocument???) to load the HTML into a DOM, which would at least sanitize it somewhat, and then parse the text out of there.
This seems like something that should be solved already, but I've not found anything too relevant except this vintage regex solution: http://www.vsj.co.uk/articles/display.asp?id=389
Any thoughts on whether this is a good approach, and on the correct classes to investigate, would be appreciated.
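The Html Agility Pack can be used to work with 'messy' HTML in a DOM fashion. It is a .NET library that loads malformed markup into a document tree you can then query, for example with XPath.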
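As a rough illustration of the case in the question (3rd list item of the first list), here is a minimal sketch for a console application. It assumes the HtmlAgilityPack NuGet package is referenced. The sample markup is made up for the example, and enabling `OptionFixNestedTags` is my assumption about how best to cope with unclosed `li` tags, so check the result against your real pages.

```csharp
using System;
using HtmlAgilityPack; // assumes the HtmlAgilityPack NuGet package is installed

class Program
{
    static void Main()
    {
        // Hypothetical messy input: mixed-case tags, a class attribute,
        // and most <li> elements left unclosed.
        const string html =
            "<html><body><UL><LI class=\"a\">one<li>two<Li>three</LI><li>four</UL></body></html>";

        var doc = new HtmlDocument();
        // Assumption: ask the parser to repair nesting for LI/TR/TH/TD
        // so that each unclosed <li> ends up as a sibling, not a child.
        doc.OptionFixNestedTags = true;
        doc.LoadHtml(html);

        // Element names are matched in lower case in the XPath expressions.
        // First list in document order, whether it is a <ul> or an <ol>.
        HtmlNode firstList = doc.DocumentNode.SelectSingleNode("(//ul | //ol)[1]");

        // SelectSingleNode returns null when nothing matches, so guard both steps.
        HtmlNode thirdItem = firstList?.SelectSingleNode("./li[3]");

        if (thirdItem == null)
        {
            Console.WriteLine("No third list item found.");
            return;
        }

        // InnerText still contains entities such as &amp;, so decode them.
        string text = HtmlEntity.DeEntitize(thirdItem.InnerText).Trim();
        Console.WriteLine(text);
    }
}
```

To run this against a file or a downloaded page instead of a literal string, `doc.Load(path)` or `new HtmlWeb().Load(url)` can take the place of `LoadHtml`.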