Ignore malformed XML with Perl-XML

I'm using the Perl command-line utility xpath to extract data from some HTML code as follows:

#!/bin/bash
# Quote the variable so the shell does not word-split or glob the HTML.
echo "$HTML" | xpath -q -e "//h2[1]"

The HTML is malformed, which causes xpath to throw the error below:

not well-formed (invalid token) at line X, column Y, byte Z:

I can't really fix the HTML since it's provided by an external source, which means every time the HTML is changed I would have to fix it manually again.

I looked at the xpath man page, which is pretty empty: http://www.linuxcertif.com/man/1/xpath.1p/

I was wondering whether there is a way to tell xpath to ignore the malformed HTML. To give you an idea of how malformed it is, here are a few lines from the source code:

<div id="header-background" style="top: 42px; >&nbsp;</div> <---- missing closing "
<div id-"page-inner">   <---- - instead of =

Thanks


Try HTML::TreeBuilder::XPath, which uses an HTML parser to build a document tree that can then be queried with XPath expressions. An HTML parser should cope with malformed markup that a strict XML parser rejects.

Also see this article on HTML Scraping with XPath.
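
As a rough illustration of the HTML::TreeBuilder::XPath suggestion, here is a minimal sketch. It assumes the module is installed from CPAN; the helper name xpath_text and the sample heading text are made up for this example, and the two div lines are the ones quoted in the question. It is one possible way to wire things together, not the only one.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Hypothetical helper: return the text of every node matching $xpath in $html.
sub xpath_text {
    my ($html, $xpath) = @_;

    # The HTML parser recovers from broken markup instead of aborting.
    my $tree = HTML::TreeBuilder::XPath->new;
    $tree->parse($html);
    $tree->eof;

    my @text = map { $_->as_text } $tree->findnodes($xpath);
    $tree->delete;    # release the tree's memory
    return @text;
}

# Sample input with the two defects quoted in the question.
my $html = <<'HTML';
<div id="header-background" style="top: 42px; >&nbsp;</div>
<div id-"page-inner">
<h2>First heading</h2>
</div>
HTML

print "$_\n" for xpath_text($html, '//h2[1]');
```

To use this from the original bash script, the heredoc could be replaced with HTML read from standard input (for example `my $html = do { local $/; <STDIN> };`), so that `echo "$HTML" | perl script.pl` takes the place of the xpath call.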


xml_grep, a command-line tool that comes with XML::Twig, can be used to extract data from HTML using XPath. Normally it works on XML, but you can use the -html option to process HTML (under the hood it uses HTML::TreeBuilder to convert the HTML to XML).

For example:

> xml_grep -html -t 'a[@class="genu"]' http://stackoverflow.com
> Stack Exchange