ignore malformed XML with Perl-XML

2023-02-06 08:18

I'm using the Perl command-line utility xpath to extract data from some HTML code as follows:

#!/bin/bash
echo "$HTML" | xpath -q -e "//h2[1]"

The HTML is malformed, which causes xpath to throw the error below:

not well-formed (invalid token) at line X, column Y, byte Z:

I can't really fix the HTML since it's provided by an external source, which means that every time the HTML changes I would have to fix it manually again.

I looked at the xpath man page, which is pretty empty: http://www.linuxcertif.com/man/1/xpath.1p/

Is there a way to tell xpath to ignore the malformed HTML? To give you an idea of how malformed it is, here are a few lines from the source code:

<div id="header-background" style="top: 42px; >&nbsp;</div> <---- missing closing "
<div id-"page-inner">   <---- - instead of =

Thanks

Try HTML::TreeBuilder::XPath, which uses an HTML parser to build a document tree that can then be queried with XPath expressions. An HTML parser should cope with malformed HTML.

Also see this article on HTML Scraping with XPath.
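As a rough sketch of how that could replace the xpath call in the original script: the snippet below pipes the HTML into a Perl one-liner that parses it with HTML::TreeBuilder::XPath and prints the text of the first h2. It assumes the module is installed from CPAN. The sample markup and the first_h2 function name are made up for illustration; only the two broken div lines come from the question. Note that findvalue returns the text content of the match rather than the serialized element that xpath prints.

```shell
#!/bin/bash
# Hypothetical sample input: includes the two malformed lines quoted in the
# question (missing closing quote, "-" instead of "=").
HTML='<html><body>
<div id="header-background" style="top: 42px; >&nbsp;</div>
<div id-"page-inner">
<h2>First heading</h2>
<h2>Second heading</h2>
</div>
</body></html>'

# first_h2 is an illustrative helper name, not part of any tool.
# Reads HTML on stdin and prints the text content of //h2[1].
first_h2() {
    perl -MHTML::TreeBuilder::XPath -e '
        my $html = do { local $/; <STDIN> };   # slurp all of stdin
        my $tree = HTML::TreeBuilder::XPath->new;
        $tree->parse($html);                   # lenient HTML parse, not strict XML
        $tree->eof;                            # tell the parser the input is complete
        print $tree->findvalue(q{//h2[1]}), "\n";
        $tree->delete;                         # free the tree
    '
}

printf '%s\n' "$HTML" | first_h2
```

If you need the element itself rather than its text, findnodes returns the matching nodes, and each one can be serialized with as_HTML.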

xml_grep, a command-line tool that comes with XML::Twig, can be used to extract data from HTML using XPath. Normally it works on XML, but you can use the -html option to process HTML (under the hood it uses HTML::TreeBuilder to convert the HTML to XML).

For example:

> xml_grep -html -t 'a[@class="genu"]' http://stackoverflow.com
> Stack Exchange
