Nokogiri: Parsing Irregular "<"
I am trying to use Nokogiri to parse the following segment:
<tr>
<th>Total Weight</th>
<td>< 1 g</td>
<td style="text-align: right">0 %</td>
</tr>
<tr><td class="skinny_black_bar" colspan="3"></td></tr>
However, I think the "<" sign in "< 1 g" is causing Nokogiri problems. Does anyone know any workarounds? Is there a way I can escape the "<" sign? Or maybe there is a function I can call to just get the plain HTML segment?
As a quick fix I came up with this method, which uses a regular expression to identify unclosed tags:
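def fix_irregular_html(html)
  # A "<" whose text runs into another "<" (or the end of a line) without
  # a closing ">" cannot be a tag, so it is escaped as an entity.
  regexp = /<([^<>]*)(<|$)/
  # We need to do this multiple times because the matches overlap
  while (fixed_html = html.gsub(regexp, "&lt;\\1\\2")) && fixed_html != html
    html = fixed_html
  end
  fixed_html
end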
See full code including test here: https://gist.github.com/796571
It works well for me. I would appreciate any feedback and improvements.
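As one possible refinement (a sketch, not part of the gist above), the repeated passes can be avoided with a negative lookahead. It assumes that a "<" only starts markup when it is followed by a letter, "/", "!" or "?", which is how HTML tokenizers decide, and treats every other "<" as literal text. The helper name escape_stray_lt is made up for this illustration.

```ruby
# Hypothetical helper: escape each "<" that cannot begin a tag, an end tag,
# a comment/doctype ("<!") or a processing instruction ("<?").
def escape_stray_lt(html)
  html.gsub(/<(?![A-Za-z\/!?])/, "&lt;")
end

sample = "<tr><td>< 1 g</td><td>0 %</td></tr>"
puts escape_stray_lt(sample)
# prints: <tr><td>&lt; 1 g</td><td>0 %</td></tr>
```

Because the lookahead consumes nothing, one gsub call is enough. The trade-off is that text such as "a<b" is left alone, since "<b" looks like the start of a tag, and the contents of script blocks or comments are not treated specially.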
An unescaped "less than" sign (<) in text isn't legal HTML, but browsers have a lot of code for figuring out what was meant by the HTML instead of just displaying an error. That's why your invalid HTML sample displays the way you'd want it to in browsers.
So the trick is to make sure Nokogiri does the same work to compensate for bad HTML. Make sure to parse the file as HTML instead of XML:
f = File.open("table.html")
doc = Nokogiri::HTML(f)
This parses your file just fine, but throws away the "< 1 g" text. Look at how the content of the first two TD elements is parsed:
doc.xpath('(//td)[1]/text()').to_s
=> "\n "
doc.xpath('(//td)[2]/text()').to_s
=> "0 %"
Nokogiri threw out your invalid text, but kept parsing the surrounding structure. You can even see the error message from Nokogiri:
doc.errors
=> [#<Nokogiri::XML::SyntaxError: htmlParseStartTag: invalid element name>]
doc.errors[0].line
=> 3
Yup, line 3 is bad.
So it seems like Nokogiri doesn't have the same level of support for parsing invalid HTML as browsers do. I recommend using some other library to pre-process your files. I tried running TagSoup on your sample file and it fixed the < by changing it to &lt; like so:
% java -jar tagsoup-1.1.3.jar foo.html | xmllint --format -
src: foo.html
<?xml version="1.0" standalone="yes"?>
<html xmlns="http://www.w3.org/1999/xhtml">
<body>
<table>
<tbody>
<tr>
<th colspan="1" rowspan="1">Total Weight</th>
<td colspan="1" rowspan="1">&lt;1 g</td>
<td colspan="1" rowspan="1" style="text-align: right">0 %</td>
</tr>
<tr>
<td colspan="3" rowspan="1" class="skinny_black_bar"/>
</tr>
</tbody>
</table>
</body>
</html>