Nokogiri: Parsing Irregular "<"

I am trying to use Nokogiri to parse the following segment:

<tr>
 <th>Total Weight</th>
 <td>< 1 g</td>
 <td style="text-align: right">0 %</td>

</tr>             
<tr><td class="skinny_black_bar" colspan="3"></td></tr>

However, I think the "<" sign in "< 1 g" is causing Nokogiri problems. Does anyone know any workarounds? Is there a way I can escape the "<" sign? Or maybe there is a function I can call to just get the plain HTML segment?


As a quick fix I came up with this method, which uses a regular expression to identify unclosed tags:

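def fix_irregular_html(html)
  # Matches a "<" that reaches another "<" or the end of a line
  # without a closing ">" in between, i.e. a "<" that opens no real tag.
  regexp = /<([^<>]*)(<|$)/

  # Repeat until nothing changes: gsub consumes the trailing "<" of each
  # match, so an unclosed "<" right after it is only caught on the next pass.
  while (fixed_html = html.gsub(regexp, "&lt;\\1\\2")) && fixed_html != html
    html = fixed_html
  end

  fixed_html
end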

See full code including test here: https://gist.github.com/796571

It works well for me. I would appreciate any feedback or improvements.
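For comparison, here is a single-pass variant of the same idea, offered only as a sketch. It assumes that any "<" not followed by a letter, "/", "!" or "?" cannot start a tag and is therefore safe to escape. The helper name escape_stray_lt is made up for this example and is not part of Nokogiri.

```ruby
# Hypothetical helper, not part of Nokogiri: escape every "<" that cannot
# start a tag, closing tag, comment or processing instruction, i.e. one
# that is not followed by a letter, "/", "!" or "?".
def escape_stray_lt(html)
  html.gsub(%r{<(?![A-Za-z/!?])}, "&lt;")
end

puts escape_stray_lt("<td>< 1 g</td>")
# prints: <td>&lt; 1 g</td>
```

The trade-off is that text such as "a <b" still looks like the start of a tag to this heuristic and is left alone, so it only covers cases like "< 1 g" where the "<" is followed by a space, digit or other non-tag character.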


A bare "less than" sign (<) in text content isn't legal HTML; it should be written as &lt;. Browsers, however, have a lot of code for figuring out what was meant by the HTML instead of just displaying an error. That's why your invalid HTML sample displays the way you'd want it to in browsers.

So the trick is to make sure Nokogiri does the same work to compensate for bad HTML. Make sure to parse the file as HTML instead of XML:

require 'nokogiri'
doc = File.open("table.html") { |f| Nokogiri::HTML(f) }

This parses your file without raising an exception, but it throws away the "< 1 g" text. Look at how the content of the first two TD elements is parsed:

doc.xpath('(//td)[1]/text()').to_s
=> "\n "

doc.xpath('(//td)[2]/text()').to_s
=> "0 %"

Nokogiri threw out your invalid text, but kept parsing the surrounding structure. You can even see the error message from Nokogiri:

doc.errors
=> [#<Nokogiri::XML::SyntaxError: htmlParseStartTag: invalid element name>]
doc.errors[0].line
=> 3

Yup, line 3 is bad.
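If you want to find such lines without going through Nokogiri's error list, a rough sketch like the following can help. The helper name stray_lt_lines and the "not followed by a letter, /, ! or ?" heuristic are assumptions made for this example; this is not how Nokogiri or libxml2 detects the error.

```ruby
# Hypothetical helper, not part of Nokogiri: returns the 1-based numbers of
# lines containing a "<" that does not look like the start of a tag.
STRAY_LT = %r{<(?![A-Za-z/!?])}

def stray_lt_lines(html)
  html.each_line
      .with_index(1)                               # pair each line with its 1-based number
      .select { |line, _| line.match?(STRAY_LT) }  # keep lines with a stray "<"
      .map { |_, number| number }
end

sample = <<~HTML
  <tr>
   <th>Total Weight</th>
   <td>< 1 g</td>
   <td style="text-align: right">0 %</td>
  </tr>
HTML

p stray_lt_lines(sample)
# prints: [3]
```

For the sample from the question this reports line 3, the same line Nokogiri complains about.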

So it seems like Nokogiri doesn't have the same level of support for parsing invalid HTML as browsers do. I recommend using some other library to pre-process your files. I tried running TagSoup on your sample file and it fixed the < by changing it to &lt; like so:

% java -jar tagsoup-1.1.3.jar foo.html | xmllint --format -
src: foo.html
<?xml version="1.0" standalone="yes"?>
<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <table>
      <tbody>
        <tr>
          <th colspan="1" rowspan="1">Total Weight</th>
          <td colspan="1" rowspan="1">&lt;1 g</td>
          <td colspan="1" rowspan="1" style="text-align: right">0 %</td>
        </tr>
        <tr>
          <td colspan="3" rowspan="1" class="skinny_black_bar"/>
        </tr>
      </tbody>
    </table>
  </body>
</html>