How does Nokogiri handle unclosed HTML tags like <br>?
When parsing an HTML document, how does Nokogiri handle <br> tags? Suppose we have a document that looks like this:
<div>
Hi <br>
How are you? <br>
</div>
Does Nokogiri know that <br> tags are special, not just regular XML tags, and ignore them when parsing the nodes? I think Nokogiri is that smart, but I want to make sure before I accept this project, which involves scraping a site written in HTML4. You know what I mean: "How are you?" is not the content of the first <br>, as it would be in XML.
Here's how Nokogiri behaves when parsing (malformed) XML:
require 'nokogiri'
doc = Nokogiri::XML("<div>Hello<br>World</div>")
puts doc.root
#=> <div>Hello<br>World</br></div>
Here's how Nokogiri behaves when parsing HTML:
require 'nokogiri'
doc = Nokogiri::HTML("<div>Hello<br>World</div>")
puts doc.root
#=> <html><body><div>Hello<br>World</div></body></html>
p doc.at('div').text
#=> "HelloWorld"
I'm assuming that by "something special" you mean that you want Nokogiri to treat it like a newline in the source text. A <br> is not something special in that sense, so Nokogiri appropriately does not treat it differently from any other element when extracting text.
If you want it to be treated as a newline, you can do this:
doc.css('br').each{ |br| br.replace("\n") }
p doc.at('div').text
#=> "Hello\nWorld"
Similarly, if you wanted a space instead:
doc.css('br').each{ |br| br.replace(" ") }
p doc.at('div').text
#=> "Hello World"
You must parse this fragment with the HTML parser, since it is obviously not valid XML. With the HTML parser, Nokogiri behaves as you'd expect:
require 'nokogiri'
doc = Nokogiri::HTML(<<-EOS
<div>
Hi <br>
How are you? <br>
</div>
EOS
)
doc.xpath("//br").each{ |e| puts e }
This prints:
<br>
<br>
Mechanize uses Nokogiri for its web scraping, so Nokogiri is quite appropriate for the task.
As far as I can remember from doing some HTML parsing last year, it'll view them as separate.
EDIT: My bad. I've just had someone send me the code and retested it; we ended up dealing with some things, including <br>, separately.