How do I search for "text" then traverse the DOM from the found node?
I have a webpage that I need to scrape some data from. The problem is that each page may or may not have the specific data, there may be extra data above or below it in the DOM, and there are no CSS ids to speak of.
Typically I could use either CSS ids or XPath to get to the node I'm looking for, but I don't have that option in this case. What I'm trying to do is search for the "label" text and then grab the data in the next <td> node:
<tr>
<td><b>Name:</b></td>
<td>Joe Smith <small><a href="/Joe"><img src="/joe.png"></a></small></td>
</tr>
In the above HTML, I would search for:
doc.search("[text()*='Name:']")
to get the node just before the data I need, but I'm not sure how to navigate from there.
next_element is probably the method you're looking for.
require 'nokogiri'
data = File.read "html.htm"
doc = Nokogiri::HTML data
els = doc.search "[text()*='Name:']"
el = els.first
puts "Found element:"
puts el
puts
puts "Parent element:"
puts el.parent
puts
puts "Parent's next_element():"
puts el.parent.next_element
# Output:
#
# Found element:
# <b>Name:</b>
#
# Parent element:
# <td>
# <b>Name:</b>
# </td>
#
# Parent's next_element():
# <td>Joe Smith <small><a href="/Joe"><img src="/joe.png"></a></small>
# </td>
Note that since the text is inside <b></b> tags, you have to go up a level (to the found element's parent <td>) before you can get to the next sibling. If the HTML structure isn't stable, you'd have to find the nearest ancestor that is a <td> and go from there.
require 'nokogiri'
html = '
<html>
<body>
<p>foo</p>
this text
<p>bar</p>
</body>
</html>
'
doc = Nokogiri::HTML(html)
doc.at('p:contains("foo")').next_sibling.text.strip
=> "this text"
You can do the entire search in a single statement using XPath's parent step (..) and following-sibling axis:
>> require 'nokogiri'
=> true
>> html = <<HTML
<tr>
<td><b>Name:</b></td>
<td>Joe Smith <small><a href="/Joe"><img src="/joe.png"></a></small></td>
</tr>
HTML
>> doc = Nokogiri::HTML(html)
>> doc.at_xpath("//*[text()='Name:']/../following-sibling::*").to_s
=> "<td>Joe Smith <small><a href=\"/Joe\"><img src=\"/joe.png\"></a></small>\n</td>"