Ruby Mechanize screen scraping help
I am trying to scrape a row in a table with a date. I want to scrape only the third row, which has today's date.
This is my Mechanize code. I am trying to select the row that has today's date, along with its columns:
agent.page.search("//td").map(&:text).map(&:strip)
Output:
"11-02-2011", "1", "1", "1", "1", "0", "0,00 DKK", "0,00", "0,00 DKK",
"12-02-2011", "5", "5", "1", "4", "0", "0,00 DKK", "0,00", "0,00 DKK",
"14-02-2011", "1", "3", "1", "1", "0", "0,00 DKK", ",00", "0,00 DKK",
"7", "9", "3", "6", "0", "0,00 DKK", "0,00", "0,00 DKK"
I only want to scrape the third row, which has today's date.
Rather than looping over the <td> tags using '//td', search for the <tr> tags, grab only the third one, then loop over that row's cells with 'td'.
Mechanize uses Nokogiri internally, so here's how to do it in Nokogiri-ese:
html = <<EOT
<table>
<tr><td>11-02-2011</td><td>1</td><td>1</td><td>1</td><td>1</td><td>0</td><td>0,00 DKK</td><td>0,00</td><td>0,00 DKK</td></tr>
<tr><td>12-02-2011</td><td>5</td><td>5</td><td>1</td><td>4</td><td>0</td><td>0,00 DKK</td><td>0,00</td><td>0,00 DKK</td></tr>
<tr><td>14-02-2011</td><td>1</td><td>3</td><td>1</td><td>1</td><td>0</td><td>0,00 DKK</td><td>,00</td><td>0,00 DKK</td></tr>
</table>
EOT
require 'nokogiri'
require 'pp'
doc = Nokogiri::HTML(html)
pp doc.search('//tr')[2].search('td').map{ |n| n.text }
>> ["14-02-2011", "1", "3", "1", "1", "0", "0,00 DKK", ",00", "0,00 DKK"]
Use .search('//tr')[2].search('td').map{ |n| n.text } appended to Mechanize's agent.page, like so:
agent.page.search('//tr')[2].search('td').map{ |n| n.text }
It's been a while since I played with Mechanize, so it might need to be agent.page.parser... instead of agent.page.
EDIT:
More rows will be added to the table. The row that I want to scrape is always the second to last.
It's important to put that information into your original question. The more accurate your question, the more accurate our answers.