
Ruby Mechanize screen scraping help

I am trying to scrape a row in a table that contains a date. I want to scrape only the third row, which has today's date.

This is my Mechanize code. I am trying to select the row that has today's date, along with its columns:

agent.page.search("//td").map(&:text).map(&:strip)

Output:
"11-02-2011", "1", "1", "1", "1", "0", "0,00 DKK", "0,00", "0,00 DKK", 
"12-02-2011", "5", "5", "1", "4", "0", "0,00 DKK", "0,00", "0,00 DKK", 
"14-02-2011", "1", "3", "1", "1", "0", "0,00 DKK", ",00", "0,00 DKK",
"7", "9", "3", "6", "0", "0,00 DKK", "0,00", "0,00 DKK"

I want to scrape only the third row, the one with today's date.


Rather than loop over every <td> tag using '//td', search for the <tr> tags, grab only the third one, then loop over the <td> tags inside that row.

Mechanize uses Nokogiri internally, so here's how to do it in Nokogiri-ese:

html = <<EOT
<table>
<tr><td>11-02-2011</td><td>1</td><td>1</td><td>1</td><td>1</td><td>0</td><td>0,00 DKK</td><td>0,00</td><td>0,00 DKK</td></tr>
<tr><td>12-02-2011</td><td>5</td><td>5</td><td>1</td><td>4</td><td>0</td><td>0,00 DKK</td><td>0,00</td><td>0,00 DKK</td></tr>
<tr><td>14-02-2011</td><td>1</td><td>3</td><td>1</td><td>1</td><td>0</td><td>0,00 DKK</td><td>,00</td><td>0,00 DKK</td></tr>
</table>
EOT

require 'nokogiri'
require 'pp'

doc = Nokogiri::HTML(html)

pp doc.search('//tr')[2].search('td').map{ |n| n.text }

>> ["14-02-2011", "1", "3", "1", "1", "0", "0,00 DKK", ",00", "0,00 DKK"]

Append .search('//tr')[2].search('td').map{ |n| n.text } to Mechanize's agent.page, like so:

agent.page.search('//tr')[2].search('td').map{ |n| n.text }

It's been a while since I played with Mechanize, so it might also be agent.page.parser....


EDIT:

More rows will be added to the table. The row that I want to scrape is always the second-to-last one.

It's important to put that information into your original question. The more accurate your question, the more accurate our answers.
