
Parse a Webpage in Ruby to retrieve URLs from it

I want to parse a webpage and retrieve the first few embedded URLs under certain headers using Ruby. For example, I have a document archive in which documents are stored as doc-type.timestamp.ext, and I want to pull out all documents of the same type.

The best solution I found was this: What is the best way to parse a web page in Ruby?

Is there any way I can do this without using Hpricot and other such packages?

Thanks!


Why do you not want to use an external gem? They can make your life a lot easier. Take a look at this Mechanize example, which quickly outputs every link on the page:

require 'rubygems'
require 'mechanize'

# Create an agent that identifies itself as Safari on a Mac.
a = Mechanize.new { |agent|
  agent.user_agent_alias = 'Mac Safari'
}

# Fetch the page and print every link it contains.
a.get('http://google.com/') do |page|
  p page.links
end
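For your document-archive use case, you can filter the returned links with a regular expression. Here is a minimal sketch, assuming a hypothetical archive URL and a hypothetical naming scheme such as invoice.20110101.pdf (substitute your own doc-type and extension):

require 'rubygems'
require 'mechanize'

agent = Mechanize.new

agent.get('http://example.com/archive/') do |page|
  # Keep only links whose href matches doc-type.timestamp.ext;
  # the invoice/pdf pattern here is just an illustrative assumption.
  invoices = page.links.select { |link| link.href =~ /\Ainvoice\.\d+\.pdf\z/ }

  # Print the first few matches, as the question asks.
  invoices.first(5).each { |link| puts link.href }
end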

I've been scraping a lot lately, and you cannot get very far without parsing the page. I use Nokogiri with plain net/http but will switch to Mechanize in the future; Mechanize uses Nokogiri internally as well.
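For reference, here is a minimal sketch of that Nokogiri plus net/http combination, assuming a hypothetical archive URL. It fetches the page with the standard library and prints the href of every anchor tag:

require 'rubygems'
require 'net/http'
require 'uri'
require 'nokogiri'

# Fetch the raw HTML with the standard library...
uri = URI.parse('http://example.com/archive/')
html = Net::HTTP.get_response(uri).body

# ...then let Nokogiri handle the parsing.
doc = Nokogiri::HTML(html)
doc.css('a[href]').each do |anchor|
  puts anchor['href']
end

You could filter these hrefs with the same regular expression as in the Mechanize sketch above; the only difference is that you get plain strings instead of Mechanize link objects.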
