Parse a Webpage in Ruby to retrieve URLs from it
I want to parse a webpage and retrieve the first few embedded URLs under certain headers using Ruby. For example, I have a document archive in which documents are stored as doc-type.timestamp.ext, and I want to pull out all documents of the same type.
The best solution I found so far is this: What is the best way to parse a web page in Ruby?
Is there any way I can do this without using Hpricot or other such packages?
Thanks!
Why do you not want to use an external gem? Gems can make your life a lot easier. Take a look at this Mechanize example, which quickly outputs every link on a page:
require 'rubygems'
require 'mechanize'

# Create an agent that identifies itself as Safari on a Mac
a = Mechanize.new { |agent|
  agent.user_agent_alias = 'Mac Safari'
}

# Fetch the page and print every link it contains
a.get('http://google.com/') do |page|
  p page.links
end
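If you only need the links that point at a particular document type, Mechanize's links_with can filter them by href. A minimal sketch, assuming a hypothetical archive page where documents are linked as report.&lt;timestamp&gt;.pdf:

a.get('http://example.com/archive/') do |page|
  # Keep only links whose href matches the doc-type.timestamp.ext pattern
  report_links = page.links_with(:href => /report\.\d+\.pdf/)
  report_links.first(5).each { |link| puts link.href }
end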
I've been scraping a lot lately, and you cannot get very far without actually parsing the page. I use Nokogiri with plain net/http, but I will switch to Mechanize in the future; Mechanize uses Nokogiri internally anyway.
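For completeness, here is roughly what the Nokogiri + net/http approach looks like. This is a sketch under assumptions: the URL, the header text ('Reports'), and the page structure (a list of links following an h2) are all hypothetical placeholders you would adjust to your archive page:

require 'net/http'
require 'uri'
require 'nokogiri'

# Fetch the raw HTML with the standard library
html = Net::HTTP.get(URI.parse('http://example.com/archive/'))
doc  = Nokogiri::HTML(html)

# Find a header by its text, then take the first few links after it
header = doc.at_xpath("//h2[contains(text(), 'Reports')]")
hrefs  = header.xpath('following-sibling::ul[1]//a/@href').take(5)
puts hrefs.map(&:to_s)

And if you truly want zero gems, the standard library's URI.extract will at least pull URLs out of the raw HTML, though it knows nothing about which header they sit under:

# Crude, gem-free fallback: every http(s) URL in the page, in order
urls = URI.extract(html, ['http', 'https'])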