Parse a Webpage in Ruby to retrieve URLs from it
I want to parse a webpage and retrieve the first few embedded URLs under certain headers using Ruby. For example, I have a document archive in which documents are stored as doc-type.timestamp.ext, and I want to pull out all documents of the same type.
The best solution I found so far is this: What is the best way to parse a web page in Ruby?
Is there any way I can do this without using Hpricot or other such packages?
Thanks!
Why do you not want to use an external gem? Gems can make your life a lot easier. Take a look at this Mechanize example, which quickly outputs every link on a page:
require 'rubygems'
require 'mechanize'

# Create an agent that identifies itself as Safari on a Mac
a = Mechanize.new { |agent|
  agent.user_agent_alias = 'Mac Safari'
}

# Fetch the page and print every link it contains
a.get('http://google.com/') do |page|
  p page.links
end
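If you only need the links that point at a particular document type, Mechanize's links_with can filter them by href. A minimal sketch, assuming a hypothetical archive page where documents are linked as report.&lt;timestamp&gt;.pdf:

a.get('http://example.com/archive/') do |page|
  # Keep only links whose href matches the doc-type.timestamp.ext pattern
  report_links = page.links_with(:href => /report\.\d+\.pdf/)
  report_links.first(5).each { |link| puts link.href }
end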
I've been scraping a lot lately, and you cannot get very far without actually parsing the page. I use Nokogiri with plain net/http, but I will switch to Mechanize in the future; Mechanize uses Nokogiri internally anyway.
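For completeness, here is roughly what the Nokogiri + net/http approach looks like. This is a sketch under assumptions: the URL, the header text ('Reports'), and the page structure (a list of links following an h2) are all hypothetical placeholders you would adjust to your archive page:

require 'net/http'
require 'uri'
require 'nokogiri'

# Fetch the raw HTML with the standard library
html = Net::HTTP.get(URI.parse('http://example.com/archive/'))
doc  = Nokogiri::HTML(html)

# Find a header by its text, then take the first few links after it
header = doc.at_xpath("//h2[contains(text(), 'Reports')]")
hrefs  = header.xpath('following-sibling::ul[1]//a/@href').take(5)
puts hrefs.map(&:to_s)

And if you truly want zero gems, the standard library's URI.extract will at least pull URLs out of the raw HTML, though it knows nothing about which header they sit under:

# Crude, gem-free fallback: every http(s) URL in the page, in order
urls = URI.extract(html, ['http', 'https'])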