Construct URLs after scraping for image paths

Posted: 2023-03-03 10:31

I'm trying to scrape a web page at a URL entered by the user and then output an array of valid, non-broken image elements with absolute paths in HTML. I'm using Nokogiri for scraping. Is there anything I can use to easily combine the unpredictable URLs provided by users with the image paths I scrape, short of writing something from scratch?

Examples:

http://domain.com/ and /system/images/image.png
=> http://domain.com/system/images/image.png

http://sub.domain.com and images/common/image.png
=> http://sub.domain.com/images/common/image.png

http://domain.com/dir/ and images/image.png
=> http://domain.com/dir/images/image.png

http://domain.com/dir and /images/small/image.png
=> http://domain.com/images/small/image.png

http://domain.com and http://s3.amazon-aws.com/bucket/image.png
=> http://s3.amazon-aws.com/bucket/image.png

Instead of downloading the pages and using Nokogiri, I would recommend using Mechanize. It is built on top of Nokogiri, so everything you can do with Nokogiri you can do with Mechanize, but it adds a lot of useful functionality for scraping/navigating. It will take care of the relative URL problem you describe above.

require 'mechanize'

url = 'http://stackoverflow.com/questions/5903218/construct-urls-after-scraping-for-image-paths/5903417'
# image_urls returns the page's image sources as absolute URLs
Mechanize.new.get(url) { |page| puts page.image_urls.join("\n") }

If you really want to do it yourself (instead of using Mechanize, say), use URI::join:

require 'uri'
URI::join("http://domain.com/dir", "/images/small/image.png")
  # => http://domain.com/images/small/image.png
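
As a quick check, the sketch below runs URI.join over the five base/path pairs from the question. The `pairs`, `results` and `no_slash` names are only illustrative. It also shows one case the question does not list: a base without a trailing slash has its last path segment treated as a file name, so that segment is dropped when a relative path is joined onto it.

```ruby
require 'uri'

# Base URL / scraped path pairs taken from the examples in the question.
pairs = [
  ['http://domain.com/',    '/system/images/image.png'],
  ['http://sub.domain.com', 'images/common/image.png'],
  ['http://domain.com/dir/', 'images/image.png'],
  ['http://domain.com/dir', '/images/small/image.png'],
  ['http://domain.com',     'http://s3.amazon-aws.com/bucket/image.png']
]

# URI.join returns a URI object; to_s turns it back into a string.
results = pairs.map { |base, path| URI.join(base, path).to_s }
puts results

# Without a trailing slash, "dir" is replaced rather than kept as a directory.
no_slash = URI.join('http://domain.com/dir', 'images/image.png').to_s
puts no_slash # => http://domain.com/images/image.png
```

If user input such as "http://domain.com/dir" is meant to be a directory, it may be worth appending the slash yourself before joining.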

Note that if the HTML page has a BASE tag, relative paths have to be resolved against its href rather than against the URL the page was fetched from.
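
A minimal sketch of how the BASE tag could be taken into account, using only URI from the standard library. The helper name `absolute_image_url`, its parameters, and the `cdn.domain.com` host are hypothetical, not something from the question or from any library. It assumes you have already pulled the BASE href out of the document yourself (with Nokogiri, something like `doc.at_css('base')` followed by `['href']` should do it) and pass nil when there is none.

```ruby
require 'uri'

# Hypothetical helper, not part of Nokogiri or Mechanize.
#   page_url  - the URL the page was fetched from
#   base_href - href of the page's BASE tag, or nil if the page has none
#   src       - the scraped src attribute of an img element
def absolute_image_url(page_url, base_href, src)
  base = page_url
  # The BASE href may itself be relative, so resolve it against the page URL.
  unless base_href.nil? || base_href.strip.empty?
    base = URI.join(page_url, base_href).to_s
  end
  URI.join(base, src).to_s
end

page = 'http://domain.com/dir/page.html'

puts absolute_image_url(page, nil, 'images/image.png')
# => http://domain.com/dir/images/image.png
puts absolute_image_url(page, 'http://cdn.domain.com/assets/', 'images/image.png')
# => http://cdn.domain.com/assets/images/image.png
puts absolute_image_url(page, '/static/', 'images/image.png')
# => http://domain.com/static/images/image.png
```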
