Getting all the domains a page depends on using Nokogiri
I'm trying to get all of the domains / IP addresses that a particular page depends on, using Nokogiri. It can't be perfect because JavaScript may load dependencies dynamically, but I'm happy with a best effort at getting:
- Image URLs: <img src="...">
- JavaScript URLs: <script src="...">
- CSS files and any url(...) references inside the CSS
- Frames and iframes
I'd also want to follow any CSS imports.
Any suggestions / help would be appreciated. The project is already using Anemone.
Here's what I have at the moment.
Anemone.crawl(site, :depth_limit => 1) do |anemone|
  anemone.on_every_page do |page|
    # Images, scripts, frames and iframes all reference their
    # dependency via a src attribute (inline scripts have none, hence the guard)
    page.doc.xpath('//img | //script | //frame | //iframe').each do |node|
      process_dependency(page, node[:src]) if node[:src]
    end
    # Stylesheets, favicons etc. hang off <link href="...">
    page.doc.xpath('//link').each do |node|
      process_dependency(page, node[:href]) if node[:href]
    end
    puts page.url
  end
end
Code would be great but I'm really just after pointers, e.g. I have now discovered that I should use a CSS parser like css_parser to parse out any CSS to find imports and URLs to images.
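For the CSS side, a rough stdlib-only pass can pull out url(...) references and @import targets with regexes before reaching for a full parser. This is only a sketch (the css_parser gem handles comments, escapes and media queries far more robustly); the helper name css_dependencies is my own:

```ruby
# Naive scan of a stylesheet string for external dependencies.
# Returns the unique list of url(...) targets and @import targets.
def css_dependencies(css)
  # url("bg.png"), url('bg.png') or url(bg.png)
  urls = css.scan(/url\(\s*['"]?([^'")\s]+)['"]?\s*\)/i).flatten
  # @import "print.css"; or @import url("print.css");
  imports = css.scan(/@import\s+(?:url\(\s*)?['"]?([^'")\s;]+)/i).flatten
  (urls + imports).uniq
end
```

Any @import you find would then be fetched and fed back through the same function to follow the import chain.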
Get the content of the page, then you can extract an array of URIs from the page with
require 'uri'
URI.extract(page)
After that it's just a matter of using a regular expression to parse each link and extract the domain name.
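Rather than a handwritten regex, URI.parse can give you the host directly. A minimal sketch combining both steps (the helper name extract_hosts is my own; URI.extract scans the raw string, so it also catches URLs a DOM walk would miss, e.g. inside inline CSS):

```ruby
require 'uri'

# Pull every http(s) URL out of a blob of HTML/text and return the
# unique list of hostnames, skipping anything URI can't parse.
def extract_hosts(content)
  URI.extract(content, %w[http https]).map { |url|
    begin
      URI.parse(url).host
    rescue URI::InvalidURIError
      nil
    end
  }.compact.uniq
end
```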