Web Crawler Application

Can anyone recommend a website crawler that can show me all of the links in my site?


W3C has the best one I've found

http://validator.w3.org/checklink


Xenu is the best link checker tool I have found. It will check all links and then give you the option to view them or export them. It is free; you can download it from their site: http://home.snafu.de/tilman/xenulink.html


As long as you are the owner of the site (i.e. you have all the files), Adobe Dreamweaver can generate a report of all your internal and external hyperlinks, and report broken links as well as orphan files. But you have to set up your site in Dreamweaver first.


If you need to do any post-processing of the links, I'd recommend any of the many variants of Mechanize.

In Ruby:

require "rubygems"
require "mechanize"
require "addressable/uri"

processed_links = []
unprocessed_links = ["http://example.com/"] # bootstrap list
agent = Mechanize.new # the WWW:: namespace was dropped in Mechanize 1.0
until unprocessed_links.empty?
  # This could take a while, and depending on your site,
  # it may be an infinite loop.  Adjust accordingly.
  current = unprocessed_links.shift
  processed_links << current
  agent.get(current) do |page|
    page.links.each do |link|
      next if link.href.nil?
      # Resolve relative hrefs against the page URL before filtering
      link_uri = Addressable::URI.join(page.uri.to_s, link.href).normalize
      next unless link_uri.host == "example.com" # ignore external links
      url = link_uri.to_str
      # Don't re-queue pages we've already visited or already queued
      unprocessed_links << url unless processed_links.include?(url) || unprocessed_links.include?(url)
    end
  end
end

Something to that effect.
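If all you need is the filtering step, the same-host check can also be done with Ruby's standard `uri` library, no gems required. A minimal sketch; the `extract_internal_links` helper and the sample hrefs are my own illustration, not part of any library:

```ruby
require "uri"

# Resolve each href against the page URL and keep only links on our host.
# This mirrors the filtering step of the crawl loop above, minus the HTTP.
def extract_internal_links(page_url, hrefs, host)
  base = URI.parse(page_url)
  hrefs.map { |href|
    begin
      uri = base.merge(href)
    rescue URI::Error
      next nil # skip malformed hrefs
    end
    uri.to_s if uri.host == host
  }.compact.uniq
end

hrefs = ["/about", "contact.html", "http://example.com/blog", "http://other.net/x"]
puts extract_internal_links("http://example.com/index.html", hrefs, "example.com")
```

Relative hrefs like `contact.html` are resolved against the page URL by `URI#merge`, so they end up on the same host and survive the filter, while `http://other.net/x` is dropped.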


Larbin ... takes a little C++ coding, but it is a fast, solid web crawler foundation and can be used for basically everything from link walking to indexing to data acquisition.
