Crawlers that crawl websites with authentication

Are there any open source crawlers that can crawl websites requiring authentication (username/password) to log in? I need to crawl my college website to index the documents hosted there. Any help is appreciated.


None that I know of, and if there were, your systems administrator would probably not allow it.

You could look at examples of basic crawlers and write one yourself, though...


You can write a script based on PHP/libcurl or Ruby/Curb. The website's authentication is most likely cookie-based, and the curl library provides the functionality to store and send cookies from your program.

I don't know which language you prefer (PHP or Ruby). If you are using Ruby, you can write code like the example below:

require 'curb'
require 'uri'

curl = Curl::Easy.new
curl.url = 'http://example.com/login/page'
# Read cookies from and save cookies to the same file, like a browser profile.
curl.enable_cookies = true
curl.cookiefile = '/tmp/cookie'
curl.cookiejar = '/tmp/cookie'
# Replace the placeholder credentials with your own.
form_field = URI.encode_www_form('username' => 'yourname', 'password' => 'yourpwd')
curl.http_post(form_field)

The file '/tmp/cookie' is used to store and read cookies, just as a browser does. The cookie is what makes authenticated requests possible.

The 'form_field' contains the website's username and password, but some sites require additional fields. You should inspect the website's login form to find out which fields must be POSTed.
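As a quick way to discover which fields a login form POSTs, you can fetch the login page's HTML and list every input name. The HTML below is a made-up example; in practice you would download the real page first (for instance with Curl::Easy and its body_str), and the field names will differ per site.

```ruby
# Hypothetical login form HTML -- in practice, fetch this from the site.
login_html = <<~HTML
  <form action="/login" method="post">
    <input type="text" name="username">
    <input type="password" name="password">
    <input type="hidden" name="csrf_token" value="abc123">
    <input type="submit" value="Log in">
  </form>
HTML

# Collect the name attribute of every <input> element in the form.
field_names = login_html.scan(/<input[^>]*\bname="([^"]+)"/).flatten
puts field_names.inspect  # => ["username", "password", "csrf_token"]
```

Note the hidden csrf_token field: many sites reject a POST that omits such tokens, so include every field the form defines, not just the username and password.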

