Check if web page is modified / has expired with Ruby

I'm writing a crawler in Ruby, and I want to honour the headers that the server sends out in order to make the crawl more efficient. Is there a straightforward way in Ruby of determining whether a page needs to be re-downloaded by the client? I know I need to consider at least these headers:

  • Last-Modified
  • ETag
  • Cache-Control
  • Expires

What's the definitive way of determining this - is it specified anywhere?


You are right about the headers you will need to look at, but keep in mind that it is the server that sets them. If they are set correctly, you can use them to make the decision, but none of them are required, so a response may arrive with some or all of them missing.

Personally, I would probably start by tracking the Expires value as I do the initial download, as well as logging the ETag. On the next pass I'd then look at Last-Modified, assuming the Expires or ETag values showed some sign that I might need to re-download (or if they aren't even set). I wouldn't expect Cache-Control to be all that useful.
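As a rough illustration of that bookkeeping, here is a minimal sketch that assumes the crawler stores each page's response headers in a Hash with lowercase keys, along with the time of the download. The helper names `fresh?` and `conditional_headers` are hypothetical, not part of any library. The sketch also checks `max-age` from Cache-Control, which is an addition to the approach described above, and it covers only a small part of the HTTP caching rules.

```ruby
require 'time'

# Hypothetical helper: is the stored copy still fresh according to the
# headers saved from the previous download?
# headers    - Hash of response headers, keys assumed to be lowercase
# fetched_at - Time of the previous download
def fresh?(headers, fetched_at, now = Time.now)
  cache_control = headers['cache-control'].to_s.downcase
  return false if cache_control =~ /no-cache|no-store/

  # max-age takes precedence over Expires when both are present.
  if (m = cache_control.match(/max-age=(\d+)/))
    return (now - fetched_at) < m[1].to_i
  end

  if headers['expires']
    begin
      return now < Time.httpdate(headers['expires'])
    rescue ArgumentError
      return false # unparseable date (e.g. "0"): treat as already expired
    end
  end

  false # no freshness information at all: ask the server again
end

# Hypothetical helper: request headers for a conditional GET, built from
# whichever validators the server sent last time.
def conditional_headers(headers)
  out = {}
  out['If-None-Match'] = headers['etag'] if headers['etag']
  out['If-Modified-Since'] = headers['last-modified'] if headers['last-modified']
  out
end
```

If `fresh?` returns false, sending the request with `conditional_headers` lets the server reply with 304 Not Modified instead of the full body, provided it supports validators.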


You'll want to read about the head method in Net::HTTP -- http://www.ruby-doc.org/stdlib/
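For example, a HEAD request with the saved validators could look roughly like the sketch below. The helper names `build_head_request` and `changed?` are made up for illustration, and the sketch assumes the server honours conditional requests; many do not, in which case it simply reports the page as changed.

```ruby
require 'net/http'
require 'uri'

# Hypothetical helper: builds a HEAD request carrying the validators saved
# from the previous crawl (either may be nil).
def build_head_request(uri, etag: nil, last_modified: nil)
  req = Net::HTTP::Head.new(uri)
  req['If-None-Match'] = etag if etag
  req['If-Modified-Since'] = last_modified if last_modified
  req
end

# Hypothetical helper: sends the request and returns true unless the server
# answers 304 Not Modified.
def changed?(uri, **validators)
  res = Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
    http.request(build_head_request(uri, **validators))
  end
  !res.is_a?(Net::HTTPNotModified)
end

# Usage (performs a network call, so it is left commented out):
# changed?(URI('http://example.com/'), etag: '"abc123"')
```

A HEAD response has no body, so this costs one round trip but very little bandwidth; the alternative is a conditional GET, which returns the new content in the same request when the page has changed.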
