Check if a web page is modified / has expired with Ruby
I'm writing a crawler in Ruby, and I want to honour the headers that the server sends out in order to make the crawl more efficient. Is there a straightforward way in Ruby of determining whether a page needs to be re-downloaded by the client? I know I need to consider at least these headers:
- Last-Modified
- ETag
- Cache-Control
- Expires
What's the definitive way of determining this - is it specified anywhere?
You are right about the headers you need to look at, but keep in mind that it is the server that sets them. If they are set correctly, you can use them to make the decision, but none of them are required.
Personally, I would probably start by tracking the Expires value during the initial download, as well as logging the ETag. Then, on the next pass, I'd look at Last-Modified if the Expires or ETag value showed some sign that I might need to re-download (or if they aren't set at all). I wouldn't expect Cache-Control to be all that useful.
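As a rough illustration of that order of checks, here is a minimal sketch. The `CachedPage` struct and the `refetch_decision` and `conditional_headers` helpers are hypothetical names made up for this example, not part of any library. The sketch assumes the crawler stored the raw header strings from the previous fetch, and it treats a missing or malformed Expires value as already expired.

```ruby
require 'time'

# Hypothetical record of the header values stored on the previous fetch.
CachedPage = Struct.new(:expires, :etag, :last_modified, keyword_init: true)

# Decide what to do with a page that was crawled before:
#   :fresh      - stored Expires is still in the future, skip the request
#   :revalidate - ask the server using the stored ETag / Last-Modified
#   :refetch    - nothing usable was stored, download again
def refetch_decision(cached, now = Time.now)
  if cached.expires
    begin
      return :fresh if Time.httpdate(cached.expires) > now
    rescue ArgumentError
      # Malformed value such as "0" or "-1": treat it as already expired.
    end
  end
  return :revalidate if cached.etag || cached.last_modified
  :refetch
end

# Build the request headers for a conditional request from the stored values.
def conditional_headers(cached)
  headers = {}
  headers['If-None-Match'] = cached.etag if cached.etag
  headers['If-Modified-Since'] = cached.last_modified if cached.last_modified
  headers
end
```

When the decision is `:revalidate`, sending the headers from `conditional_headers` with a normal GET lets a well-behaved server answer 304 Not Modified instead of resending the body. Servers that ignore these headers will simply return the full page.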
You'll want to read about the head method in Net::HTTP -- http://www.ruby-doc.org/stdlib/
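For example, a HEAD request with Net::HTTP could look something like the sketch below. The helper names `head_headers` and `page_changed?` are made up for illustration, and the comparison logic is only one possible approach: it assumes that a differing ETag or Last-Modified value means the page changed, and that a page with no validators at all should be downloaded again.

```ruby
require 'net/http'
require 'uri'

# Issue a HEAD request and return only the caching-related response headers.
# Missing headers come back as nil.
def head_headers(url)
  uri = URI.parse(url)
  Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
    response = http.head(uri.request_uri)
    {
      'etag'          => response['etag'],
      'last-modified' => response['last-modified'],
      'expires'       => response['expires'],
      'cache-control' => response['cache-control']
    }
  end
end

# Compare the headers stored on the previous crawl with freshly fetched ones.
# Prefers ETag, falls back to Last-Modified, and assumes "changed" otherwise.
def page_changed?(old_headers, new_headers)
  if old_headers['etag'] && new_headers['etag']
    return old_headers['etag'] != new_headers['etag']
  end
  if old_headers['last-modified'] && new_headers['last-modified']
    return old_headers['last-modified'] != new_headers['last-modified']
  end
  true # no validators on one side or the other, so download again
end
```

A crawler would store the hash returned by `head_headers` alongside the page, then call `page_changed?(stored, head_headers(url))` on the next pass before deciding whether to issue the full GET.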