Mechanize - Simplest way to check if a page has been updated?
What is the simplest solution with Mechanize to see if a page has been updated?
I was thinking about creating a table named pages.
That would have:
pagename - varchar
page - text
pageupdated - boolean
How should I create the screen scraper and save the data in the database? And how do I create a method that compares the HTML in the table with the scraped data, to check whether the page has been updated?
Answer updated and tested.
Here's an example using a Page model (and using retryable-rb):
rails generate scaffold Page name:string remote_url:string page:text digest:text page_updated:boolean
####### app/models/page.rb
require 'digest'
require 'mechanize'
require 'retryable'

class Page < ActiveRecord::Base
  include Retryable

  # Scrape page before validation
  before_validation :scrape_content, :if => :remote_url?

  # Will cause save to fail if page could not be retrieved
  validates_presence_of :page, :if => :remote_url?, :message => "URL provided is invalid or inaccessible."

  # Update digest if/when all validations have passed
  before_save :set_digest

  # ...

  # Re-scrape the remote page and persist the result. The callbacks above
  # run the scrape and the digest comparison, so a save is all that is needed;
  # calling them by hand first would make the second pass reset the flag.
  def update_page!
    self.save!
  end

  def page_updated?
    self.page_updated
  end

  protected

  def scrape_content
    ua = 'Mozilla/5.0 (Macintosh; Intel Mac OS X) ' +
         'AppleWebKit/535.1 (KHTML, like Gecko) ' +
         'Chrome/14.0.835.186 Safari/535.1'

    # Using retryable, create scraper and get page
    scraper = Mechanize.new { |i| i.user_agent = ua }
    scraped_page = retryable(:times => 3, :sleep => false) do
      scraper.get(URI.encode(self.remote_url))
    end

    self.page_updated = false
    self.page = scraped_page.content
    self.name ||= scraped_page.title
    # First scrape only: store a SHA-256 fingerprint as the baseline digest
    self.digest ||= Digest::SHA256.hexdigest(self.page)
  end

  def set_digest
    # Create new digest (SHA-256 fingerprint) of page content
    new_digest = Digest::SHA256.hexdigest(self.page)

    # If digest has changed, update digest and set flag
    if (new_digest != self.digest) && !self.digest.nil?
      self.digest = new_digest
      self.page_updated = true
    else
      self.page_updated = false
    end

    # Return true so the before_save callback never halts the save
    true
  end
end
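For readers who want to see what the retry step in scrape_content amounts to, here is a minimal plain-Ruby sketch. The helper name with_retries and its parameters are hypothetical and are not part of retryable-rb; it only approximates the idea of "try up to three times, no sleeping in between" and uses a simulated failure instead of a real Mechanize request.

```ruby
# Hypothetical helper (not the retryable-rb API): run the block up to
# `times` attempts, retrying whenever an error of class `errors` is raised.
def with_retries(times: 3, errors: StandardError)
  attempts = 0
  begin
    attempts += 1
    yield attempts
  rescue errors
    # Try again while attempts remain; otherwise re-raise the last error.
    retry if attempts < times
    raise
  end
end

# Simulated flaky fetch: fails on the first two attempts, then succeeds.
calls = 0
result = with_retries(times: 3) do |attempt|
  calls = attempt
  raise IOError, "temporary failure" if attempt < 3
  "<html>ok</html>"
end
```

If every attempt fails, the last exception propagates to the caller, so the surrounding code still has to decide what an unreachable page means.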
I'm fairly sure this is an unrelated matter, but I seem to be encountering a LoadError when trying to require 'mechanize' in rails console and in my test application. I'm not sure what's causing this, but I'll update my answer when I'm able to successfully test this solution.
Make sure you remember to add this to your application's Gemfile:
gem 'mechanize', '2.0.1'
gem 'retryable-rb', '1.1.0'
Usage Example:
p = Page.new(:remote_url => 'http://rubyonrails.org/')
p.save!
p.page_updated? # => false, since page hasn't been updated since creation
p.remote_url = 'http://www.google.com/' # for the sake of example
p.update_page!
p.page_updated? # => true
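The same baseline-then-change behavior can be tried without Rails or a network connection. The sketch below is only an illustration: TrackedPage and its record method are hypothetical names, content is passed in as a string instead of being scraped, and SHA-256 is an assumed choice of digest.

```ruby
require 'digest'

# Hypothetical in-memory stand-in for the Page model: content is handed in
# directly rather than scraped, so neither Rails nor Mechanize is required.
class TrackedPage
  attr_reader :digest, :page_updated

  def initialize
    @digest = nil
    @page_updated = false
  end

  # Store a new snapshot and flag whether it differs from the previous one.
  def record(html)
    new_digest = Digest::SHA256.hexdigest(html)
    # The first snapshot only establishes a baseline, so it is not an update.
    @page_updated = !@digest.nil? && new_digest != @digest
    @digest = new_digest
    @page_updated
  end
end

page = TrackedPage.new
first  = page.record("<html>v1</html>") # => false (baseline)
second = page.record("<html>v1</html>") # => false (unchanged)
third  = page.record("<html>v2</html>") # => true  (content changed)
```

Comparing fixed-length digests instead of full HTML keeps the stored value small, though any byte-level difference in the page (a timestamp, a rotating ad) will count as an update.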