Mechanize - Simplest way to check if a page has been updated?

What is the simplest way, using Mechanize, to see whether a page has been updated?

I was thinking about creating a table named pages.

It would have these columns:

pagename - varchar
page - text
pageupdated - boolean

How should I create the screen scraper and save the data in the database? And how do I create a method that compares the HTML stored in the table with the freshly scraped data, to check whether the page has been updated?


Answer updated and tested.

Here's an example using a Page model (and using retryable-rb):

rails generate scaffold Page name:string remote_url:string page:text digest:text page_updated:boolean

####### app/models/page.rb

require 'digest'
require 'mechanize'
require 'retryable'

class Page < ActiveRecord::Base
  include Retryable

  # Scrape page before validation
  before_validation :scrape_content, :if => :remote_url?

  # Will cause save to fail if page could not be retrieved
  validates_presence_of :page, :if => :remote_url?, :message => "URL provided is invalid or inaccessible."

  # Update digest if/when all validations have passed
  before_save :set_digest

  # ...

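  # Re-scrape and save. The before_validation and before_save callbacks
  # already fetch the page and compare digests; running them by hand first
  # would fetch twice and leave page_updated reset to false.
  def update_page!
    self.save!
  end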

  def page_updated?
    self.page_updated
  end

  protected

  def scrape_content
    ua = 'Mozilla/5.0 (Macintosh; Intel Mac OS X) ' + 
         'AppleWebKit/535.1 (KHTML, like Gecko) ' + 
         'Chrome/14.0.835.186 Safari/535.1'

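    # Using retryable, create scraper and get page
    # (URI.encode was removed in Ruby 3.0; on newer Rubies pass an
    # already-escaped URL to get instead)
    scraper = Mechanize.new{ |i| i.user_agent = ua }
    scraped_page = retryable(:times => 3, :sleep => false) do
      scraper.get(URI.encode(self.remote_url))
    end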
    self.page_updated = false
    self.page = scraped_page.content
    self.name ||= scraped_page.title
    # Store a fixed-length SHA-256 fingerprint of the page content
    self.digest ||= Digest::SHA256.hexdigest(self.page)
  end

  def set_digest
    # Create new digest of page content
    new_digest = Digest::SHA256.hexdigest(self.page)

    # If digest has changed, update digest and set flag
    if (new_digest != self.digest) && !self.digest.nil?
      self.digest = new_digest
      self.page_updated = true
    else
      self.page_updated = false
    end

    true
  end

end

I'm fairly sure this is unrelated, but I'm getting a LoadError when I require 'mechanize' in the Rails console and in my test application. I'm not sure what's causing it, but I'll update my answer once I'm able to test this solution successfully.

Make sure you remember to add this to your application's Gemfile:

gem 'mechanize', '2.0.1'
gem 'retryable-rb', '1.1.0'

Usage Example:

p = Page.new(:remote_url => 'http://rubyonrails.org/')
p.save!
p.page_updated? # => false, since page hasn't been updated since creation
p.remote_url = 'http://www.google.com/' # for the sake of example
p.update_page!
p.page_updated? # => true
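
The digest comparison can also be tried outside Rails. Below is a minimal sketch in plain Ruby, assuming SHA-256 from the standard library as the fingerprint. PageTracker, check! and the fetcher lambda are hypothetical names made up for this illustration and are not part of the Page model or of Mechanize; the canned responses stand in for real HTTP requests so the sketch runs offline.

```ruby
require 'digest'

# Hypothetical stand-in for the Page model: tracks one URL's content digest.
class PageTracker
  attr_reader :digest

  # fetcher: any callable that takes a URL and returns the page body as a String.
  def initialize(url, fetcher)
    @url = url
    @fetcher = fetcher
    @digest = nil
    @updated = false
  end

  # Fetch the page and compare its SHA-256 digest with the stored one.
  # Returns true only when a previous digest exists and differs from the new one.
  def check!
    new_digest = Digest::SHA256.hexdigest(@fetcher.call(@url))
    @updated = !@digest.nil? && new_digest != @digest
    @digest = new_digest
    @updated
  end

  def updated?
    @updated
  end
end

# Canned responses instead of real HTTP requests (assumption for the example).
responses = ['<html>v1</html>', '<html>v1</html>', '<html>v2</html>']
fetcher = ->(_url) { responses.shift }

tracker = PageTracker.new('http://example.com/', fetcher)
first  = tracker.check!  # first fetch: nothing to compare against yet
second = tracker.check!  # same body as the previous fetch
third  = tracker.check!  # body changed
puts [first, second, third].inspect  # prints [false, false, true]
```

In a real application the fetcher would presumably wrap the scraper, for example a lambda that calls Mechanize's get and returns the page's content. Storing only the digest keeps the comparison cheap, at the cost of not knowing what changed.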