Rake task for scraping with Rails
I'm starting to write scrapers to get data from different websites. I built the first scraper in a rake file and am now starting to write a second rake file to get data from a second site. For now, I am writing a scraper specific to each site I'm interested in (not trying to build a generic scraper).
I have 3 questions:
1. Is writing rake tasks a good choice for me? Are there alternatives I should consider?
2. How can I add functions/methods to my rake files? (Sorry, a very silly question, but I can't figure out how to structure my code, so for now it's just 500 lines of uninterrupted code in one long method.) For instance, I'd like a "get_description(section)" method that returns the description from the page. The method could be different depending on which site I'm scraping.
3. How can I test my task with RSpec? I'd like to give it a link and make sure the output of my task matches what I expect to get.
Thanks for your help!
As a general principle, rake tasks should be very minimal. Delegate the actual behavior to real classes. These classes can then be easily tested.
Example:
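# The rake task: a thin wrapper and nothing more.
# In a Rails app, depend on :environment so application classes such as Scraper are loaded.
task :scrape => :environment do
  Scraper.scrape!
end

# The class that does the actual work.
class Scraper
  def self.scrape!
    # do something
  end
end

# The spec for that class.
describe Scraper do
  # your tests
end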
You could, as @brad indicated, use Thor, which has a regular class structure of its own, so in theory it should be easier to test the tasks themselves. I haven't done that, though.
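You can define methods in a rake file, but they are defined at the top level, so they end up in the global namespace and are visible to every other rake file that gets loaded. You shouldn't do that, so don't bother. Keep task bodies minimal and write normal code to do the dirty work.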
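To make the "normal code" part concrete for the get_description(section) case from the question, here is a minimal sketch of one way to structure it. The class names (BaseScraper, SiteAScraper, SiteBScraper) and the HTML shapes are made up for illustration, and the regular expressions only stand in for whatever HTML parser you actually use. The shared workflow lives in the base class, and each site overrides only its own parsing methods.

```ruby
# Hypothetical base class: the shared workflow lives here,
# site-specific parsing lives in the subclasses.
class BaseScraper
  # The HTML is passed in, so the class can be exercised without network access.
  def initialize(html)
    @html = html
  end

  # Splits the page into sections and extracts one description per section.
  def descriptions
    sections.map { |section| get_description(section) }
  end

  private

  def sections
    raise NotImplementedError, "#{self.class} must implement #sections"
  end

  def get_description(section)
    raise NotImplementedError, "#{self.class} must implement #get_description"
  end
end

# Assumed markup for the first site: one <article> per section.
class SiteAScraper < BaseScraper
  private

  def sections
    @html.scan(%r{<article>(.*?)</article>}m).flatten
  end

  def get_description(section)
    section[%r{<p class="description">(.*?)</p>}m, 1].to_s.strip
  end
end

# Assumed markup for the second site: one <li> per section.
class SiteBScraper < BaseScraper
  private

  def sections
    @html.scan(%r{<li>(.*?)</li>}m).flatten
  end

  def get_description(section)
    section[%r{<span>(.*?)</span>}m, 1].to_s.strip
  end
end
```

Each rake task then shrinks to fetching the page and calling descriptions on the right class, and the 500-line method gets split into small methods you can test one at a time.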
Sure, Rake is fine if you want to use it. You can also check out Thor, which uses more standard Ruby syntax rather than the DSL Rake provides.
Rake is just another Ruby library, so you can include whatever you like in there. As such, you can write your own library and load it in your rake file. Check out how Bundler does it, for instance: they've defined their own classes, then created tasks inside them. Bundler uses Thor, by the way, which, from what I can gather, somehow proxies those tasks on to Rake. I haven't looked through it thoroughly, though, so I could be wrong.
If you're defining things in your own library, just use RSpec as you normally would for any other project, then hook that library into Rake or Thor by whatever means and you're off to the races.
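As a rough sketch of the "hook that library into Rake" step: the Scraper class and its return value below are placeholders, and example.com is just a stand-in URL. The task is defined with Rake::Task.define_task so the snippet runs as a standalone script; in a real rake file you would normally write it with the task DSL method instead.

```ruby
require "rake"

# Hypothetical library class. In a real project this would live in its own
# file (for example under lib/) and be loaded with require.
class Scraper
  def self.scrape!(url)
    # Placeholder for the real fetching and parsing work.
    "scraped #{url}"
  end
end

# Thin task: it only forwards its argument to the library class.
Rake::Task.define_task(:scrape, [:url]) do |_task, args|
  puts Scraper.scrape!(args[:url])
end

# Invoke the task programmatically, the same way a spec could.
Rake::Task["scrape"].invoke("http://example.com")
```

Since the task body is a single line, nearly all of your specs can target Scraper directly. If you also want to exercise the task from a spec, note that Rake runs a task only once per process unless you call reenable on it before invoking it again.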