Will a crawler work on this server configuration?
I am building a small crawler as a hobby project. All I want to do is crawl around a million pages and store them in a database. (Yes, it will be updated from time to time, but the number of entries at any given time will stay around 1 million.) I just want to learn how these things work.
I want to code it in PHP/MySQL. I don't want any search capabilities, as I don't have the server resources to provide that. All I want is to be able to run a few SQL queries on the database myself.
In the database I won't be storing any page text (I want that stored in separate .txt files - I don't know if that will be feasible). Only the title, link and some other information will be stored. So basically, if I run a query and it gives me some results, I can pull the text data from those files.
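To make it concrete, here is roughly the schema I have in mind - the table and column names are just placeholders I made up, and I'm not sure about the exact column types:

```php
<?php
// Rough sketch of the schema I'm imagining -- names and types are placeholders.
// Page text goes into plain .txt files on disk; the table only keeps the
// metadata plus a pointer to the file.
$pdo = new PDO('mysql:host=localhost;dbname=crawler', 'user', 'password');

$pdo->exec("
    CREATE TABLE IF NOT EXISTS pages (
        id          INT UNSIGNED  NOT NULL AUTO_INCREMENT,
        url         VARCHAR(2048) NOT NULL,
        url_hash    CHAR(32)      NOT NULL,      -- MD5 of the URL, for a compact unique index
        title       VARCHAR(255)  DEFAULT NULL,
        fetched_at  DATETIME      DEFAULT NULL,
        text_file   VARCHAR(255)  DEFAULT NULL,  -- path to the .txt file holding the page body
        PRIMARY KEY (id),
        UNIQUE KEY uq_url_hash (url_hash)
    ) ENGINE=InnoDB
");
```

My thinking is to hash the URL so the unique index stays small, rather than indexing the full URL column itself - but I'd welcome corrections on that.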
I would like to know if this design will be feasible in the following environment.
I will be purchasing a VPS from Linode (512 MB RAM). (I can't go for a dedicated server, and shared hosts won't let me do this.)
My question: Will it be able to sustain this big a database (1 million rows), with the ability to run queries in batch mode when required?
Any kind of suggestion is welcome. Other hosting options would also be appreciated.
Writing a web crawler from scratch is a considerable undertaking, at least if you wish to crawl millions of pages. I know this from personal experience working on the Heritrix web crawler.
You may benefit from reading the "Overview of the crawler" chapter from the Heritrix developer guide. That chapter covers the high-level design and should help you figure out the basic components of a crawler.
Simply put, this boils down to 'crawl state' and 'processing'. The crawl state is the URLs you've seen, the URLs you've crawled, and so on, while processing covers fetching a URL and the subsequent work of extracting links, saving the downloaded data, etc. Multiple processing threads are typically run in parallel.
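To make that split concrete, here is a deliberately simplified, single-threaded sketch in PHP (since that's what you plan to use). The variable names and the use of file_get_contents/DOMDocument are my own choices for illustration, not how Heritrix actually does it:

```php
<?php
// Simplified single-threaded sketch of the crawl-state / processing split.
// $frontier = URLs waiting to be fetched, $seen = URLs already discovered (crawl state).
// The loop body is the "processing": fetch, extract links, store the result.
$frontier = ['http://example.com/'];   // seed URL -- placeholder
$seen     = ['http://example.com/' => true];

while ($url = array_shift($frontier)) {
    $html = @file_get_contents($url);  // a real crawler needs timeouts, robots.txt, politeness delays
    if ($html === false) {
        continue;
    }

    // ... save $html to a .txt file and insert the metadata row here ...

    // Link extraction: pull href attributes out of the fetched page.
    $doc = new DOMDocument();
    @$doc->loadHTML($html);
    foreach ($doc->getElementsByTagName('a') as $a) {
        $link = $a->getAttribute('href');
        if (strpos($link, 'http') === 0 && !isset($seen[$link])) {
            $seen[$link] = true;   // update crawl state
            $frontier[]  = $link;  // schedule for processing
        }
    }
}
```

At a million pages you would also want to persist the frontier and the seen set (they won't fit comfortably in 512 MB of RAM as in-memory arrays) and run several such workers in parallel.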
You could also try Scrapy. It's fast, and it'll work fine on a Linode 512M server, but it's written in Python.