Will a crawler work on this server configuration?
I am building a small crawler as a hobby project. All I want to do is crawl around a million pages and store them in a database. (Yes, it will be updated from time to time, but the number of entries at any given time will stay around 1 million.) I just want to learn how these things work.
I want to code it in PHP/MySQL. I don't want any search capabilities, as I don't have the server resources to provide that. All I want is to be able to run a few SQL queries on the database myself.
In the database I won't be storing any page text (I want that stored in separate .txt files - I don't know if that will be feasible). Only the title, link and some other information will be stored. So basically, if I run a query and it gives me some results, I can pull the text data from those files.
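To make it concrete, here is roughly the schema I have in mind - the table and column names are just placeholders I made up, and I'm not sure about the exact column types:

```php
<?php
// Rough sketch of the schema I'm imagining -- names and types are placeholders.
// Page text goes into plain .txt files on disk; the table only keeps the
// metadata plus a pointer to the file.
$pdo = new PDO('mysql:host=localhost;dbname=crawler', 'user', 'password');

$pdo->exec("
    CREATE TABLE IF NOT EXISTS pages (
        id          INT UNSIGNED  NOT NULL AUTO_INCREMENT,
        url         VARCHAR(2048) NOT NULL,
        url_hash    CHAR(32)      NOT NULL,      -- MD5 of the URL, for a compact unique index
        title       VARCHAR(255)  DEFAULT NULL,
        fetched_at  DATETIME      DEFAULT NULL,
        text_file   VARCHAR(255)  DEFAULT NULL,  -- path to the .txt file holding the page body
        PRIMARY KEY (id),
        UNIQUE KEY uq_url_hash (url_hash)
    ) ENGINE=InnoDB
");
```

My thinking is to hash the URL so the unique index stays small, rather than indexing the full URL column itself - but I'd welcome corrections on that.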
I would like to know if this design will be feasible in the following environment.
I will be purchasing a VPS from Linode (512 MB RAM). (I can't go for a dedicated server, and shared hosts won't let me do this.)
My question: Will it be able to sustain this big a database (1 million rows), with the ability to run queries in batch mode when required?
Any kind of suggestion is welcome. Other hosting options would also be appreciated.
Writing a web crawler from scratch is a considerable undertaking, at least if you wish to crawl millions of pages. I know this from personal experience working on the Heritrix web crawler.
You may benefit from reading the "Overview of the crawler" chapter from the Heritrix developer guide. That chapter covers the high-level design and should help you figure out the basic components of a crawler.
Simply put, this boils down to 'crawl state' and 'processing'. The crawl state is the URLs you've seen, the URLs you've crawled, and so on, while processing covers fetching a URL and the subsequent work of extracting links, saving the downloaded data, etc. Multiple processing threads are typically run in parallel.
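To make that split concrete, here is a deliberately simplified, single-threaded sketch in PHP (since that's what you plan to use). The variable names and the use of file_get_contents/DOMDocument are my own choices for illustration, not how Heritrix actually does it:

```php
<?php
// Simplified single-threaded sketch of the crawl-state / processing split.
// $frontier = URLs waiting to be fetched, $seen = URLs already discovered (crawl state).
// The loop body is the "processing": fetch, extract links, store the result.
$frontier = ['http://example.com/'];   // seed URL -- placeholder
$seen     = ['http://example.com/' => true];

while ($url = array_shift($frontier)) {
    $html = @file_get_contents($url);  // a real crawler needs timeouts, robots.txt, politeness delays
    if ($html === false) {
        continue;
    }

    // ... save $html to a .txt file and insert the metadata row here ...

    // Link extraction: pull href attributes out of the fetched page.
    $doc = new DOMDocument();
    @$doc->loadHTML($html);
    foreach ($doc->getElementsByTagName('a') as $a) {
        $link = $a->getAttribute('href');
        if (strpos($link, 'http') === 0 && !isset($seen[$link])) {
            $seen[$link] = true;   // update crawl state
            $frontier[]  = $link;  // schedule for processing
        }
    }
}
```

At a million pages you would also want to persist the frontier and the seen set (they won't fit comfortably in 512 MB of RAM as in-memory arrays) and run several such workers in parallel.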
You could also try Scrapy. It's fast, and it'll work fine on a Linode 512M server, but it's written in Python.