Best database design for a web crawler
Many database systems are suitable for use with a web crawler, but is there any database system specifically designed for web crawlers (in .NET)?
My experience says that a web crawler has many parts and services, and each part needs some specific features. For example, to cache web pages we need something like the FILESTREAM feature of SQL Server, and to check whether a URL already exists in the database, the best choice is memcached.
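To make the "URL already exists" part concrete, here is a minimal sketch of that check against SQL Server with plain ADO.NET. The CrawledUrls table and Url column are assumptions made up for the sketch (a unique index on Url would guard against duplicate inserts from concurrent crawler threads); the same lookup could instead be done against memcached with a client library such as Enyim.Caching.

    using System;
    using System.Data.SqlClient;

    class UrlStore
    {
        private readonly string _connectionString;

        public UrlStore(string connectionString)
        {
            _connectionString = connectionString;
        }

        // Returns true if the URL is already recorded in the CrawledUrls table.
        // Table and column names are illustrative assumptions, not a real schema.
        public bool HasSeen(string url)
        {
            using (var conn = new SqlConnection(_connectionString))
            {
                conn.Open();
                var cmd = new SqlCommand(
                    "SELECT COUNT(*) FROM CrawledUrls WHERE Url = @url", conn);
                cmd.Parameters.AddWithValue("@url", url);
                return (int)cmd.ExecuteScalar() > 0;
            }
        }

        // Records a URL so later checks see it; a unique index on Url would
        // reject duplicates inserted by concurrent crawler threads.
        public void MarkAsSeen(string url)
        {
            using (var conn = new SqlConnection(_connectionString))
            {
                conn.Open();
                var cmd = new SqlCommand(
                    "INSERT INTO CrawledUrls (Url) VALUES (@url)", conn);
                cmd.Parameters.AddWithValue("@url", url);
                cmd.ExecuteNonQuery();
            }
        }
    }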
In fact, I have two questions:
1) What are the best database systems to work with a web crawler?
2) Is there any database system that covers all of these features?
FYI, to my knowledge Google is not using any relational database engine; rather, they have a proprietary file system, GFS, and their own data persistence abstractions on top of it.
Who told you that memcached is the best choice? Consider that if the amount of data is very large, you would run out of memory, unless of course you have a big data center and are able to spread the data across many machines' memory.
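To put rough, purely illustrative numbers on that: a frontier of one billion URLs at an average of about 80 bytes per URL is already on the order of 80 GB of raw strings, before any per-key overhead the cache adds, so the working set quickly outgrows the RAM of a single commodity machine. (Both figures are assumptions for the sake of the arithmetic, not measurements.)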
I think it is not about the best choice; the best is probably Google's, and they have built most of their stack in house.
If you can settle for a very good (but still not the best) solution, I think engines like SQL Server, Oracle, MySQL and many others could perform well; it depends more on how you use them and how you architect your solution.
Google uses a column-oriented database, Bigtable, built on top of GFS (Google File System), to store its crawler results as well as data for Google Docs and other Google products. Their design is by far the best I know of.
Apache HBase is similar in implementation to Bigtable; HBase is built on top of HDFS (Hadoop Distributed File System).
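For crawl data, the Bigtable paper's "Webtable" example keys each row by the URL with its hostname components reversed (e.g. com.cnn.www/index.html), so pages from the same domain sort next to each other; HBase schemas often borrow the same idea. A minimal sketch of building such a row key (the helper name is made up for illustration):

    using System;
    using System.Linq;

    static class RowKeys
    {
        // Builds a Bigtable/HBase-style row key by reversing the host name
        // components, so "http://www.cnn.com/index.html" becomes
        // "com.cnn.www/index.html" and pages of one domain cluster together.
        public static string ForUrl(string url)
        {
            var uri = new Uri(url);
            string reversedHost = string.Join(".", uri.Host.Split('.').Reverse());
            return reversedHost + uri.PathAndQuery;
        }
    }

    // Example: RowKeys.ForUrl("http://www.cnn.com/index.html")
    // returns "com.cnn.www/index.html"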