Best database design for a web crawler

Many DB systems are suitable for use with a web crawler, but is there any DB system specifically developed for web crawlers (in .NET)?

My experience says that a web crawler has many parts and services, and each part needs some specific features. For example, to cache web pages we need something like the FILESTREAM feature of SQL Server, and to check whether a URL already exists in the DB, the best choice seems to be memcached.
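To make that last point concrete, here is a rough sketch of the kind of URL-seen check I mean. The cache is just an in-process dictionary standing in for memcached, the DB lookup is abstracted behind a delegate, and all names are illustrative:

```csharp
using System;
using System.Collections.Concurrent;

// Hypothetical sketch: a fast cache (stand-in for memcached) in front
// of a persistent URL table, so most lookups never touch the database.
class UrlSeenFilter
{
    // Stand-in for a distributed cache such as memcached.
    private readonly ConcurrentDictionary<string, bool> _cache =
        new ConcurrentDictionary<string, bool>();

    // Stand-in for a lookup against a persistent URL table.
    private readonly Func<string, bool> _existsInDb;

    public UrlSeenFilter(Func<string, bool> existsInDb) => _existsInDb = existsInDb;

    public bool HasSeen(string url)
    {
        if (_cache.ContainsKey(url)) return true;  // fast path: cache hit
        if (_existsInDb(url))                      // slow path: hit the DB
        {
            _cache[url] = true;                    // warm the cache for next time
            return true;
        }
        return false;
    }

    public void MarkSeen(string url) => _cache[url] = true;
}
```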

In fact, I have two questions:

1) What are the best DB systems to work with a web crawler?

2) Is there any DB system that covers all of these features?


FYI, to my knowledge Google is not using any relational database engine; rather, they have a proprietary file system, GFS, and their own data persistence abstractions.

Who told you that memcached is the best choice? Consider that if the amount of data gets really BIG you would run out of memory, unless of course you have a big data center and are able to spread the data across machines in memory...
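If memory is the concern, one common alternative (my suggestion, not something from the question) is a Bloom filter: a fixed-size bit array that answers "have I probably seen this URL?" with a tunable false-positive rate and no false negatives. A minimal sketch, with purely illustrative sizing:

```csharp
using System;
using System.Collections;

// Minimal Bloom filter sketch for URL de-duplication with bounded memory.
// 2^24 bits is only 2 MB; sizes and hash count here are illustrative.
class BloomFilter
{
    private readonly BitArray _bits;
    private readonly int _hashCount;

    public BloomFilter(int bitCount = 1 << 24, int hashCount = 4)
    {
        _bits = new BitArray(bitCount);
        _hashCount = hashCount;
    }

    // Second, independent hash (FNV-1a) so we can derive k hashes
    // from two base hashes: h(i) = h1 + i * h2.
    private static int Fnv1a(string s)
    {
        unchecked
        {
            uint hash = 2166136261;
            foreach (char c in s) { hash ^= c; hash *= 16777619; }
            return (int)hash;
        }
    }

    private int Index(string item, int i)
    {
        unchecked
        {
            int h1 = item.GetHashCode();
            int h2 = Fnv1a(item);
            return (int)((uint)(h1 + i * h2) % (uint)_bits.Length);
        }
    }

    public void Add(string item)
    {
        for (int i = 0; i < _hashCount; i++) _bits[Index(item, i)] = true;
    }

    public bool MightContain(string item)
    {
        for (int i = 0; i < _hashCount; i++)
            if (!_bits[Index(item, i)]) return false;
        return true; // "probably seen": false positives possible, never false negatives
    }
}
```

Because a hit only means "probably seen", you would still confirm against persistent storage before skipping a URL for good; the filter just keeps the vast majority of lookups away from the database.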

I don't think it is about the best choice; the best is probably Google's, and they have done most of their things in-house.

If you can live with being very good but not the absolute best, I think engines like SQL Server, Oracle, MySQL and many others could perform well; it depends more on how you use them and how you architect your solution.
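To illustrate the "it depends on how you use them" point: with SQL Server, for instance, you can let a primary key enforce URL uniqueness and treat a duplicate-key error as "already seen", instead of doing a separate lookup first. A minimal sketch, assuming the classic SqlClient ADO.NET provider and a hypothetical table CrawlUrls with Url as its primary key:

```csharp
using System;
using System.Data.SqlClient;

// Sketch, assuming: CREATE TABLE CrawlUrls (Url nvarchar(450) PRIMARY KEY)
class UrlStore
{
    private readonly string _connectionString;

    public UrlStore(string connectionString) => _connectionString = connectionString;

    // Returns true if the URL is new, false if it was already stored.
    // The primary key does the duplicate detection in one round trip.
    public bool TryAdd(string url)
    {
        using (var conn = new SqlConnection(_connectionString))
        using (var cmd = new SqlCommand(
            "INSERT INTO CrawlUrls (Url) VALUES (@url);", conn))
        {
            cmd.Parameters.AddWithValue("@url", url);
            conn.Open();
            try
            {
                cmd.ExecuteNonQuery();
                return true;                        // newly inserted
            }
            catch (SqlException ex) when (ex.Number == 2627 || ex.Number == 2601)
            {
                return false;                       // duplicate key: already seen
            }
        }
    }
}
```

(2627 is SQL Server's primary-key violation error, 2601 the duplicate-unique-index one; catching them is cheaper than a SELECT-then-INSERT round trip and avoids the race between the two.)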


Google uses a column-oriented database, Bigtable, to store its crawler results, and also for Google Docs and other Google products; it is built on top of GFS (the Google File System). Their design is by far the best I know of.

Apache HBase is similar in implementation to Bigtable; HBase is built on top of HDFS (the Hadoop Distributed File System).
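For a flavor of that design: the Bigtable paper's "webtable" keys rows by the page URL with its hostname components reversed, so that pages from the same domain sort next to each other in the table. A minimal sketch of that row-key transform:

```csharp
using System;

// Bigtable-style row key: maps.google.com/index.html
// becomes com.google.maps/index.html, clustering a domain's pages.
static class RowKey
{
    public static string ForUrl(string url)
    {
        var uri = new Uri(url);
        var hostParts = uri.Host.Split('.');
        Array.Reverse(hostParts);                 // "maps.google.com" -> "com.google.maps"
        return string.Join(".", hostParts) + uri.PathAndQuery;
    }
}

// RowKey.ForUrl("http://maps.google.com/index.html")
//   => "com.google.maps/index.html"
```

The same trick works for an HBase row key, since HBase also sorts rows lexicographically by key.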
