
The best way to store a large set of URLs for a crawler

I'm writing a custom-built crawler, and I need to know whether a specific URL has already been crawled, so I won't add the same URL twice. Right now I'm using MySQL to store a hash value for each URL. But I'm wondering whether this will become very slow once I have a large set of URLs, say, hundreds of millions.
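Here's a minimal sketch of the approach just described, assuming Python with the mysql-connector-python package; the `seen_urls` table name and connection details are placeholders for illustration:

```python
import hashlib

import mysql.connector  # assumes the mysql-connector-python package

# Connection parameters are placeholders.
conn = mysql.connector.connect(
    host="localhost", user="crawler", password="secret", database="crawl"
)
cur = conn.cursor()

# Store a fixed-width SHA-1 digest instead of the raw URL; the primary
# key enforces uniqueness, so duplicates are rejected by the database.
cur.execute(
    """
    CREATE TABLE IF NOT EXISTS seen_urls (
        url_hash BINARY(20) PRIMARY KEY
    )
    """
)

def mark_seen(url: str) -> bool:
    """Return True if the URL was new, False if it was already stored."""
    digest = hashlib.sha1(url.encode("utf-8")).digest()
    # INSERT IGNORE skips rows that would violate the primary key.
    cur.execute("INSERT IGNORE INTO seen_urls (url_hash) VALUES (%s)", (digest,))
    conn.commit()
    return cur.rowcount == 1
```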

Are there other ways to store URLs? Do people use Lucene for this? Or is there a specific data structure for it?


You have not specified your development platform, but there is a really good data structure for this called a trie (http://en.wikipedia.org/wiki/Trie), and there are lots of implementations in Java, C++, C#, and other languages.
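A minimal character-level trie sketch in Python (the class and method names here are illustrative, not from any particular library):

```python
class Trie:
    """Character-level trie for exact-membership tests on strings."""

    def __init__(self):
        self.root = {}

    def add(self, url: str) -> bool:
        """Insert a URL; return True if it was new, False if already present."""
        node = self.root
        for ch in url:
            node = node.setdefault(ch, {})
        if "$" in node:          # "$" marks end-of-URL
            return False
        node["$"] = True
        return True

trie = Trie()
print(trie.add("http://example.com/a"))  # True: first time seen
print(trie.add("http://example.com/a"))  # False: duplicate
```

A trie also saves space when many URLs share long common prefixes (the same scheme, host, and path segments), since each shared prefix is stored only once.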


You may want to try BerkeleyDB, an embedded key-value store that handles datasets much larger than RAM.
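A minimal sketch using the third-party bsddb3 Python bindings for Berkeley DB (the file name is arbitrary, and the package availability depends on your platform):

```python
import bsddb3  # third-party Python bindings for Berkeley DB

# Open (or create) an on-disk hash table; keys and values are bytes.
db = bsddb3.hashopen("seen_urls.db", "c")

def mark_seen(url: str) -> bool:
    """Return True if the URL was new, False if already recorded."""
    key = url.encode("utf-8")
    if key in db:
        return False
    db[key] = b"1"
    return True

print(mark_seen("http://example.com/"))  # True on first call
print(mark_seen("http://example.com/"))  # False afterwards
db.close()
```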


It's a bit late, but you could also use an in-memory store such as memcached.
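Keep in mind that memcached holds everything in RAM and can evict entries under memory pressure, so this is best-effort deduplication rather than durable storage. A minimal sketch with the pymemcache client (the server address and key scheme are assumptions):

```python
import hashlib

from pymemcache.client.base import Client  # assumes the pymemcache package

client = Client(("localhost", 11211))  # server address is a placeholder

def mark_seen(url: str) -> bool:
    """Return True if the URL was new, False if already recorded."""
    key = hashlib.sha1(url.encode("utf-8")).hexdigest()
    # add() stores the key only if it does not already exist; noreply=False
    # makes it return whether the store actually happened.
    return client.add(key, b"1", noreply=False)

print(mark_seen("http://example.com/"))  # True on first call
print(mark_seen("http://example.com/"))  # False afterwards
```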
