The best way to store a large set of URLs for a crawler
I'm writing a custom-built crawler and need to know whether a specific URL has already been crawled, so I won't add the same URL twice. Right now I'm using MySQL to store a hash value of each URL. But I'm wondering if this may become very slow if I have a large set of URLs, say, hundreds of millions.
Are there other ways to store URLs? Do people use Lucene to do this? Or is there a specific data structure for this?
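For reference, here is a minimal Java/JDBC sketch of the scheme the question describes (the crawled_urls table and the UrlSeenStore class are hypothetical names, not from the question). A unique key over the hash makes the "seen before?" check a single INSERT IGNORE:

    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;
    import java.sql.Connection;
    import java.sql.PreparedStatement;

    // assumes: CREATE TABLE crawled_urls (url_hash BINARY(20) PRIMARY KEY)
    public final class UrlSeenStore {
        private final Connection conn;

        public UrlSeenStore(Connection conn) {
            this.conn = conn;
        }

        /** Returns true if the URL was new (and is now recorded), false if seen before. */
        public boolean markIfNew(String url) throws Exception {
            byte[] hash = MessageDigest.getInstance("SHA-1")
                    .digest(url.getBytes(StandardCharsets.UTF_8));
            // INSERT IGNORE reports 0 affected rows when the hash already exists.
            try (PreparedStatement ps = conn.prepareStatement(
                    "INSERT IGNORE INTO crawled_urls (url_hash) VALUES (?)")) {
                ps.setBytes(1, hash);
                return ps.executeUpdate() == 1;
            }
        }
    }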
You have not specified your development platform, but there is a really good data structure called a trie (http://en.wikipedia.org/wiki/Trie), and there are lots of implementations in Java, C++, C#, ...
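A minimal in-memory sketch in Java (the UrlTrie class is my own illustration, not a library type):

    import java.util.HashMap;
    import java.util.Map;

    public final class UrlTrie {
        private static final class Node {
            final Map<Character, Node> children = new HashMap<>();
            boolean terminal; // true if a stored URL ends at this node
        }

        private final Node root = new Node();

        /** Inserts the URL; returns false if it was already present. */
        public boolean add(String url) {
            Node node = root;
            for (int i = 0; i < url.length(); i++) {
                node = node.children.computeIfAbsent(url.charAt(i), c -> new Node());
            }
            if (node.terminal) {
                return false; // already crawled
            }
            node.terminal = true;
            return true;
        }
    }

The appeal for URLs is that shared prefixes such as "http://www." are stored only once, which saves space over a plain hash set when many URLs come from the same hosts.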
You may want to try Berkeley DB.
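A sketch of how that might look with the Berkeley DB Java Edition API (the SeenUrls class and the "urls" database name are hypothetical):

    import java.io.File;
    import java.nio.charset.StandardCharsets;

    import com.sleepycat.je.Database;
    import com.sleepycat.je.DatabaseConfig;
    import com.sleepycat.je.DatabaseEntry;
    import com.sleepycat.je.Environment;
    import com.sleepycat.je.EnvironmentConfig;
    import com.sleepycat.je.OperationStatus;

    public final class SeenUrls {
        private final Database db;

        public SeenUrls(File dir) {
            EnvironmentConfig envConfig = new EnvironmentConfig();
            envConfig.setAllowCreate(true);
            Environment env = new Environment(dir, envConfig);
            DatabaseConfig dbConfig = new DatabaseConfig();
            dbConfig.setAllowCreate(true);
            this.db = env.openDatabase(null, "urls", dbConfig);
        }

        /** Returns true if the URL was new, false if it was already recorded. */
        public boolean markIfNew(String url) {
            DatabaseEntry key = new DatabaseEntry(url.getBytes(StandardCharsets.UTF_8));
            DatabaseEntry value = new DatabaseEntry(new byte[0]);
            // putNoOverwrite returns KEYEXIST instead of replacing an existing key.
            return db.putNoOverwrite(null, key, value) == OperationStatus.SUCCESS;
        }
    }

Since Berkeley DB is an embedded disk-backed key-value store, the set can grow well beyond RAM, which fits the "hundreds of millions of URLs" scale in the question.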
A late answer, but you can use an in-memory store such as memcached.
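A rough sketch of the idea using the spymemcached Java client (MemcachedSeenSet is a hypothetical name; memcached's add command only stores a key if it is absent, which doubles as the "already seen?" test):

    import java.net.InetSocketAddress;
    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;

    import net.spy.memcached.MemcachedClient;

    public final class MemcachedSeenSet {
        private final MemcachedClient client;

        public MemcachedSeenSet(String host, int port) throws Exception {
            this.client = new MemcachedClient(new InetSocketAddress(host, port));
        }

        /** Returns true if the URL was new. */
        public boolean markIfNew(String url) throws Exception {
            // memcached keys are limited to 250 bytes, so store a hex hash of the URL.
            String key = toHex(MessageDigest.getInstance("SHA-1")
                    .digest(url.getBytes(StandardCharsets.UTF_8)));
            // add() succeeds only when the key is absent; exp=0 means "never expire".
            return client.add(key, 0, "1").get();
        }

        private static String toHex(byte[] bytes) {
            StringBuilder sb = new StringBuilder(bytes.length * 2);
            for (byte b : bytes) sb.append(String.format("%02x", b));
            return sb.toString();
        }
    }

One caveat: memcached is volatile and evicts entries under memory pressure, so on its own it cannot guarantee a URL is never re-crawled; it works better as a fast front cache over a persistent store.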