Best practices for using URL as a database key
I'm going to be writing a crawler, storing results in a database (MongoDB).
Of course, using the URL as one possible query parameter is important. But, it's also problematic:
- URLs can be very long, and MongoDB has a finite maximum key length
- There are lots of content synonyms, and you don't know this by crawling just one page.
- What to do for HTTP 301, 302, 303, 307, etc. Store the original URL or the new location? This is especially an issue for link shorteners.
- "The last.fm" problem. lastfm.com == last.fm ~= lastfm.it (etc.) and the site doesn't use a 30x result code to indicate this. It just serves the content from multiple domains.
Goals for this database:
- Given any URL that may or may not be in the database, let me query to find out if I've previously crawled that document, with reasonable accuracy.
Of course, any scheme other than "just go crawl it and store the exact URL not worrying about duplicates" will have some amount of false positives. A false positive would be a URL that I think is the same as one previously crawled, but is actually different.
I think by default, an indexed key can be something like 1000 bytes (the documented index key limit is 1024 bytes in the MongoDB versions that enforce one). Are you really going to have URLs larger than that? If worst comes to worst, I'm pretty sure this is a hardcoded constant that you could change.
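If some URLs do turn out to be longer than the limit, one way around it is to index a fixed-length hash of the URL and keep the full URL as an ordinary field. This is only a minimal sketch: the choice of SHA-256, the helper names `url_key` and `make_doc`, and the `_id`/`url` field layout are my assumptions, not anything MongoDB requires.

```python
import hashlib


def url_key(url: str) -> str:
    """Return a fixed-length (64 hex chars) digest of the URL to use as the indexed key."""
    return hashlib.sha256(url.encode("utf-8")).hexdigest()


def make_doc(url: str) -> dict:
    """Build a hypothetical document: the short digest is the key, the full URL is stored beside it."""
    return {"_id": url_key(url), "url": url}


# A URL far longer than any plausible key limit still yields a 64-character key.
long_url = "http://example.com/search?q=" + "a" * 5000
doc = make_doc(long_url)
```

To look a URL up, hash it the same way and query on the digest. The key length then no longer depends on the URL length, at the cost of not being able to do prefix queries on the key.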
On your other points:
There are lots of content synonyms, and you don't know this by crawling just one page. - Huh? Do you mean that a site might be duplicated, with only subtle differences in content focused around key phrases, and you want to avoid indexing those?
What to do for HTTP 301, 302, 303, 307, etc. Store the original URL or the new location? This is especially an issue for link shorteners. - I would store the destinations. What if someone has shortened the same destination multiple times? What if the shortened link expires, or the shortener is taken offline? I would think those are far more likely than the same thing happening to the destination URL.
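A rough sketch of keying on the destination: follow the redirects the crawler has recorded until there are no more, and use the final URL as the key. The `redirects` mapping, the function name `resolve_destination`, and the `max_hops` cap are all hypothetical; in a real crawler the mapping would come from the `Location` headers you actually observed.

```python
def resolve_destination(url, redirects, max_hops=10):
    """Follow recorded redirects starting at url.

    Returns (final_url, chain), where chain lists every URL visited in order.
    Stops on a redirect loop or after max_hops hops.
    """
    chain = [url]
    seen = {url}
    while url in redirects and len(chain) <= max_hops:
        url = redirects[url]
        if url in seen:  # redirect loop: keep the last new URL as the result
            break
        seen.add(url)
        chain.append(url)
    return chain[-1], chain


# Hypothetical data: two different short links that point at the same page.
redirects = {
    "http://short.example/a": "http://example.com/page",
    "http://tiny.example/b": "http://example.com/page",
}
final_a, chain_a = resolve_destination("http://short.example/a", redirects)
final_b, chain_b = resolve_destination("http://tiny.example/b", redirects)
```

Both short links resolve to the same destination, so they end up under one key. Keeping the chain around as a list of aliases would still let you answer "have I seen this short link?" later.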
"The last.fm" problem. lastfm.com == last.fm ~= lastfm.it (etc.) and the site doesn't use a 30x result code to indicate. It just serves the content from multiple domains. - Could you write a simple algorithm to check domains that might be similar? Last.fm contains 6/9 of the characters lastfm.com does, and the first 6 are identical. If you were to also store a bit of meta data, you could check to see if a match with a high level of relevance may be an identical document.
Given any URL that may or may not be in the database, let me query to find out if I've previously crawled that document, with reasonable accuracy. - See the previous point.
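Here is a minimal sketch of the kind of domain similarity check I mean, using Python's standard `difflib`. The function names and the 0.75 threshold are assumptions picked for illustration; a match only flags a candidate, and you would still compare stored metadata before treating two pages as the same document.

```python
from difflib import SequenceMatcher


def domain_similarity(a: str, b: str) -> float:
    """Similarity ratio in [0, 1] between two host names, ignoring dots and case."""
    a_flat = a.lower().replace(".", "")
    b_flat = b.lower().replace(".", "")
    return SequenceMatcher(None, a_flat, b_flat).ratio()


def similar_hosts(host, known_hosts, threshold=0.75):
    """Return the known hosts whose similarity to host meets the threshold, best match first."""
    scored = [(domain_similarity(host, known), known) for known in known_hosts]
    return [known for score, known in sorted(scored, reverse=True) if score >= threshold]


# Hypothetical set of hosts already in the database.
known = ["lastfm.com", "lastfm.it", "example.org"]
candidates = similar_hosts("last.fm", known)
```

With these inputs the two lastfm hosts come back as candidates and example.org does not. Expect false positives from a check this crude, since unrelated sites can have similar names, which is why the metadata comparison matters.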
Hope this helps!