Best practices for using URL as a database key

2023-02-24 11:23

I'm going to be writing a crawler, storing results in a database (MongoDB).

Of course, using the URL as one possible query parameter is important, but it's also problematic:

URLs can be very long, and MongoDB has a finite maximum key length.
There are lots of content synonyms, and you don't know this by crawling just one page.
What to do for HTTP 301, 302, 303, 307, etc. Store the original URL or the new location? This is especially an issue for link shorteners.
"The last.fm" problem. lastfm.com == last.fm ~= lastfm.it (etc.) and the site doesn't use a 30x result code to indicate this. It just serves the content from multiple domains.

Goals for this database:

Given any URL that may or may not be in the database, let me query to find out whether I've crawled that document before, with reasonable accuracy.

Of course, any scheme other than "just go crawl it and store the exact URL not worrying about duplicates" will have some amount of false positives. A false positive would be a URL that I think is the same as one previously crawled, but is actually different.

I think by default, your key can be something like 1000 bytes. Are you really going to have URLs longer than that? If worst comes to worst, I'm pretty sure this is a hardcoded constant that you could change.
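If some URLs do turn out to exceed the limit, one option (not something MongoDB does for you) is to fall back to a fixed-length digest of the URL. A minimal sketch, assuming the rough 1000-byte figure above; `MAX_KEY_BYTES` and `url_key` are hypothetical names, and the real limit should be checked against the documentation for your MongoDB version:

```python
import hashlib

# Assumed limit, taken from the "something like 1000 bytes" estimate above.
# Verify the actual value for your MongoDB version before relying on it.
MAX_KEY_BYTES = 1000


def url_key(url: str, max_bytes: int = MAX_KEY_BYTES) -> str:
    """Return the URL itself if it fits in the key, otherwise a digest of it."""
    encoded = url.encode("utf-8")
    if len(encoded) <= max_bytes:
        return url
    # A SHA-256 hex digest is always 64 characters, whatever the input length.
    return "sha256:" + hashlib.sha256(encoded).hexdigest()
```

The same URL always maps to the same key, so lookups still work. The full URL would then need to be stored in a separate, unindexed field, since a digest cannot be reversed.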

On your other points:

There are lots of content synonyms, and you don't know this by crawling just one page. - Huh? Do you mean that a site might be duplicated with only subtle differences in content, centered on key phrases, and that you want to avoid indexing the duplicates?

What to do for HTTP 301, 302, 303, 307, etc. Store the original URL or the new location? This is especially an issue for link shorteners. - I would store the destination. What if someone has shortened the same destination multiple times? What if the shortened link expires, or the shortener is taken offline? I would think those are far more likely than the same thing happening to the destination URL.
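To make "store the destination" concrete, here is a rough sketch that follows a redirect chain to its end. The `fetch` callable is a hypothetical stand-in for whatever HTTP client the crawler uses; it is assumed to return the status code and the `Location` header (or `None`). The hop limit and the inclusion of 308 are my own assumptions:

```python
from typing import Callable, Optional, Tuple
from urllib.parse import urljoin

REDIRECT_CODES = {301, 302, 303, 307, 308}

# Hypothetical fetcher: takes a URL, returns (status code, Location header or None).
Fetcher = Callable[[str], Tuple[int, Optional[str]]]


def resolve_destination(url: str, fetch: Fetcher, max_hops: int = 10) -> str:
    """Follow redirects and return the last URL reached."""
    seen = {url}
    for _ in range(max_hops):
        status, location = fetch(url)
        if status not in REDIRECT_CODES or not location:
            return url  # not a redirect: this is the destination
        next_url = urljoin(url, location)  # Location may be relative
        if next_url in seen:
            return url  # redirect loop: stop where we are
        seen.add(next_url)
        url = next_url
    return url  # hop limit reached
```

Keying on the returned URL means several short links to the same page collapse into one record. The original short URL can still be kept as a secondary field if you want to look it up later.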

"The last.fm" problem. lastfm.com == last.fm ~= lastfm.it (etc.) and the site doesn't use a 30x result code to indicate this. It just serves the content from multiple domains. - Could you write a simple algorithm to check for domains that might be similar? Ignoring the dots, last.fm contains 6 of the 9 characters in lastfm.com, and those first 6 are identical. If you also stored a bit of metadata, you could check whether a match with a high level of relevance is an identical document.

Given any URL that may or may not be in the database, let me query to find out if I've previously crawled that document before, with reasonable accuracy. - See the previous point.
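For the last.fm comparison, a minimal sketch of the kind of check I mean: strip the dots, count the shared leading characters, and divide by the length of the longer name. `domain_similarity` is a hypothetical helper and the scoring rule is only one of many possible choices:

```python
def domain_similarity(a: str, b: str) -> float:
    """Shared-prefix length divided by the longer name's length, ignoring dots."""
    a = a.lower().replace(".", "")
    b = b.lower().replace(".", "")
    if not a or not b:
        return 0.0
    prefix = 0
    for x, y in zip(a, b):
        if x != y:
            break
        prefix += 1
    return prefix / max(len(a), len(b))
```

With last.fm and lastfm.com this gives 6/9, or about 0.67. A high score only flags a candidate pair. Unrelated sites can share a prefix, so the stored metadata would still have to confirm that the documents really match.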

Hope this helps!

Tags: database-design http url-shortener
