Compare URIs for a search bot?
For a search bot, I am working on a design to:
* compare URIs, and
* determine which URIs are really the same page

Dealing with redirects and aliases:
* Case 1: Redirects
* Case 2: Aliases, e.g. www
* Case 3: URL parameters, e.g. sukshma.net/node#parameter

I have two approaches I could follow. One approach is to explicitly check for redirects, which catches case #1. The other is to "hard code" aliases such as www, which works in case #2. The second approach (hard-coded aliases) is brittle: the URL specification for HTTP does not mention the use of www as an alias (RFC 2616).
I also intend to use the canonical meta-tag (HTTP/HTML), but if I understand it correctly, I cannot rely on the tag being present in all cases.
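To make it concrete, this is the kind of standard-library Python sketch I have in mind for reading the tag, falling back to other checks when it is absent (class and function names are mine):

```python
# Sketch: extract <link rel="canonical"> from an HTML page using only
# the standard library. Returns None when the tag is absent, since
# pages are not required to provide it.
from html.parser import HTMLParser

class CanonicalLinkParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        if tag == "link" and self.canonical is None:
            attrs = dict(attrs)
            if attrs.get("rel", "").lower() == "canonical":
                self.canonical = attrs.get("href")

def find_canonical(html_text):
    parser = CanonicalLinkParser()
    parser.feed(html_text)
    return parser.canonical  # None if the page omits the tag
```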
Please share your own experience. Do you know of a reference white paper or implementation for detecting duplicates in search bots?
Building your own web crawler is a lot of work. Consider checking out some of the open source spiders already available, like JSpider, OpenWebSpider or many others.
The first case can be solved by simply checking the HTTP status code: a 3xx response indicates a redirect, and the Location header gives the target.
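For instance, a minimal Python sketch (helper name hypothetical) that surfaces the 3xx status instead of silently following it:

```python
# Sketch: detect a redirect by suppressing urllib's automatic
# redirect handling, so the 3xx status is visible to the caller.
import urllib.request
import urllib.error

class NoRedirect(urllib.request.HTTPRedirectHandler):
    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None  # do not follow; let the 3xx raise as HTTPError

def resolve_redirect(url):
    opener = urllib.request.build_opener(NoRedirect)
    try:
        opener.open(url)
        return url  # 2xx: no redirect, the URL stands as-is
    except urllib.error.HTTPError as e:
        if 300 <= e.code < 400:
            return e.headers.get("Location")  # the real destination
        raise
```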
For the 2nd and 3rd cases, Wikipedia explains it very well: URL Normalization / Canonicalization.
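As a rough illustration, here is a Python sketch of a few common normalization rules; note that the www-stripping step is a site-specific heuristic, not part of any RFC, and port handling is omitted for brevity:

```python
# Sketch of URL normalization for duplicate detection. These rules
# are common heuristics, not a standard.
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def normalize(url):
    parts = urlsplit(url)
    host = parts.hostname.lower() if parts.hostname else ""
    if host.startswith("www."):            # heuristic alias rule (Case 2)
        host = host[len("www."):]
    # Sort query parameters so ?a=1&b=2 and ?b=2&a=1 compare equal.
    query = urlencode(sorted(parse_qsl(parts.query)))
    # Drop the fragment: it never reaches the server (Case 3).
    return urlunsplit((parts.scheme.lower(), host, parts.path or "/", query, ""))
```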