
Compare URIs for a search bot?

For a search bot, I am working on a design to:

* compare URIs and

* determine which URIs are really the same page

Dealing with redirects and aliases:

Case 1: Redirects

Case 2: Aliases e.g. www

Case 3: URL parameters and fragments, e.g. sukshma.net/node#parameter

I have two approaches I could follow. One approach is to explicitly check for redirects, which catches case #1. Another approach is to "hard-code" aliases such as www, which works for case #2. The second approach (hard-coded aliases) is brittle: the HTTP specification (RFC 2616) says nothing about www being an alias.
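For case #1, instead of hard-coding anything, the bot can follow the redirect chain itself and treat the final destination as the page's identity. A minimal sketch of that idea; the in-memory `responses` table is a hypothetical stand-in for real HTTP fetches (a real bot would issue a HEAD/GET request and read the `Location` header):

```python
# Sketch: resolve a redirect chain to its final URL (case #1).
# `responses` maps a URL to a (status_code, location) pair and stands
# in for actual HTTP requests.

REDIRECT_CODES = {301, 302, 303, 307, 308}

def resolve_redirects(url, responses, max_hops=10):
    """Follow 3xx responses until a non-redirect response is reached."""
    seen = set()
    for _ in range(max_hops):
        if url in seen:  # guard against redirect loops
            raise ValueError("redirect loop at %s" % url)
        seen.add(url)
        status, location = responses.get(url, (200, None))
        if status in REDIRECT_CODES and location:
            url = location
        else:
            return url
    raise ValueError("too many redirects")

# Two different URLs that redirect to the same page resolve equal:
responses = {
    "http://example.com/old": (301, "http://example.com/new"),
    "http://example.com/moved": (302, "http://example.com/new"),
}
```

With this, two URIs are "the same page" for case #1 whenever their resolved URLs compare equal.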

I also intend to use the canonical meta tag (`<link rel="canonical">` in HTML, or the equivalent HTTP `Link` header), but if I understand it correctly, I cannot rely on the tag being present in all cases.
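When the tag is present, it can be extracted with a plain HTML parser; a sketch using Python's standard library, where a return value of `None` means "no canonical URL declared" (consistent with the point above that the tag is optional):

```python
from html.parser import HTMLParser

class CanonicalFinder(HTMLParser):
    """Collect the href of <link rel="canonical"> if the page has one."""
    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        if tag == "link":
            d = dict(attrs)
            if (d.get("rel") or "").lower() == "canonical":
                self.canonical = d.get("href")

def find_canonical(html):
    parser = CanonicalFinder()
    parser.feed(html)
    return parser.canonical  # None when the tag is absent
```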

Do share your own experience. Do you know of a reference white paper or implementation for detecting duplicates in search bots?


Building your own web crawler is a lot of work. Consider checking out some of the open source spiders already available, like JSpider, OpenWebSpider or many others.


The first case would be solved by simply checking the HTTP status code.

For the 2nd and 3rd cases Wikipedia explains it very well: URL Normalization / Canonicalization.
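The standard normalization steps from that article (lowercase the scheme and host, drop the default port, sort query parameters, discard the fragment) are straightforward to sketch with `urllib.parse`; dropping the fragment also handles case #3, since fragments are never sent to the server:

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

DEFAULT_PORTS = {"http": 80, "https": 443}

def normalize(url):
    """Normalize a URL so equivalent forms compare equal."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    # Keep the port only when it is not the scheme's default.
    if parts.port and parts.port != DEFAULT_PORTS.get(scheme):
        host = "%s:%d" % (host, parts.port)
    path = parts.path or "/"
    # Sort query parameters so ordering differences don't matter.
    query = urlencode(sorted(parse_qsl(parts.query)))
    # Discard the fragment (case #3): it never reaches the server.
    return urlunsplit((scheme, host, path, query, ""))
```

Comparing `normalize(a) == normalize(b)` then treats `HTTP://Sukshma.net:80/node#parameter` and `http://sukshma.net/node` as the same page.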
