Web Crawler Algorithm: depth?

I'm working on a crawler and need to understand exactly what is meant by "link depth". Take Nutch, for example: http://wiki.apache.org/nutch/NutchTutorial

depth indicates the link depth from the root page that should be crawled.

So, say I have the domain www.domain.com and want to crawl to a depth of, say, 3 -- what do I need to do? If a site could be represented as a binary tree, then it wouldn't be a problem, I think.


Link depth means the number of "hops" a page is away from the root, where a "hop" means following a link on a page. The reason Nutch has this restriction is that links very "far away" from the main page are unlikely to hold much information (the main page will link to the most important information, so the farther you get, the more detailed the information you find), while there can be very many of them, so they take up lots of storage space, computing time for ranking, and bandwidth.

Nutch thus uses an algorithm scheme known as depth-limited search to bound its running time and space usage. If it didn't use this heuristic, it would have to crawl an entire site to rank all the pages in it and find the top N.

To crawl to depth 3, implement this algorithm and give it a depth bound of three. The nice thing about depth-limited search is that it's a variant of depth-first search (DFS), so it's quite space-efficient:

def depth_limited_crawl(page, d):
    if d == 0:
        return
    # do something with page here, e.g. store it
    for l in links(page):  # links(page) stands for whatever extracts the links on a page
        depth_limited_crawl(l, d - 1)

And no, a site cannot in general be represented as a binary tree; it's a directed graph. If you somehow remove the backlinks, then it becomes a multiway tree. Either way, many sites are too large to store for your crawler.


  1. Make a list that you use as a queue.
  2. Append www.domain.com with depth 0 to it.
  3. Pull the first element off it.
  4. The current depth is the element's depth + 1.
  5. Crawl that site.
  6. Append each link on the site to the queue (with the current depth) if the current depth isn't greater than the maximum depth.
  7. If the list isn't empty, go back to step 3 (see the sketch below).
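
A minimal Python sketch of that queue-based (breadth-first) loop, assuming a hypothetical fetch_links(url) helper that downloads a page and returns the links found on it:

from collections import deque

def crawl_bfs(root_url, max_depth, fetch_links):
    # Each queue entry is (url, depth); the root starts at depth 0.
    queue = deque([(root_url, 0)])
    seen = {root_url}  # avoids re-queuing pages that link back to each other
    while queue:
        url, depth = queue.popleft()
        # crawl/process the page behind `url` here
        if depth >= max_depth:
            continue  # don't follow links any deeper than the maximum depth
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))

The seen set is what keeps the loop from running forever when pages link back to each other (the link structure can contain cycles).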


I guess the "depth" is the number of times the crawler "follows a link".

Say you start from the root page. You follow each of the links on this page: this is depth 1. For each of the target pages, you follow the links: this is depth 2, etc.

Note that there may be "cycles" while following links. The structure is not a tree, but a graph.


Link depth means how many links you have to follow before you reach a given link.

Example: example.com links to example.com/foo.html, which links to google.com. Therefore google.com has a link depth of 2 relative to example.com, as you can reach it by following 2 links.

To crawl example.com to a depth of 3, you would follow links to a maximum depth of 3 and then stop following links. Without this restriction, you could easily go on forever.

Example: example.com links to example.com/foo.html and example.com/bar.html. You follow those two links, the links on those pages, and the links one level further, then stop there.

Note: The root page has a depth of 0.
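
To make that concrete, here is a small Python sketch that computes link depth over a hardcoded link graph built from the URLs in the example (the graph is purely illustrative):

from collections import deque

# Illustrative link graph from the example above
link_graph = {
    "example.com": ["example.com/foo.html", "example.com/bar.html"],
    "example.com/foo.html": ["google.com"],
    "example.com/bar.html": [],
    "google.com": [],
}

def link_depth(root, target, graph):
    # Minimum number of links to follow from root to reach target (None if unreachable)
    queue = deque([(root, 0)])
    seen = {root}
    while queue:
        url, depth = queue.popleft()
        if url == target:
            return depth
        for link in graph.get(url, []):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return None

print(link_depth("example.com", "google.com", link_graph))  # prints 2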


A web site's root is at depth 0. Documents you can reach by using links in the root are at depth 1. Documents you can in turn reach from links in documents at depth 1 would be at depth 2. And so on.

Depending on your crawler, this might apply only to documents in the same site/domain (the usual case) or also to documents hosted elsewhere.
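
For the usual same-site case, one simple approach is to compare hostnames before queuing a link; a minimal sketch using the standard library (root_url is just an illustrative parameter name):

from urllib.parse import urlparse

def same_site(url, root_url):
    # True if url is hosted on the same domain as root_url
    return urlparse(url).netloc == urlparse(root_url).netloc

# e.g. only queue a link if same_site(link, "http://www.domain.com") is True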

Most web sites are not representable by a binary tree, as the "root" might link to more than two child "nodes".


Well, in the case of Nutch, the depth argument is quite a misnomer; it just means the number of loops the crawler goes through. So you will reach pages which are 3 links away from your seed URLs... on a given site that might mean depth 3... that is, if they end up being within the top N limits.


Depth is the number of slashes in the URL path.

Example: http://www.google.com/foo/bar/baz has a depth of 3

import urlparse  # Python 2 stdlib; use urllib.parse in Python 3

def depth(url):
    return len(urlparse.urlparse(url).path.split("/")) - 1

# depth("http://www.google.com/foo/bar/baz") == 3
