开发者

Find shortest path between two webpages

I need to find the shortest distance between开发者_如何转开发 two Wikipedia pages (in "hops")

I have a method to extract all the internal wiki links on a page

I know the start destination and the end destination but I'm coming up blank on how to extract the hops from the data

so far ive been using the link extraction method to populate a dictionary with the key being the link on the page and the value being the page it was taken off of.

If anyone has any ideas one what a good data structure would be to hold the info and then how to look through it I would appreciate it very much


Do you know anything about graph theory? You have the necessary data to build a graph, but you'll need to use Dijkstra's algorithm to traverse it to find the shortest path between your two points.


Maybe it's a bit stupid since I'm not really a C# programmer but a multidimensional array that contains all the links inside would, depending on the depth of dimensions let you know which way contains less hoops.

It's just a thought while this is certainly doable in theory since there is no language limit to the number of dimensions an array can have, I'm pretty sure it'll be really memory hungry!

Something like this:

[source] -> [source link] -> ['source link' link] -> etc
         -> [source link] -> ['source link' link] -> etc
         -> [source link] -> ['source link' link] -> etc
         -> [source link] -> ['source link' link] -> [target]
         -> [source link] -> ['source link' link] -> etc


I think the graph is sparse in this case. So it might be a good idea to use something like HashSet for each Wikipedia page, with pages where it links to inside the set.

You don't really need to implement Dijikstra's shortest path algorithm in this case. Since this is equal to shortest path problem where the weight of each edge is equal to 1. You could just do a Breadth-first search and get get the depth where the destination page is found.


Here's a implementation of Dijkstra's algorithm in python: http://code.activestate.com/recipes/119466/


Assuming you have a IEnumerable<Link> PageLinks(Link link)

The number of hops would be solved by the following:

Link curentPage = "somepage";
Link destinationPage = "otherpage";
if (currentPage == destinationPage) return 0;
int hops = 1;
IEnumerable<Link> currentLinks = PageLinks(currentPage);
IEnumerable<Link> visited = new [] {currentPage};
while(!currentLinks.Contains(destinationPage)) 
{
    currentLinks = currentLinks
        .SelectMany(l => PageLinks(l).Where(f => !visited.Contains(f)));
    visited = visited.Union(currentLinks);
    hops++;
}
return hops;

Edited to make faster for cycling, although the algorithm would have worked without it. It may run until StackOverflow or so if the pages aren't linked.

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜