Graph Database to Count Direct Relations
I'm trying to graph the linking structure of a web site so I can model how pages on a given domain link to each other. Note that I'm not graphing links to sites outside the root domain.
Obviously this graph could be considerable in size. One of the main queries I want to perform is to count how many pages directly link into a given url. I want to run this against the whole graph (shudder) so that I end up with a list of urls and the count of incoming links to each url.
I know one popular way of doing this would be via some kind of map-reduce, and I may still end up going that way. However, I have a requirement to be able to view this report in (near) realtime, which isn't generally map-reduce friendly.
I've had a quick look at Neo4j and OrientDB. While both of these could model the relationship I want, it's not clear whether I could query them to generate the report I want. At this point I'm not committed to any particular technology.
Any help would be greatly appreciated. Thanks, Paul
Both OrientDB and Neo4j support Blueprints as a common API for graph operations like traversal, counting, etc.
If I've understood your use case correctly, your graph seems pretty simple: you have "URL" vertices that link to each other with one type of edge, "Links".
To execute operations against the graph, take a look at Gremlin.
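For example, counting incoming "Links" edges per URL through Blueprints could look something like the sketch below. This is only an illustration, assuming the Blueprints 2.x Java API and that each page vertex stores its address in a "url" property (both names are assumptions, not something from your description); it isn't tied to either database's native API.

```java
import com.tinkerpop.blueprints.Direction;
import com.tinkerpop.blueprints.Edge;
import com.tinkerpop.blueprints.Graph;
import com.tinkerpop.blueprints.Vertex;

import java.util.HashMap;
import java.util.Map;

public class IncomingLinkReport {

    // Build a map of url -> number of incoming "Links" edges by walking every vertex.
    public static Map<String, Long> countIncomingLinks(Graph graph) {
        Map<String, Long> report = new HashMap<String, Long>();
        for (Vertex page : graph.getVertices()) {
            long count = 0;
            for (Edge ignored : page.getEdges(Direction.IN, "Links")) {
                count++;
            }
            // "url" is assumed to be a property stored on each page vertex.
            report.put((String) page.getProperty("url"), count);
        }
        return report;
    }
}
```

That is a full scan, so it gives you the whole report in one pass but won't be realtime on a large graph by itself.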
You might have a look at structr. It is an open-source CMS running on top of Neo4j and has exactly those types of inter-page links.
To get the number of links pointing to a page, you just have to iterate the incoming LINKS_TO relationships of the current page node.
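A minimal sketch of that iteration, assuming Neo4j's embedded Java API (1.x era) and a relationship type named LINKS_TO; the class and method names here are just illustrative:

```java
import org.neo4j.graphdb.Direction;
import org.neo4j.graphdb.DynamicRelationshipType;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.Relationship;
import org.neo4j.graphdb.RelationshipType;

public class PageLinks {

    private static final RelationshipType LINKS_TO =
            DynamicRelationshipType.withName("LINKS_TO");

    // Count how many pages link directly into the given page node.
    public static long countIncomingLinks(Node page) {
        long count = 0;
        for (Relationship ignored : page.getRelationships(Direction.INCOMING, LINKS_TO)) {
            count++;
        }
        return count;
    }
}
```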
What is the use case for your query? A popular-page list? So it would just contain the top-n pages? You might then try to start at random places in the graph, traverse incoming LINKS_TO relationships to your current node(s) in parallel, and put them into a sorting structure, so you always start/continue with the 20 or so page nodes that already have the highest number of incoming links (until they're finished).
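A much simplified, single-threaded version of that idea is sketched below: it does a full scan rather than a parallel traversal, keeps the top n pages in a min-heap, and reuses the countIncomingLinks() helper from the previous sketch. The allPageNodes argument is a hypothetical stand-in for however you enumerate page nodes.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

import org.neo4j.graphdb.Node;

public class TopPages {

    // One (page, incoming-link count) entry in the ranking.
    static class Ranked {
        final Node page;
        final long links;
        Ranked(Node page, long links) { this.page = page; this.links = links; }
    }

    // Keep only the n pages with the highest incoming-link counts.
    public static List<Ranked> topLinkedPages(Iterable<Node> allPageNodes, int n) {
        // Min-heap by link count: the weakest of the current top n sits at the head.
        PriorityQueue<Ranked> heap = new PriorityQueue<Ranked>(n, new Comparator<Ranked>() {
            public int compare(Ranked a, Ranked b) {
                return Long.compare(a.links, b.links);
            }
        });
        for (Node page : allPageNodes) {
            heap.offer(new Ranked(page, PageLinks.countIncomingLinks(page)));
            if (heap.size() > n) {
                heap.poll(); // drop the page with the fewest incoming links
            }
        }
        List<Ranked> result = new ArrayList<Ranked>(heap);
        Collections.sort(result, new Comparator<Ranked>() {
            public int compare(Ranked a, Ranked b) {
                return Long.compare(b.links, a.links); // highest first
            }
        });
        return result;
    }
}
```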
Marko Rodriguez has some similar "page-rank" examples in the Gremlin documentation. He's also got several blog posts where he talks about this.
Well, with Neo4j you won't be able to split the graph across servers to distribute the load. You could replicate the database to distribute the computation, but then updating will be slow (as you have to replicate the updates). I would attack the problem by maintaining a count of inbound links as a property on each node, updated whenever a new relationship is added. Neo4j has excellent write performance. Of course, you don't strictly need to persist this information, because direct relationships are cheap to retrieve (you get an iterator rather than a collection of all related nodes).
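A hedged sketch of that write path, again assuming the Neo4j 1.x-era embedded Java API; the "inboundLinks" property name is just an illustration of the denormalised counter, not anything built in:

```java
import org.neo4j.graphdb.DynamicRelationshipType;
import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.RelationshipType;
import org.neo4j.graphdb.Transaction;

public class LinkWriter {

    private static final RelationshipType LINKS_TO =
            DynamicRelationshipType.withName("LINKS_TO");

    // Create a LINKS_TO relationship and bump a denormalised "inboundLinks"
    // counter property on the target page in the same transaction.
    public static void addLink(GraphDatabaseService db, Node fromPage, Node toPage) {
        Transaction tx = db.beginTx();
        try {
            fromPage.createRelationshipTo(toPage, LINKS_TO);
            long current = (Long) toPage.getProperty("inboundLinks", 0L);
            toPage.setProperty("inboundLinks", current + 1);
            tx.success();
        } finally {
            tx.finish();
        }
    }
}
```

With the counter kept up to date on every write, the "list of urls with incoming-link counts" report becomes a simple read of one property per node rather than a traversal.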
You should also take a look at a highly scalable graph database product, such as InfiniteGraph. If you email their technical support I think they will be able to point you at some sample code that does a large part of what you've described here.