开发者

What are the best ways to store Graphs in persistent storage

I am wondering what the best ways to store graphs in persistent storage are, for later analysis, search, clustering, etc.

I see开发者_如何学JAVA neo4j being an option, I am curious if there are also other graph databases available. Does anyone have any insights into how larger social networks store their graph based data (or other sites that require the storage of graph like models, e.g. RDF).

What about options like Cassandra, or MySQL?


Graph Databases:

  1. HyperGraphDB: a general purpose, extensible, portable, distributed, embeddable, open-source data storage mechanism.
  2. InfoGrid: an Internet Graph Database with a many additional software components that make the development of REST-ful web applications on a graph foundation easy.
  3. vertexdb: a high performance graph database server that supports automatic garbage collection.

Source: http://nosql.mypopescu.com/post/498705278/quick-review-of-existing-graph-databases

Graph Libraries:

  1. WebGraph is a framework to study the web graph. From their page - "It provides simple ways to manage very large graphs, exploiting modern compression techniques."
  2. Dex is a high performance library to manage very large graphs or networks.
  3. This blog post - On Building a Stupidly Fast Graph Database - provides some guidelines on building a graph database - the technique they use is "memory-mapped I/O, disk-based linear-hashing".


Disclaimer: I am speaking form the graph analysis standpoint.

There are several file formats for storing graph data: GraphML, GXL and several others. But storage usually is not a problem. Working with the graphs without fully loading them into RAM is the tricky part.

The RDF model is too generic to do serious graph analysis stuff. If you don't mind your analysis being slow and programming the algorithms yourself, go with the existing graph databases - see wikipedia on this.

For real analysis, load all data into RAM using existing graph analysis libraries, like SNAP or see This question.


There is no absolutely correct answer here; there is a large variety of options, the choice of which seriously depends on your needs. With large-scale retrievals/traversals (e.g. social networks and similar back-ends) you're quickly going to run into the random I/O bottleneck; I believe storing your graph in RAM is currently the only practical course of action. Less latency-sensitive applications have quite a wide variety of options, including neo4j (open source with a commercial flavor) and Allegrograph (commercial with a limited free edition).

At Delver we ended up implementing our own denormalized data model (essentially an adjacency list to represent the graph) in RAM on top of GigaSpaces (some info can be found in this presentation), with custom map-reduce code for queries and data analysis. If you go this route, Cassandra seems to be a viable open source platform to build on.


You could look at InfiniteGraph, which will be released for beta very soon (http://www.infinitegraph.com/)

If this is for commercial use then you'll see it's targeted towards sites that will have larger graphs. The social networking sites built custom solutions, which worked for them at the time. But they're in-house solutions are more limiting than using something like InfiniteGraph. Products like Cassandra or MySQL weren't designed for this many-to-many problem set. Can you do it? Sure, but it's a lot of hand-written coding, and not scalable. Let us know if you have a real project, we could help you figure out you graph requirements. Thanks, Warren wdavidson@objectivity.com

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜