Efficient way to store a graph for calculation in Hadoop
I am currently trying to perform calculations such as the clustering coefficient on huge graphs with the help of Hadoop. Therefore I need an efficient way to store the graph so that I can easily access nodes, their neighbors and the neighbors' neighbors. The graph is quite sparse and is stored in a huge tab-separated file where the first field is the node from which an edge goes to the node in the second field.
Thanks in advance!
The problem with storing a graph directly in HDFS is that you have no means to perform random reads of the data. So to find all the neighbors of a node you have to process the whole edge list in HDFS to find the nodes that are connected to it.
So to perform a clustering coefficient calculation you would need to pass over all the data twice: the first time to find the nodes that are connected to the starting node, and the second time to find out how those nodes are connected to each other.
Each time you want to go out another level in your graph you will therefore need to process the whole graph to find the new connections.
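To make the first of those passes concrete, here is a minimal sketch (my own illustration, not something given in the question) of a Hadoop MapReduce job that turns the tab-separated edge file into adjacency lists. The class names are made up, and treating each edge as undirected is an assumption:

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class AdjacencyListJob {

    // Emit each edge in both directions so every node sees its full neighborhood
    // (assumes the graph should be treated as undirected for the clustering coefficient).
    public static class EdgeMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context ctx)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split("\t");
            if (fields.length < 2) return;               // skip malformed lines
            ctx.write(new Text(fields[0]), new Text(fields[1]));
            ctx.write(new Text(fields[1]), new Text(fields[0]));
        }
    }

    // Collect all neighbors of a node into one comma-separated adjacency list.
    public static class NeighborReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text node, Iterable<Text> neighbors, Context ctx)
                throws IOException, InterruptedException {
            StringBuilder sb = new StringBuilder();
            for (Text n : neighbors) {
                if (sb.length() > 0) sb.append(',');
                sb.append(n.toString());
            }
            ctx.write(node, new Text(sb.toString()));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "edge list -> adjacency lists");
        job.setJarByClass(AdjacencyListJob.class);
        job.setMapperClass(EdgeMapper.class);
        job.setReducerClass(NeighborReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // the tab-separated edge file
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // adjacency lists
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

The output (node, tab, comma-separated neighbor list) is what a second pass would consume to check how each node's neighbors connect to each other.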
Is this an easy thing to do? Yes, it is. Is it time-efficient? That really depends on how fast you need to be able to calculate things like the LCC and how large your graph actually is. It won't be anywhere near real time.
Another approach would be to use HBase to store your edges in some fashion; this would give you random access to nodes while still working in a parallel fashion. After all, HBase is part of Hadoop.
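For illustration only (the answer doesn't prescribe a schema), here is a sketch of what the write side of that could look like with the stock HBase client API: one row per source node, one blank-valued column per neighbor. The table name "edges" and column family "n" are invented, and a real load would go through a MapReduce job or HBase bulk load rather than a single-threaded loop:

```java
import java.io.BufferedReader;
import java.io.FileReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class EdgeLoader {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table edges = conn.getTable(TableName.valueOf("edges"));   // hypothetical table
             BufferedReader in = new BufferedReader(new FileReader(args[0]))) {
            String line;
            while ((line = in.readLine()) != null) {
                String[] f = line.split("\t");
                if (f.length < 2) continue;
                // Row key = source node, one column per target node, empty value.
                Put put = new Put(Bytes.toBytes(f[0]));
                put.addColumn(Bytes.toBytes("n"), Bytes.toBytes(f[1]), new byte[0]);
                edges.put(put);
            }
        }
    }
}
```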
Something else that might be of interest if you want to store large graphs in a parallel fashion is FlockDB. It's a distributed graph database recently released by Twitter. I haven't used it, but it might be worth a look.
If you want to do this on a user-by-user basis, HBase/Cassandra might work. Store the edges in a column family: user_a_id is the row key, and the user_b_ids are the column keys (with blank values). FlockDB isn't a good fit (its authors expressly cite "graph-walking queries" as a non-goal).
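A sketch of the read side under that schema (row key = user_a_id, one blank-valued column per user_b_id); the table name "followers" and column family "following" are hypothetical:

```java
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class NeighborLookup {
    // Fetch one user's neighbors: the row key is user_a_id and every column
    // qualifier in the "following" family is a user_b_id (the values are blank).
    public static List<String> neighbors(Table followers, String userAId) throws Exception {
        Get get = new Get(Bytes.toBytes(userAId));
        get.addFamily(Bytes.toBytes("following"));
        Result row = followers.get(get);
        List<String> out = new ArrayList<>();
        if (!row.isEmpty()) {
            for (Cell cell : row.rawCells()) {
                out.add(Bytes.toString(CellUtil.cloneQualifier(cell)));
            }
        }
        return out;
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table followers = conn.getTable(TableName.valueOf("followers"))) {
            System.out.println(neighbors(followers, args[0]));
        }
    }
}
```

A single Get like this is the random access you don't get from a flat edge list in HDFS.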
If you'd like to calculate the clustering coefficient across the entire graph -- that is, to do one giant efficient computation -- I'd use Hadoop. With some caveats (see below) you can do this quite straightforwardly; at Infochimps we've used Wukong on a strong-link Twitter graph with millions of nodes and edges.
What won't work is to naively do a 2-hop breadth-first search out from every node if your dataset has high skew. Consider the Twitter follow graph: the 1.7M people who follow @wholefoods have 600k outbound edges to contend with, for 1 trillion 2-hops. Using strong links makes this much easier (it vastly reduces the skew); otherwise, do some partial clustering and iterate.
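To see where the skew bites, here is a hypothetical sketch of one common shape for the second pass: a mapper that takes each adjacency list and emits every pair of neighbors as a candidate triangle to be joined against the edge list. A node of degree d produces d*(d-1)/2 records, which is exactly what blows up on hub nodes. The input format (node, tab, comma-separated neighbors) assumes the output of a first pass like the one sketched above:

```java
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Emits one record per pair of neighbors of a node; a later join against the
// edge list decides whether each pair is actually connected (closing a triangle,
// which is what the clustering coefficient counts). A node of degree d emits
// d*(d-1)/2 pairs -- the quadratic blow-up that high-degree hubs cause.
public class TriadMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context ctx)
            throws IOException, InterruptedException {
        String[] parts = value.toString().split("\t");
        if (parts.length < 2) return;
        String node = parts[0];
        String[] neighbors = parts[1].split(",");
        for (int i = 0; i < neighbors.length; i++) {
            for (int j = i + 1; j < neighbors.length; j++) {
                // Key the pair canonically so it can be joined with the edge list.
                String a = neighbors[i], b = neighbors[j];
                String pair = a.compareTo(b) < 0 ? a + "\t" + b : b + "\t" + a;
                ctx.write(new Text(pair), new Text(node));
            }
        }
    }
}
```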