Huge Graph Structure
I'm developing an application in which I need a structure to represent a huge graph (between 1000000 and 6000000 nodes and 100 or 600 edges per node) in memory. The edges representation will contain some attributes of the relation.
I have tried a memory map representation, arrays, dictionaries and strings to represent that structu开发者_如何学JAVAre in memory, but these always crash because of the memory limit.
I would like to get an advice of how I can represent this, or something similar.
By the way, I'm using python.
- If that is 100-600 edges/node, then you are talking about 3.6 billion edges.
- Why does this have to be all in memory?
- Can you show us the structures you are currently using?
- How much memory are we allowed (what is the memory limit you are hitting?)
If the only reason you need this in memory is because you need to be able to read and write it fast, then use a database. Databases read and write extremely fast, often they can read without going to disk at all.
Depending on you hardware resources an all in memory for a graph this size is probably out of the question. Two possible options from a graph specific DB point of view are:
- Neo4j - claims to easily handle billions of nodes and its been in development a long time.
- FlockDB - newly released by Twitter this is a distributed graph database.
Since your using Python, have you looked at Networkx? How far did you get loading a graph of this size if you have looked at it out of interest?
I doubt you'll be able to use a memory structure unless you have a LOT of memory at your disposal:
Assume you are talking about 600 directed edges from each node, with a node being 4-bytes (integer key) and a directed edge being JUST the destination node keys (4 bytes each).
Then the raw data about each node is 4 + 600 * 4 = 2404 bytes x 6,000,000 = over 14.4GB
That's without any other overheads or any additional data in the nodes (or edges).
You appear to have very few edges considering the amount of nodes - suggesting that most of the nodes aren't strictly necessary. So, instead of actually storing all of the nodes, why not use a sparse structure and only insert them when they're in use? This should be pretty easy to do with a dictionary; just don't insert the node until you use it for an edge.
The edges can be stored using an adjacency list on the nodes.
Of course, this only applies if you really mean 100-600 nodes in total. If you mean per node, that's a completely different story.
The scipy.sparse.csgraph package may be able to handle this -- 5 million nodes * 100 edges on average is 500 million pairs, at 8 bytes per pair (two integer IDs) is just about 4GB. I think csgraph
uses compression so it will use less memory than that; this could work on your laptop.
csgraph doesn't have as many features as networkx but it uses waaay less memory.
Assuming you mean 600 per node, you could try something like this:
import os.path
import cPickle
class LazyGraph:
def __init__(self,folder):
self.folder = folder
def get_node(self,id):
f = open(os.path.join(self.folder,str(id)),'rb')
node = cPickle.load(f)
f.close() # just being paranoid
return node
def set_node(self,id,node):
f = open(os.path.join(self.folder,str(id)),'wb')
cPickle.dump(node,f,-1) # use highest protocol
f.close() # just being paranoid
Use arrays (or numpy arrays) to hold the actual node ids, as they are faster.
Note, this will be very very slow.
You could use threading to pre-fetch nodes (assuming you knew which order you were processing them in), but it won't be fun.
Sounds like you need a database and an iterator over the results. Then you wouldn't have to keep it all in memory at the same time but you could always have access to it.
If you do decide to use some kind of database after all, I suggest looking at neo4j and its python bindings. It's a graph database capable of handling large graphs. Here's a presentation from this year's PyCon.
精彩评论