Huge Graph Structure

2022-12-29 16:36 问答作者：

I'm developing an application in which I need a structure to represent a huge graph (between 1000000 and 6000000 nodes and 100 or 600 edges per node) in memory. The edges representation will contain some attributes of the relation.

I have tried a memory map representation, arrays, dictionaries and strings to represent that structu开发者_如何学JAVAre in memory, but these always crash because of the memory limit.

I would like to get an advice of how I can represent this, or something similar.

By the way, I'm using python.

If that is 100-600 edges/node, then you are talking about 3.6 billion edges.
Why does this have to be all in memory?
Can you show us the structures you are currently using?
How much memory are we allowed (what is the memory limit you are hitting?)

If the only reason you need this in memory is because you need to be able to read and write it fast, then use a database. Databases read and write extremely fast, often they can read without going to disk at all.

Depending on you hardware resources an all in memory for a graph this size is probably out of the question. Two possible options from a graph specific DB point of view are:

Neo4j - claims to easily handle billions of nodes and its been in development a long time.
FlockDB - newly released by Twitter this is a distributed graph database.

Since your using Python, have you looked at Networkx? How far did you get loading a graph of this size if you have looked at it out of interest?

I doubt you'll be able to use a memory structure unless you have a LOT of memory at your disposal:

Assume you are talking about 600 directed edges from each node, with a node being 4-bytes (integer key) and a directed edge being JUST the destination node keys (4 bytes each).

Then the raw data about each node is 4 + 600 * 4 = 2404 bytes x 6,000,000 = over 14.4GB

That's without any other overheads or any additional data in the nodes (or edges).

You appear to have very few edges considering the amount of nodes - suggesting that most of the nodes aren't strictly necessary. So, instead of actually storing all of the nodes, why not use a sparse structure and only insert them when they're in use? This should be pretty easy to do with a dictionary; just don't insert the node until you use it for an edge.

The edges can be stored using an adjacency list on the nodes.

Of course, this only applies if you really mean 100-600 nodes in total. If you mean per node, that's a completely different story.

The scipy.sparse.csgraph package may be able to handle this -- 5 million nodes * 100 edges on average is 500 million pairs, at 8 bytes per pair (two integer IDs) is just about 4GB. I think csgraph uses compression so it will use less memory than that; this could work on your laptop.

csgraph doesn't have as many features as networkx but it uses waaay less memory.

Assuming you mean 600 per node, you could try something like this:

import os.path
import cPickle
class LazyGraph:
    def __init__(self,folder):
        self.folder = folder

    def get_node(self,id):
        f = open(os.path.join(self.folder,str(id)),'rb')
        node = cPickle.load(f)
        f.close() # just being paranoid
        return node

    def set_node(self,id,node):
        f = open(os.path.join(self.folder,str(id)),'wb')
        cPickle.dump(node,f,-1) # use highest protocol
        f.close() # just being paranoid

Use arrays (or numpy arrays) to hold the actual node ids, as they are faster.

Note, this will be very very slow.

You could use threading to pre-fetch nodes (assuming you knew which order you were processing them in), but it won't be fun.

Sounds like you need a database and an iterator over the results. Then you wouldn't have to keep it all in memory at the same time but you could always have access to it.

If you do decide to use some kind of database after all, I suggest looking at neo4j and its python bindings. It's a graph database capable of handling large graphs. Here's a presentation from this year's PyCon.

继续阅读：data-structures graph memory python

Huge Graph Structure

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？