Efficient database query for ancestors on an acyclic directed graph
Let's say I have an acyclic directed graph such as a family "tree" (not really a tree since a child has 2 parents). I want to place a representation of this graph in a relational database so that it's fast to compute all ancestors of a node, and all descendants of a node. How would you represent this graph? How would you query for all descendants? How would you insert and remove nodes and re开发者_JAVA技巧lationships? What assumptions are you making about the data?
The best solution will have the best big O for the number of select/insert/delete
statements you run to query ancestors and descendants, with ties broken by best big O for total runtime, with ties broken by space requirements.
My coworker posed this question to me. I have a solution but it's exponential size in the worst case so I wanted to see how other people would solve it.
edit
Clarified relational database. This question is trivial (and boring) if you use graph databases with built in transitive closures.
If selects
> manipulations
, and especially subtree selects (all ancestors, all descendants) I'd go for a Closure-table approach. Yes, an explosion of paths in your path-table, but it does deliver results fast (as opposed to the adjacency model), and keeps updates limited to relevant portions (as opposed to 50% update with nested sets).
Bill Karwin has some nice presentation online about pros and cons of different models, see http://www.slideshare.net/billkarwin/models-for-hierarchical-data (slide 48 is an overview).
For DAGs in SQL databases there appeared to be only two solutions:
Recursive WITH clause.
Transitive closure
I'm not aware of any practical graph labeling scheme (like nested sets,intervals or materialized path)
RDBMS:s aren't really designed to handle this kind of data. The obvious choice is to use a graph database instead, then there's no need to translate the graph into a different representation, you use an graph API all the way. There's a good presentation by Marko Rodriguez explaining the impact of the underlying data model when dealing with graph traversals, see The Graph Traversal Programming Pattern if you want to look deeper into that.
I wrote up a simple example of handling DAGs with the Neo4j graph database a while ago which may be useful to you.
"How would you represent this graph?"
- VAR NODES RELATION{node:sometype} KEY {node};
- VAR EDGES RELATION{parentNode:sometype childNode:sometype} KEY {parentNode childNode};
- CONSTRAINT NO_CYCLES IS_EMPTY(TCLOSE(EDGES) WHERE parentNode=childNode);
"How would you query for all descendants?"
TCLOSE(EDGES) WHERE parentNode=somevalue;
"How would you insert and remove nodes and relationships?"
- INSERT INTO EDGES RELATION{TUPLE{parentNode somevalue chlidNode somevalue}};
- DELETE EDGES WHERE deleteCondition;
"What assumptions are you making about the data?"
What kind of assumptions are there to make ? You have specified everything there is to specify by saying "directed acyclic graph".
In a relational database I would store for each node :
- father
- childs
- ancestors
With index on everything and full-index on ancestors
Request for :
- all ancestors :
- O(log n) (find the node then you are done)
- all descendants :
- O(full-index search on ancestors) (depends of database)
- add new node /delete node (with no childs) :
- O(1) for father+ancestors
- O(log n) to find father
- update father's childs O(|father's childs|)
- move node (difficult) :
- O(1) to update father
- O(log n) to find old/new fathers
- update father's childs twice O(|father's childs|)
- update ancestors of all descendants (simple replace) : O(|descendants|*|depth max tree|) (depth-max : replace and create big string of max-length (depth-max) )
Overall complexity will depends of :
- depth of the tree
- balanced tree ?
- number of childs ? (in mean, max...)
- complexity of operation in a given relational database
For SELECT only, efficient, but difficult for updates.
In practice : work on RAM-size tree (with memchaed for example, keep everything in RAM) and if not posssible buy more RAM, of "cur" you tree in smaller trees.
All-descendants will cost a lot anyway, with sub-trees you can have descendants of max-depth D, without having all of them.
You "jump" form sub-tree to sub-tree : more request but faster ones AND move node way faster (only need to update a sub-tree).
精彩评论