Does it Make Sense to Map a Graph Data-structure into a Relational Database?
Specifically开发者_如何转开发 a Multigraph.
Some colleague suggested this and I'm completely baffled.
Any insights on this?
It's pretty straightforward to store a graph in a database: you have a table for nodes, and a table for edges, which acts as a many-to-many relationship table between the nodes table and itself. Like this:
create table node (
id integer primary key
);
create table edge (
start_id integer references node,
end_id integer references node,
primary key (start_id, end_id)
);
However, there are a couple of sticky points about storing a graph this way.
Firstly, the edges in this scheme are naturally directed - the start and end are distinct. If your edges are undirected, then you will either have to be careful in writing queries, or store two entries in the table for each edge, one in either direction (and then be careful writing queries!). If you store a single edge, i would suggest normalising the stored form - perhaps always consider the node with the lowest ID to be the start (and add a check constraint to the table to enforce this). You could have a genuinely unordered representation by not having the edges refer to the nodes, but rather having a join table between them, but that doesn't seem like a great idea to me.
Secondly, the schema above has no way to represent a multigraph. You can extend it easily enough to do so; if edges between a given pair of nodes are indistinguishable, the simplest thing would be to add a count to each edge row, saying how many edges there are between the referred-to nodes. If they are distinguishable, then you will need to add something to the node table to allow them to be distinguished - an autogenerated edge ID might be the simplest thing.
However, even having sorted out the storage, you have the problem of working with the graph. If you want to do all of your processing on objects in memory, and the database is purely for storage, then no problem. But if you want to do queries on the graph in the database, then you'll have to figure out how to do them in SQL, which doesn't have any inbuilt support for graphs, and whose basic operations aren't easily adapted to work with graphs. It can be done, especially if you have a database with recursive SQL support (PostgreSQL, Firebird, some of the proprietary databases), but it takes some thought. If you want to do this, my suggestion would be to post further questions about the specific queries.
It's an acceptable approach. You need to consider how that information will be manipulated. More than likely you'll need a language separate from your database to do the kinds graph related computations this type of data implies. Skiena's Algorithm Design Manual has an extensive section graph data structures and their manipulation.
Without considering what types of queries you might execute, start with two tables vertices
and edges
. Vertices are simple, an identifier and a name. Edges are complex given the multigraph. Edges should be uniquely identified by a combination two vertices (i.e. foreign keys) and some additional information. The additional information is dependent on the problem you're solving. For instance, if flight information, the departure and arrival times and airline. Furthermore you'll need to decide if the edge is directed (i.e. one way) or not and keep track if that information as well.
Depending on the computation you may end up with a problem that's better solved with some sort of artificial intelligence / machine learning algorithm. For instance, optimal flights. The book Programming Collective Intelligence has some useful algorithms for this purpose. But where the data is kept doesn't change the algorithm itself.
Well, the information has to be stored somewhere, a relational database isn't a bad idea.
It would just be a many-to-many relationship, a table of a list of nodes, and table of a list of edges/connections.
Consider how Facebook might implement the social graph in their database. They might have a table for people and another table for friendships. The friendships table has at least two columns, each being foreign keys to the table of people.
Since friendship is symmetric (on Facebook) they might ensure that the ID for the first foreign key is always less than the ID for the second foreign key. Twitter has a directed graph for its social network, so it wouldn't use a canonical representation like that.
精彩评论