PostgreSQL: How to optimize my database for storing and querying a huge graph
I'm running PostgreSQL 8.3 on a 1.83 GHz Intel Core Duo Mac Mini with 1GB of RAM and Mac OS X 10.5.8. I have a stored a huge graph in my PostgreSQL database. It consists of 1.6 million nodes and 30 million edges. My database schema is like:
CREATE TABLE nodes (id INTEGER PRIMARY KEY,title VARCHAR(256));
CREATE TABLE edges (id INTEGER,link INTEGER,PRIMARY KEY (id,link));
CREATE INDEX id_idx ON edges (id);
CREATE INDEX link_idx ON edges (link);
The data in the table edges looks like
id link 
1  234
1  88865
1  6
2  365
2  12
...
So it stores for each node with id x the outgoing link to id y.
The time for searching all the outgoing links is ok:
=# explain analyze select link from edges where id=4620;
                           QUERY PLAN                                                        
    ---------------------------------------------------------------------------------
     Index Scan using id_idx on edges  (cost=0.00..101.61 rows=3067 width=4) (actual time=135.507..157.982 rows=1052 loops=1)
       Index Cond: (id = 4620)
     Total runtime: 158.348 ms
    (3 rows)
However, if I search for the incoming links to a node, the database is more than 100 times slower (although the resulting number of incoming links is only 5-10 times higher than the number of outgoing links):
=# explain analyze select id from edges where link=4620;
                         QUERY PLAN                                                           
----------------------------------------------------------------------------------
     Bitmap Heap Scan on edges  (cost=846.31..100697.48 rows=51016 width=4) (actual time=322.584..48983.478 rows=26887 loops=1)
       Recheck Cond: (link = 4620)
       ->  Bitmap Index Scan on link_idx  (cost=0.00..833.56 rows=51016 width=0) (actual time=298.132..298.132 rows=2开发者_开发问答6887 loops=1)
             Index Cond: (link = 4620)
     Total runtime: 49001.936 ms
    (5 rows)
I tried to force Postgres not to use a Bitmap Scan via
=# set enable_bitmapscan = false;
but the speed of the query for incoming links didn't improve:
=# explain analyze select id from edges where link=1588;
                      QUERY PLAN                                                           
-------------------------------------------------------------------------------------------
 Index Scan using link_idx on edges  (cost=0.00..4467.63 rows=1143 width=4) (actual time=110.302..51275.822 rows=43629 loops=1)
   Index Cond: (link = 1588)
 Total runtime: 51300.041 ms
(3 rows)
I also increased my shared buffers from 24MB to 512MB, but it didn't help. So I wonder why my queries for outgoing and incoming links show such an asymmetric behaviour? Is something wrong with my choice of indexes? Or should I better create a third table holding all the incoming links for a node with id x? But that would be quite a waste of disk space. But since I'm new into SQL databases maybe I'm missing something basic here?
I guess it is because of a “density” of same-key-records on the disk. I think the records with same id are stored in dense (i.e., few number of blocks) and those with same link are stored in sparse (i.e., distributed to huge number of blocks). If you have inserted records in the order of id, this situation can be happen.
Assume that: 1. there are 10,000 records, 2. they're stored in the order such as (id, link) = (1, 1), (1, 2),..., (1, 100), (2, 1)..., and 3. 50 records can be stored in a block.
In the assumption above, block #1~#3 consists of the records (1, 1)~(1, 50), (1, 51)~(1, 100) and (2, 1)~(2, 50) respectively.
When you SELECT * FROM edges WHERE id=1, only 2 blocks (#1, #2) is to be loaded and scanned.
On the other hand, SELECT * FROM edges WHERE link=1 requires 50 blocks (#1, #3, #5,...), even though the number of rows are same.
If you need good performance and can deal without foreign key constraints (or use triggers to implement them manually) try the intarray and intagg extension modules. Instead of the edges table have an outedges integer[] column on nodes table. This will add about 140MB to the table, so the whole thing will still probably fit into memory. For reverse lookups, either create an GIN index on the outedges column (for an additional 280MB), or just add an inedges column.
Postgresql has pretty high row overhead so the naive edges table will result in 1G of space for the table alone, + another 1.5 for the indices. Given your dataset size, you have a good chance of having most of it in cache if you use integer arrays to store the relations. This will make any lookups blazingly fast. I see around 0.08ms lookup times to get edges in either direction for a given node. Even if you don't fit it all in memory, you'll still have a larger fraction in memory and a whole lot better cache locality.
I think habe is right.
You can check this by using cluster link_idx on edges; analyze edges after filling the table. Now the second query should be fast, and first should be slow.
To have both queries fast you'll have to denormalize by using a second table, as you have proposed. Just remember to cluster and analyze this second table after loading your data, so all egdes linking to a node will be physically grouped.
If you will not query this all the time and you do not want to store and backup this second table then you can create it temporarily before querying:
create temporary table egdes_backwards
  as select link, id from edges order by link, id;
create index edges_backwards_link_idx on edges_backwards(link);
You do not have to cluster this temporary table, as it will be physically ordered right on creation. It does not make sense for one query, but can help for several queries in a row.
Your issue seems to be disk-io related. Postgres has to read the tuples of index matches in order to see whether or not the row is visible (this can not be done from an index as it doesn't contain the necessary information).
VACUUM ANALYZE (or simply ANALYZE) will help if you have lots of deleted rows and/or updated rows. Run it first and see if you get any improvements.
CLUSTER might also help. Based on your examples, I'd say using link_idx as the cluster-key. "CLUSTER edges USING link_idx". It might degrade the performance of your id queries though (your id queries might be quick because they are already sorted on disk). Remember to run ANALYZE after CLUSTER.
Next steps include fine-tuning memory parameters, adding more memory, or adding a faster disk subsystem.
have you tried doing this in www.neo4j.org? This is almost trivial in a graph database and should give you performance on your usecase in ms-range.
 
         加载中,请稍侯......
 加载中,请稍侯......
      
精彩评论