Large Join Tables and Scaling
The Problem
We have a rapidly growing database with several large join tables (currently in the billions of rows), and query times have suffered as these tables have grown. The concern is that as more data is added to the tables they link, the join tables will grow even faster and further degrade query speed.
The Background
I am dealing with a database that stores genomic information. A set of markers (~3 million), each corresponding to a locus where there is a DNA variation, is linked to individuals whose genotypes have been determined at those loci. Each marker has several possible genotypes, and every individual has exactly one of them.
The Current Implementation
When the database (PostgreSQL) was still small, there were no problems in linking the genotypes to the markers using foreign keys and then linking the individuals to their genotypes through a join table. That way, it was easy to look up all of an individual's genotypes or to look up all the individuals having a specific genotype.
A slimmed-down version of these tables is listed below:
Table "public.genotypes"
Column | Type | Modifiers
------------------+-----------------------------+--------------------------------------------------------
id | integer | not null default nextval('genotypes_id_seq'::regclass)
ref_variation_id | integer |
value | character varying(255) |
Indexes:
"genotypes_pkey" PRIMARY KEY, btree (id)
"index_genotypes_on_ref_variation_id" btree (ref_variation_id)
Table "public.genotypes_individuals"
Column | Type | Modifiers
---------------+---------+-----------
genotype_id | integer |
individual_id | integer |
Indexes:
"index_genotypes_individuals_on_genotype_id_and_individual_id" UNIQUE, btree (genotype_id, individual_id)
"index_genotypes_individuals_on_genotype_id" btree (genotype_id)
Table "public.individuals"
Column | Type | Modifiers
---------------+-----------------------------+----------------------------------------------------------
id | integer | not null default nextval('individuals_id_seq'::regclass)
hap_id 开发者_高级运维 | character varying(255) |
population_id | integer |
sex | character varying(255) |
Indexes:
"individuals_pkey" PRIMARY KEY, btree (id)
"index_individuals_on_hap_id" UNIQUE, btree (hap_id)
The bottleneck right now is looking up all of an individual's genotypes, sorted by position. This is done frequently and is much more important than looking up the individuals who have a specific genotype. Examples of these queries:
A simple lookup of all of an individual's genotypes
SELECT * FROM "genotypes" INNER JOIN "genotypes_individuals" ON "genotypes".id = "genotypes_individuals".genotype_id WHERE ("genotypes_individuals".individual_id = 2946 )
Normally, though, this gets limited because there are a lot of genotypes; we're often only interested in those on a specific chromosome.
SELECT * FROM "genotypes" INNER JOIN "genotypes_individuals" ON "genotypes".id = "genotypes_individuals".genotype_id WHERE ("genotypes_individuals".individual_id = 2946 ) AND ("genotypes".ref_variation_id IN (37142, 37143, ...))
We also occasionally still need to go the other way.
SELECT * FROM "individuals" INNER JOIN "genotypes_individuals" ON "individuals".id = "genotypes_individuals".individual_id WHERE ("genotypes_individuals".genotype_id = 53430)
Every time a new individual is added to the db, the join table grows by about 3 million rows. Intuitively, this seems like a bad design, because adding new individuals will slow down any process that uses the existing data.
I understand that databases are designed to handle large tables efficiently, but we are already hitting bottlenecks due to drive IO. An individual query is still inconsequential, but thousands of them add up quickly. We can alleviate this somewhat by spreading the db across multiple drives, but I wanted to see if there are other alternatives. I have been wondering whether it is possible to segregate the join-table entries by individual_id, which might leave individual-to-genotype lookups unaffected by adding more individual-genotype rows. Or do indexes already do that?
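(For illustration, the kind of index I have in mind is sketched below. This is not in the schema above; the name just follows our naming convention:)

CREATE INDEX index_genotypes_individuals_on_individual_id_and_genotype_id
    ON genotypes_individuals (individual_id, genotype_id);
-- A btree leading with individual_id keeps each individual's entries
-- adjacent in the index, so a lookup by individual_id scales with that
-- individual's row count rather than with the size of the whole table.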
Have you looked at table partitioning?
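A minimal sketch of what that could look like here, assuming PostgreSQL 11+ declarative hash partitioning on individual_id (the table and column names come from the schema above; the partition count of four is arbitrary):

CREATE TABLE genotypes_individuals_part (
    genotype_id   integer NOT NULL,
    individual_id integer NOT NULL
) PARTITION BY HASH (individual_id);

-- Four partitions as an example; in practice, size this to the data.
-- Partitions can also be placed in different tablespaces (drives).
CREATE TABLE genotypes_individuals_p0 PARTITION OF genotypes_individuals_part
    FOR VALUES WITH (MODULUS 4, REMAINDER 0);
CREATE TABLE genotypes_individuals_p1 PARTITION OF genotypes_individuals_part
    FOR VALUES WITH (MODULUS 4, REMAINDER 1);
CREATE TABLE genotypes_individuals_p2 PARTITION OF genotypes_individuals_part
    FOR VALUES WITH (MODULUS 4, REMAINDER 2);
CREATE TABLE genotypes_individuals_p3 PARTITION OF genotypes_individuals_part
    FOR VALUES WITH (MODULUS 4, REMAINDER 3);

A query with a WHERE clause on individual_id is then pruned to a single partition, so per-individual lookups stop paying for the full table's size.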
I would consider testing a schema that used natural keys instead of id numbers.
Your lookup of all of an individual's genotypes
SELECT *
FROM "genotypes"
INNER JOIN "genotypes_individuals"
ON "genotypes".id = "genotypes_individuals".genotype_id
WHERE ("genotypes_individuals".individual_id = 2946 )
becomes
SELECT *
FROM genotypes_individuals
WHERE (individual_id = 2946)
Sometimes that's faster. Sometimes it's not.
Switching to natural keys on our production system increased median performance by a factor of 10. Some queries ran 100 times faster with natural keys, because natural keys eliminated a lot of joins. Some queries ran slower, too. But the median speed-up was impressive anyway.
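For concreteness, a natural-key version of the join table might look something like the sketch below. The column choices are assumptions based on the schema in the question (hap_id as the individual's natural key, ref_variation_id identifying the marker), not the poster's actual design; the example query above still filters on individual_id, whereas a fully natural-key design would identify individuals by something like hap_id:

CREATE TABLE genotypes_individuals (
    hap_id           character varying(255) NOT NULL
                     REFERENCES individuals (hap_id),  -- backed by the unique index on hap_id
    ref_variation_id integer NOT NULL,
    value            character varying(255) NOT NULL,  -- the genotype itself, stored inline
    PRIMARY KEY (hap_id, ref_variation_id)
);

With the genotype value stored in the row itself, the per-individual lookup becomes a single range scan on the primary key, with no join at all; the cost is that value is repeated per individual instead of being factored out into a shared genotypes table.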