
Postgres: one table with many columns or several tables with fewer columns?

My question relates to the innards of how Postgres works:

I have a table:


CREATE TABLE A (
   id SERIAL,
   name VARCHAR(32),
   type VARCHAR(32) NOT NULL,
   priority SMALLINT NOT NULL,
   x SMALLINT NOT NULL,
   y SMALLINT NOT NULL,
   start timestamp with time zone,
   "end" timestamp with time zone,  -- "end" is a reserved word, so it must be quoted
   state Astate NOT NULL,
   other_table_id1 bigint REFERENCES W,
   other_table_id2 bigint NOT NULL REFERENCES S,
   PRIMARY KEY(id)
);

with additional indexes on other_table_id1, state and other_table_id2.

The table is quite large and sees a great many updates to the other_table_id1 and state columns, a few updates to the start and end columns, and none to the rest, which are effectively immutable. (Astate is an enumerated type for the state column.)

I'm wondering whether it makes sense to split the two most frequently updated columns out into a separate table. What I'm hoping to gain is better performance when I'm only looking up that info, and cheaper updates, on the assumption (maybe?) that reading and writing a shorter row is less costly. But I need to weigh that against the cost of joins when they are (occasionally) needed to pull all the data for a particular item at once.
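Concretely, the split I'm considering would look roughly like this (a rough sketch only; the new table names are invented):


-- Wide table holding the columns that rarely or never change.
CREATE TABLE A_static (
   id SERIAL PRIMARY KEY,
   name VARCHAR(32),
   type VARCHAR(32) NOT NULL,
   priority SMALLINT NOT NULL,
   x SMALLINT NOT NULL,
   y SMALLINT NOT NULL,
   start timestamp with time zone,
   "end" timestamp with time zone,
   other_table_id2 bigint NOT NULL REFERENCES S
);

-- Narrow table holding only the two frequently updated columns,
-- one row per row of A_static.
CREATE TABLE A_hot (
   id integer PRIMARY KEY REFERENCES A_static(id),
   state Astate NOT NULL,
   other_table_id1 bigint REFERENCES W
);


Queries that need everything would then join the two tables on id.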

At one point, I was under the impression that each column is stored separately. I later revised that view when I read somewhere that narrowing a column on one side of a table improves performance even when looking up data through another column (because the row is stored as a unit, so the overall row is shorter). So I'm now under the impression that all the data for a row is physically stored together on disk, which makes the proposed split sound helpful. When I currently write 4 bytes to update the state, am I to believe I'm rewriting the 64 bytes of text (name, type) that never actually change?

I'm not very experienced with table "normalization" and not familiar with the internals of Postgres, so I'm looking for advice, and especially for best practices for estimating the tradeoff without having to do the work first and only then find out whether it was worthwhile. The change would require a fair bit of effort rewriting queries that have already been highly optimized, so I would rather go in with a good understanding of what result to expect. Thanks, M.


There is a definite cost to updating a larger row.

A formula can help with this. If you do not split, your cost is

Cost = xU + yS

where:

U = cost of updating the entire (unsplit) row

S = cost of a select

x, y = number of updates and selects, respectively

Then, if you split it, you are weighing this instead:

Cost = gU1 + hU2 + xS1 + yS2

where

U1 = cost of an update of the smaller table (lower cost)

U2 = cost of an update of the larger table (higher cost)

S1 = cost of a select from the smaller table

S2 = cost of a select from the larger table

g, h, x, y = how often each of those actions occurs

So if g >> h, it pays to break them up, and if x >> y as well, it really pays.
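To make that concrete with purely made-up numbers: suppose a full-row update costs U = 10, an update of the small table costs U1 = 2, an update of the remaining wide table costs U2 = 10, and a select costs about 1 from either table. If the hot columns are updated g = 1,000 times for every h = 10 updates of the rest, the update side of the cost drops from about 1,010 × 10 = 10,100 to 1,000 × 2 + 10 × 10 = 2,100. The split wins as long as the extra joins on the select side cost less than the roughly 8,000 saved.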

EDIT: In response to the comments, I would also point out that these costs matter far more when the database is under sustained load with no idle time. If instead the server is mostly idle, handling only 1 or 2 transactions per second with long stretches (where "long" means a few seconds) of inactivity, then, if it were me, I would not complicate my code, because the performance benefit would never show up as anything measurable.


One of PostgreSQL's implementation details that has a bearing on this is that it never updates rows in place on disk; every update writes a new row version. So there is no quick win from grouping fixed-width columns together at the start of the row, as there is with Oracle, for example (iirc).
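You can see this for yourself by watching a row's ctid (its physical location within the table), which changes on every update even when only one small column is touched. A quick illustration against the table above (id = 1 and the value 'ready' are just stand-ins for a real row and one of the Astate labels):


SELECT ctid, state FROM A WHERE id = 1;     -- e.g. (0,1)
UPDATE A SET state = 'ready' WHERE id = 1;  -- writes a whole new row version
SELECT ctid, state FROM A WHERE id = 1;     -- now a different ctid, e.g. (0,7)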

It is true that grouping columns into different tables based on whether they tend to be updated together can produce less garbage that has to be vacuumed up. Experimenting and measuring results is key here. If you have some data that is updated frequently, you should also investigate the "fillfactor" setting on the table. This setting makes PostgreSQL leave some free space in table pages when inserting, allowing updated versions of rows to be added to the same page as the previous version where possible. That can reduce the burden of an update, since it may mean the indexes pointing at the row don't have to be updated, at the expense of the table taking up more space on disk overall.
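For reference, fillfactor is an ordinary storage parameter set per table (indexes have their own), and changing it only affects pages written afterwards; roughly:


-- Leave ~30% of each heap page free so new row versions can often stay
-- on the same page as the old one (enabling HOT updates when no indexed
-- column changes).
ALTER TABLE A SET (fillfactor = 70);
-- Existing pages are only repacked by a table rewrite (e.g. VACUUM FULL
-- or CLUSTER), both of which take heavy locks, so schedule with care.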

As mentioned by Xaade, there is a lot of material to get into on this subject. I'd like to reinforce my comment about needing to measure the impact of any changes you make: sometimes what looks like a big win turns out not to be one in practice.


It's worth splitting it up no matter how the columns are stored. You'd run into far fewer concurrency problems, speed up lookups that only need part of the data, speed up index searching by giving yourself three indexes to search without having to create them as secondary keys, and so forth.

You can lessen the impact of the joins either by cheating or by limiting how many rows are viewed at one time. Cheating means providing an interface rather than allowing direct lookups: only fetch the joined data for the rows currently visible (you can only show so many rows on screen at once), or show the additional data only for the currently selected row, or allow only X rows per lookup with browse buttons, as in the sketch below. If you use the cheat, make sure you cache the results of an extended lookup.
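As a sketch of that last idea, assuming a split into, say, a narrow A_hot table and a wide A_static table (names are hypothetical), keyset-style paging keeps the join limited to the rows actually shown:


-- Fetch one screenful of items, joining the wide table only for those rows.
SELECT h.id, h.state, h.other_table_id1, s.name, s.type
FROM A_hot h
JOIN A_static s ON s.id = h.id
WHERE h.id > 1234          -- keyset paging: last id shown on the previous page
ORDER BY h.id
LIMIT 50;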

