
mysql database chained tables denormalization

I have a design issue with db denormalization.

I'm making a relatively large database and need to optimize it as much as possible.

Here is a very simplified model of the issue.

All the tables from the picture are chained, and with a normalized database, to get (for example) all the users from a specific country I have to join all the tables. That is approx. 250 countries x approx. 12,000 cities x approx. 625,000 regions x ? addresses x ? users... In short, that is a lot of joining, which takes a long time.

What I want to do is make the country_id redundant in the user table, so I can run the same query without any joins.
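To illustrate (the table and column names here are hypothetical, since the actual schema is only in the picture), the two variants of the query would look roughly like this:

-- Normalized: walk the whole chain to filter users by country.
SELECT u.*
FROM users u
JOIN addresses a  ON a.address_id  = u.address_id
JOIN regions   r  ON r.region_id   = a.region_id
JOIN cities    c  ON c.city_id     = r.city_id
JOIN countries co ON co.country_id = c.country_id
WHERE co.name = 'Croatia';

-- Denormalized: with a redundant country_id copied into users.
SELECT u.*
FROM users u
WHERE u.country_id = 42;   -- hypothetical id for 'Croatia'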

The question is: what is the best practice for keeping consistency in such a model (btw. I'm using MySQL)?

One way, and probably the fastest, is to ensure consistency at the application level when inserting/updating/deleting data.

Another is stored procedures, in which I really see no advantage. They ensure consistency only if called directly; the consistency breaks as soon as changes are made without going through the procedures.

I have also been looking at triggers... I'm not really sure how to implement them, or how much I would gain in performance.

Anyway, it would be preferable to ensure consistency on the db level.

Any advice?



All the tables from the picture are chained, and with a normalized database, to get (for example) all the users from a specific country I have to join all the tables.

You have to join all the tables, because you use surrogate keys (id numbers), not because the tables are "normalized". Using surrogate keys like id numbers has nothing to do with normalization.

Natural keys and foreign key constraints are all you need to solve your problem.

The easiest way to see how this works is to start with full data, and work completely backwards. Assume all the data is correct.

addr_id  street           street_num       region    city          country
--
1        Babukićeva       3a               10000     Zagreb        Croatia
2        Riva             16               51000     Rijeka        Croatia
3        Andrije Hebranga 2-4              10000     Zagreb        Croatia
4        Andrijevićeva    2                110000    Zagreb        Croatia
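(As a reference point, the addresses table behind that sample data might be declared roughly like this; the column types are my assumptions, not something taken from your schema.)

CREATE TABLE addresses (
    addr_id    INT          NOT NULL,
    street     VARCHAR(100) NOT NULL,
    street_num VARCHAR(10)  NOT NULL,
    region     VARCHAR(20)  NOT NULL,
    city       VARCHAR(50)  NOT NULL,
    country    VARCHAR(50)  NOT NULL,
    PRIMARY KEY (addr_id)
);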

To record facts like "Region '10000' is associated with the city 'Zagreb' in the country 'Croatia'", create a new table, and populate it from this query.

SELECT DISTINCT region, city, country FROM addresses;

The table will look like this.

Table: regions
Primary key: {region, city, country}

region   city      country
--
10000    Zagreb    Croatia
51000    Rijeka    Croatia
110000   Zagreb    Croatia
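One way to create and fill it (a sketch; the types are assumed to match the addresses columns):

CREATE TABLE regions (
    region  VARCHAR(20) NOT NULL,
    city    VARCHAR(50) NOT NULL,
    country VARCHAR(50) NOT NULL,
    PRIMARY KEY (region, city, country)
);

INSERT INTO regions (region, city, country)
SELECT DISTINCT region, city, country FROM addresses;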

Then set a foreign key reference.

ALTER TABLE addresses 
ADD CONSTRAINT FOREIGN KEY        (region, city, country) 
               REFERENCES regions (region, city, country);

To record facts like "City 'Zagreb' is in country 'Croatia'", create a new table, populate it from this query.

SELECT DISTINCT city, country FROM regions;

The table will look like this.

Table: cities
Primary key: {city, country}

city      country
--
Zagreb    Croatia
Rijeka    Croatia

Then set a foreign key reference.

ALTER TABLE regions 
ADD CONSTRAINT FOREIGN KEY       (city, country) 
               REFERENCES cities (city, country);

Repeat for countries. The tables countries, cities, and regions are all-key (every column is part of the key), so they're in 5NF. (They can't have any non-key dependencies, because they have no non-key columns.) In the context of a large area, like all of Europe, it's likely that the table of addresses is also in 5NF.
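The countries step, spelled out the same way (a sketch, using the same assumed types):

CREATE TABLE countries (
    country VARCHAR(50) NOT NULL,
    PRIMARY KEY (country)
);

INSERT INTO countries (country)
SELECT DISTINCT country FROM cities;

ALTER TABLE cities 
ADD CONSTRAINT FOREIGN KEY          (country) 
               REFERENCES countries (country);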

In terms of query performance, this is likely to run rings around your current schema, because it needs no joins.
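For example, assuming your users table ends up referencing addresses by the same natural key (so region, city, and country are columns of users), the query from the question becomes a single-table lookup:

SELECT *
FROM users
WHERE country = 'Croatia';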

You'll probably want to use ON UPDATE CASCADE; you might not want to cascade deletes, though.
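For example, the addresses constraint above, declared with cascading updates (drop and re-add the constraint, or include the clause when you first create it):

ALTER TABLE addresses 
ADD CONSTRAINT FOREIGN KEY        (region, city, country) 
               REFERENCES regions (region, city, country)
               ON UPDATE CASCADE;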


First of all - is it really too slow? Have you tried it? Do you have an app where you dump all users (why?), or are you fetching a single user or a few users now and then? Since you have primary keys on all those ids, retrieval shouldn't be that slow; there is a B-tree in the background, after all.

Secondly, I wouldn't normalize down to the street-number level; you'll hardly get any benefit from that, and you'll probably end up with an almost 1:1 relationship between a user and an address. So move the street number into the user table, or even move the entire address table there. I would probably also fold the region table into the user (these are city regions?) and end up with just the user, city, and country tables.
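A sketch of that reduced schema (all names and types here are assumptions):

CREATE TABLE countries (
    id   INT PRIMARY KEY,
    name VARCHAR(50) NOT NULL
);

CREATE TABLE cities (
    id         INT PRIMARY KEY,
    name       VARCHAR(50) NOT NULL,
    country_id INT NOT NULL,
    FOREIGN KEY (country_id) REFERENCES countries (id)
);

CREATE TABLE users (
    id         INT PRIMARY KEY,
    name       VARCHAR(100) NOT NULL,
    street     VARCHAR(100),
    street_num VARCHAR(10),
    region     VARCHAR(20),           -- the city region, folded into the user row
    city_id    INT NOT NULL,
    FOREIGN KEY (city_id) REFERENCES cities (id)
);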

Then you'd have two joins. If that is still too slow, you can put the redundant country key (or, since we're denormalizing, the country name) in the users table. I'd use triggers to maintain integrity; more precisely, you'd have to write insert and update triggers on users (the update trigger only needs to fire when the regionId/cityId changes), plus an update trigger on country (if you keep the country name in the user table) for the unlikely event that a country's name changes.
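A minimal sketch of those triggers, assuming a users.country_id column and the cities(id, country_id) layout from the schema above:

DELIMITER $$

CREATE TRIGGER users_bi BEFORE INSERT ON users
FOR EACH ROW
BEGIN
    -- copy the country from the chosen city
    SET NEW.country_id = (SELECT country_id FROM cities WHERE id = NEW.city_id);
END$$

CREATE TRIGGER users_bu BEFORE UPDATE ON users
FOR EACH ROW
BEGIN
    -- only refresh the redundant column when the city actually changes
    IF NEW.city_id <> OLD.city_id THEN
        SET NEW.country_id = (SELECT country_id FROM cities WHERE id = NEW.city_id);
    END IF;
END$$

DELIMITER ;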

Performance-wise you wouldn't gain anything with triggers (you'd lose a little), but I suppose inserts and updates on the user table are not so frequent that you'd notice it at all.

Finally, since you didn't explain the nature and size of your (web?) app in detail, just a reminder that you might also want to consider other optimization technologies outside the relational database (caches, NoSQL stores, etc.).


It's not really a lot of joining, assuming you are not trying to denormalize all your data.

On the other hand, most people just have an address table, or (gasp!) keep the address information in the user table. How many countries/cities are you expecting to support, vs. how many users?

