Database: Foreign keys or denormalized tables for immutable data?

2023-02-15 18:05 问答作者：

Given:

Person[id, name, city, state, country]

When am I better off storing city names directly in the main table versus using foreign keys to a separate cities table? For the purpose of this discussion, assume that city names are immutable. Here is what I was thinking:

Option 1: Store values inline.

Person[id, name, city, state, country]

Pro: Easy inserts. Fast queries.
Cons: Increased diskspace/memory usage
Ideal for small values, infrequent duplicates.

Option 2: Foreign keys to a separate table

Person[id, name, city_id, state_id, country_id]
Cities[id, name]
States[id, name]
Countries[id, name]

Pro: Decreased diskspace/memory usage. Potentially easier to cache frequently-used values.
Cons: Inserts are more complicated (does a city already exist or should it be 开发者_如何学Pythoninserted?)
Ideal for large values, frequent duplicates.

To quote http://en.wikipedia.org/wiki/Database_normalization: "The objective is to isolate data so that additions, deletions, and modifications of a field can be made in just one table and then propagated through the rest of the database via the defined relationships." It seems to me that these benefits fly out the window when dealing with immutable data since it never needs to be updated. As for insertion and deletion anomalies, we can use nullable columns (if needed).

What is the best-practice for this case?

Option 2: Foreign keys to a separate table

Person[id, name, city_id, state_id, country_id]
Cities[id, name]
States[id, name]
Countries[id, name]

"Normalization" doesn't mean "replace names with id numbers". You should hunt down whoever taught you that, and poke them in the eye with your finger. (Or better yet, both eyes. Two fingers.)

"Normalization" involves identifying associations between columns ("functional dependencies"), and isolating them in a different table ("projection"). Normalization increases data integrity by reducing or eliminating certain kinds of INSERT, UPDATE, and DELETE anomalies.

Later . . .

Some example data for the table of people . . .

id  name          city   state           country
--
1   John Smith    York   Alabama         United States of America
2   John Doe      York   Maine           United States of America
3   Jane Smith    York   Nebraska        United States of America
4   Jane Doe      York   South Carolina  United States of America

You're right, the original table is in 2NF. It's not in 3NF, because "country" is a fact about {city, state}. So you can replace the original table of people with these two tables.

people
id  name          city   state           
--
1   John Smith    York   Alabama         
2   John Doe      York   Maine           
3   Jane Smith    York   Nebraska        
4   Jane Doe      York   South Carolina  

cities -- key is (city, state)
city   state           country  
--
York   Alabama         United States of America
York   Maine           United States of America
York   Nebraska        United States of America
York   South Carolina  United States of America

Two things to watch: 1) Decomposition removed a column from the original table. 2) Decomposition didn't involve adding an arbitrary id number.

What normal form is each of these tables in now?

You can reduce the storage space needed by replacing country names with country codes. You might begin by storing the ISO country codes along with the other attributes in the "cities" table.

cities -- key is (city, state)
city   state           country                    iso_cc
--
York   Alabama         United States of America    US
York   Maine           United States of America    US
York   Nebraska        United States of America    US
York   South Carolina  United States of America    US

But by adding one column, we've increased the number of functional dependencies from one to four.

{city, state} -> country
{city, state} -> iso_cc
country -> iso_cc
iso_cc -> country

We can remove the two transitive dependencies by creating a table of countries. It makes sense to remove the column "country", and retain the column "iso_cc" for two reasons. The column "iso_cc" is shorter, and humans can read it. Since humans can read it, we usually won't have to join the table "countries".

cities -- key is (city, state)
city   state           iso_cc
--
York   Alabama         US
York   Maine           US
York   Nebraska        US
York   South Carolina  US

countries -- keys are iso_cc and country
iso_cc   country
--
US       United States of America

Note that the table "countries" has two candidate keys. Each column is unique. In my experience, most databases don't enforce both those constraints. Developers who simply replace names with id numbers frequently miss that second one. (That "country" is unique, not only the id number.)

A normalized database will, in addition to your points, standardize the naming of the cities. However, then you need to populate your city table with all the cities in the world. Otherwise, you'll end up with something like New York and New-York.

Unless you are planning on having city information beyond just the name (e.g. size, geo-coordinates, etc), I say stick with string city names. If your code is good, you'll be able to normalize if ever the need comes up.

You could certainly have a Countries table, with PK = ISO code. Just search the net, and you will find a ready made list. I would not care about the rest for adresses, unless your row number is really high (> 10^6), or you plan some specialised usage as Dimitry pointed.
If you decide to go that way, you could end up with a Locations table, with PK = (CountryISO, PostalCode) and extra field = City.
But are you going to save much ?

继续阅读：database-design

Database: Foreign keys or denormalized tables for immutable data?

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？