Database: Foreign keys or denormalized tables for immutable data?
Given:
Person[id, name, city, state, country]
When am I better off storing city names directly in the main table versus using foreign keys to a separate cities
table? For the purpose of this discussion, assume that city names are immutable. Here is what I was thinking:
Option 1: Store values inline.
Person[id, name, city, state, country]
- Pro: Easy inserts. Fast queries.
- Cons: Increased diskspace/memory usage
- Ideal for small values, infrequent duplicates.
Option 2: Foreign keys to a separate table
Person[id, name, city_id, state_id, country_id]
Cities[id, name]
States[id, name]
Countries[id, name]
- Pro: Decreased diskspace/memory usage. Potentially easier to cache frequently-used values.
- Cons: Inserts are more complicated (does a city already exist or should it be 开发者_如何学Pythoninserted?)
- Ideal for large values, frequent duplicates.
To quote http://en.wikipedia.org/wiki/Database_normalization: "The objective is to isolate data so that additions, deletions, and modifications of a field can be made in just one table and then propagated through the rest of the database via the defined relationships." It seems to me that these benefits fly out the window when dealing with immutable data since it never needs to be updated. As for insertion and deletion anomalies, we can use nullable columns (if needed).
What is the best-practice for this case?
Option 2: Foreign keys to a separate table
Person[id, name, city_id, state_id, country_id]
Cities[id, name]
States[id, name]
Countries[id, name]
"Normalization" doesn't mean "replace names with id numbers". You should hunt down whoever taught you that, and poke them in the eye with your finger. (Or better yet, both eyes. Two fingers.)
"Normalization" involves identifying associations between columns ("functional dependencies"), and isolating them in a different table ("projection"). Normalization increases data integrity by reducing or eliminating certain kinds of INSERT, UPDATE, and DELETE anomalies.
Later . . .
Some example data for the table of people . . .
id name city state country
--
1 John Smith York Alabama United States of America
2 John Doe York Maine United States of America
3 Jane Smith York Nebraska United States of America
4 Jane Doe York South Carolina United States of America
You're right, the original table is in 2NF. It's not in 3NF, because "country" is a fact about {city, state}. So you can replace the original table of people with these two tables.
people
id name city state
--
1 John Smith York Alabama
2 John Doe York Maine
3 Jane Smith York Nebraska
4 Jane Doe York South Carolina
cities -- key is (city, state)
city state country
--
York Alabama United States of America
York Maine United States of America
York Nebraska United States of America
York South Carolina United States of America
Two things to watch: 1) Decomposition removed a column from the original table. 2) Decomposition didn't involve adding an arbitrary id number.
What normal form is each of these tables in now?
You can reduce the storage space needed by replacing country names with country codes. You might begin by storing the ISO country codes along with the other attributes in the "cities" table.
cities -- key is (city, state)
city state country iso_cc
--
York Alabama United States of America US
York Maine United States of America US
York Nebraska United States of America US
York South Carolina United States of America US
But by adding one column, we've increased the number of functional dependencies from one to four.
{city, state} -> country
{city, state} -> iso_cc
country -> iso_cc
iso_cc -> country
We can remove the two transitive dependencies by creating a table of countries. It makes sense to remove the column "country", and retain the column "iso_cc" for two reasons. The column "iso_cc" is shorter, and humans can read it. Since humans can read it, we usually won't have to join the table "countries".
cities -- key is (city, state)
city state iso_cc
--
York Alabama US
York Maine US
York Nebraska US
York South Carolina US
countries -- keys are iso_cc and country
iso_cc country
--
US United States of America
Note that the table "countries" has two candidate keys. Each column is unique. In my experience, most databases don't enforce both those constraints. Developers who simply replace names with id numbers frequently miss that second one. (That "country" is unique, not only the id number.)
A normalized database will, in addition to your points, standardize the naming of the cities. However, then you need to populate your city table with all the cities in the world. Otherwise, you'll end up with something like New York and New-York.
Unless you are planning on having city information beyond just the name (e.g. size, geo-coordinates, etc), I say stick with string city names. If your code is good, you'll be able to normalize if ever the need comes up.
You could certainly have a Countries table, with PK = ISO code. Just search the net, and you will find a ready made list. I would not care about the rest for adresses, unless your row number is really high (> 10^6), or you plan some specialised usage as Dimitry pointed.
If you decide to go that way, you could end up with a Locations table, with PK = (CountryISO, PostalCode) and extra field = City.
But are you going to save much ?
精彩评论