How important is database normalization in a very simple database?
I am making a very simple database (mysql) with essentially two types of data, always with a 1 to 1 relationship:
Events
- Sponsor
- Time (Optional)
- Location (City, State)
- Venue (Optional)
- Details URL
Sponsors
- Name
- URL
Cities will be duplicated often, but is there really much value in having a cities table for such a simple database schema?
The database is populated by screen-scraping a website. On this site the city field is popula开发者_运维知识库ted via selecting from a dropdown, so there will not be mistypes, etc and it would be easy to match the records up with a city table. I'm just not sure there would be much of a point even if the users of my database will be searching by city frequently.
Normalize the database now.
It's a lot easier to optimize queries on normalized data than it is to normalize a pile of data.
You say it's simple now - these things have a tendency to grow. Design it right and you'll get the experience of proper design and some future proofing.
I think you are looking at things the wrong way - you should always normalize unless you have a good reason not to.
Trusting your application to maintain data integrity is a needless risk. You say the data is made uniform because it is selected from a dropdown. What if someone hacks on the form and modifies the data, or if your code inadvertently allows a querystring param with the same name?
Where will the city data come from that populates your dropdown box for the user? Wouldn't you want a table for that?
It looks like you are treating Location as one attribute including city and state. Suppose you want to sort or analyse events by state alone rather than city and state? That could be hard to do if you don't have an attribute for state. Logically I would expect state to belong in a city table - although that may depend on exactly how you want to identify cities.
Direct answer: Just because a problem is relatively simple is no reason to not do things to keep it simple. It's a lot easier to walk on my feet than on my hands. I don't recall ever saying, "Oh, I only have to go half a mile, that's a short distance so I might as well walk on my hands."
Longer answer: If you don't keep any information about a city other than it's name, and you don't have a pre-set list of cities (e.g. to build a drop-down), then your schema is already normalized. What would be in a City table other than the city name? (I presume State cannot be dependent on City because you could have two cities with the same name in different states, e.g. Dayton OH and Dayton TN.) The relevant rule of normalization is "no non-key dependencies", that is, you cannot have data that depends on data that is not a key. If you had, say, latitude and longitude of each city, then this data would be repeated in every record that referenced the same city. In that case you would certainly want to break out a separate city table to hold the latitude and longitude. You could, of course, create a "city code" that is an integer or abbreviation that links to a city table. But if there's no other data about a city, I don't see how this gains anything.
Technically, I would assume that City depends on Venue. If the venue is "Rockefeller Center", that implies that the city must be New York. But if venue is optional, this creates problems. One possibility is to have a Venue table that lists venue name, city, and state, and for cases where you don't specify a venue, have an "unspecified" for each city. This would be more textbook correct, but in practice if in most case you do not specify a venu, it would gain little. If most of the time you DO specify a venu, it would probably be a good idea.
Oh, and, is there really a 1:1 relation between event and sponsor? I can believe that an event cannot have more than one sponsor. (In real life, there are plenty of events with multiple sponsors, but maybe for your purposes you only care about a "primary sponsor" or some such.) But does a sponsor never hold more than one event? That seems unlikely.
Why not go ahead and normalize? You write as if there are significant costs of normalizing that outweigh the benefits. It's easier to set it up in a normal form before you populate it than to try and normalize it later.
Also, I wonder about your 1-to-1 relationship. Naively, I would imagine that an event might have multiple sponsors, or that a sponsor might be involved in more than one event. But I don't know your business logic...
ETA: I don't know why I didn't notice this before, but if you are really averse to normalizing your database, and you know that you will always have a 1-to-1 relationship between the events and sponsors, then why would you have the sponsors in a separate table?
It sounds like you may be a little confused about what normalization is and why you would do it.
The answer hinges, IMO, on whether you want to prevent errors during data-entry. If you do, you will need a VENUES table:
VENUES
City
State
VenueName
as well as a CITIES and STATES table. (Note: I've seen situations where the same city occurs multiple times in the same state, usually smaller towns, so CITY/STATE do not comprise a unique dyad. Normally there's a zipcode to disambiguate.)
To prevent situations where the data-entry operator enters a venue for NY NY which is actually in SF CA, you'd need to validate the venue entry to see if such a venue exists in the city/state supplied on the record.
Then you'd need to make CITY/STATE mandatory, and have to write code to rollback the transaction and handle the error.
If you are not concerned about enforcing this sort of accuracy, then you don't really need to have CITY and STATES tables either.
If you are interested in learning about normalization, you should learn what happens when you don't normalize. For each normal form (beyond 1NF) there is an update anomaly that will occur as a consequence of harmful redundancy.
Often it's possible to program around the update anomalies, and sometimes that's more practical than always normalizing to the ultimate degree.
Sometimes, it's possible for a database to get into an inconsistent state due to failure to normalize, and failure to program the application to compensate.
In your example, the best I can come up with is a sort of lame hypotheical. What if the name of a city got mispelled in one row, but spelled correctly in all the others. What if you summarized by city and sponsor? Your output would reflect the error, and diovide one group into two groups. Maybe it would be better if the city were only spelled out once in the database, for better or for worse. At least the grouping for the summary would be correct, even if the name were mispelled.
Is this worth nromalizing for? Hey, it's your project, not mine. You decide
精彩评论