Is normalization required in this case?
Let's say I have a database with a table like this:
id | fundraiser_id | donation_amount | name | sex | university
This is an analogous version of my real table. This table tracks donations during a fundraiser. It is very likely that the same person will donate multiple times for each fundraiser (they are very generous).
A user would enter this data like a spreadsheet. They won't necessarily care that they are entering the same name, sex, and university for multiple rows. After all, they aren't paying for the storage space.
My question is this: should I normalize this table by attempting, on table update, to extract the distinct values of (name, sex, university) and save them in a People table? Then, my new table would look like this:
id | fundraiser_id | donation_amount | people_id
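For concreteness, here is a minimal sketch of what that lookup-table layout could look like (the column names come from above; the types and constraints are assumptions):

```sql
-- Hypothetical DDL for the proposed split; types are guesses.
CREATE TABLE people (
  id         INT AUTO_INCREMENT PRIMARY KEY,
  name       VARCHAR(45) NOT NULL,
  sex        CHAR(1)     NOT NULL,
  university VARCHAR(45) NOT NULL,
  UNIQUE KEY uq_person (name, sex, university)  -- one row per distinct triple
);

CREATE TABLE donations (
  id              INT AUTO_INCREMENT PRIMARY KEY,
  fundraiser_id   INT NOT NULL,
  donation_amount DECIMAL(10,2) NOT NULL,
  people_id       INT NOT NULL,
  FOREIGN KEY (people_id) REFERENCES people (id)
);
```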
Having the user predefine the people they will use in their table is not an option.
I discussed this situation in a previous question, but I felt that I didn't give enough detail. Here are the pros and cons that I (and the others that helped me in that previous question) came up with:
Pros:
- Less wasted space
- More "normal"
- Faster to answer some queries
Cons:
- Takes resources to ensure that I don't add duplicate people to the People table
- Will take a Cron job to kill orphaned entries in the People table
- Extra queries required no matter how I look at it
Thank you for any advice.
Just one more clarification:
I'm not concerned at all about two people having the same name, sex, and university. That information is just displayed and not acted on, so two different people are still one in the same to my application. I am only suggesting this normalization as a way to save space by creating a lookup table.
Given your updated question, an updated answer:
I really don't see any benefit in splitting up this table
id | some_unique_field | name | sex | university
You should definitely split up this table:
id | fundraiser_id | donation_amount | name | sex | university
Into this:
donation
id | fundraiser_id | donation_amount | donator_id
fundraiser
id | charity | ....
donator
id | name | sex | university
This will make it easier to get the amount donated per person.
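For example, the total donated per person then becomes a single aggregate query (a sketch, assuming the table names above):

```sql
-- Total donated per person across all fundraisers.
SELECT d.name, SUM(dn.donation_amount) AS total_donated
FROM donation dn
JOIN donator d ON d.id = dn.donator_id
GROUP BY d.id, d.name
ORDER BY total_donated DESC;
```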
If you want to import your spreadsheet, I'd recommend doing something like this:
Step 1: create a blackhole table like so:
DROP TABLE IF EXISTS `test`.`bh_donations`;
CREATE TABLE `test`.`bh_donations` (
  `fundraiser_name` varchar(45) NOT NULL,
  `donation_amount` decimal(10,2) NOT NULL,
  `name` varchar(45) NOT NULL,
  `sex` char(1) NOT NULL,
  `university` varchar(45) NOT NULL
) ENGINE=BLACKHOLE DEFAULT CHARSET=latin1;
You don't need an id here, but if it simplifies your code, by all means add it.
Step 2: add a trigger to the blackhole table to process it.
DELIMITER $$

CREATE TRIGGER bi_bh_donations BEFORE INSERT ON bh_donations FOR EACH ROW
BEGIN
  DECLARE mydonator_id INTEGER;
  DECLARE myfundraiser_id INTEGER;

  SELECT f.id INTO myfundraiser_id FROM fundraiser f
    WHERE f.name = new.fundraiser_name LIMIT 1;
  IF myfundraiser_id IS NULL THEN BEGIN
    -- Selecting from a non-existent column forces an error, aborting the insert.
    SELECT error_fundraiser_is_unknown FROM table_error;
  END; END IF;

  SELECT d.id INTO mydonator_id FROM donator d
    WHERE d.name = new.name AND d.sex = new.sex AND d.university = new.university
    LIMIT 1;
  IF mydonator_id IS NULL THEN BEGIN
    INSERT INTO donator (name, sex, university)
      VALUES (new.name, new.sex, new.university);
    SELECT LAST_INSERT_ID() INTO mydonator_id;
  END; END IF;

  INSERT INTO donation (fundraiser_id, donation_amount, donator_id)
    VALUES (myfundraiser_id, new.donation_amount, mydonator_id);
END$$

DELIMITER ;
Step 3: use LOAD DATA INFILE to import the data.
LOAD DATA INFILE 'data.csv' INTO TABLE bh_donations
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\r\n'
IGNORE 1 LINES;
Now you can import the Excel data by saving the file as a CSV and using LOAD DATA INFILE to read it into the blackhole table; the trigger takes care of the rest.
See:
blackhole tables: http://dev.mysql.com/doc/refman/5.0/en/blackhole-storage-engine.html
more blackhole: Creative uses for the blackhole engine
triggers: http://dev.mysql.com/doc/refman/5.5/en/triggers.html
load data infile: http://dev.mysql.com/doc/refman/5.5/en/load-data.html
Hope this helps.
Let's try to put together some reasonable sample data. Assume that the column 'id' is the primary key, and that people don't make spelling mistakes.
Table: donations
id fundraiser_id donation_amount name sex university
--
1 100 $100 Kim Stack M Rivier College
2 100 $150 Kim Stack M Rivier College
3 100 $45 Marguerite Meade F Rivier College
4 100 $100 Marie Dew F Rivier College
5 100 $100 Kim Stack F Midway College
6 100 $100 Kim Stack F Mars Hill College
...
98 200 $135 Kim Stack M Rivier College
99 200 $400 Kim Stack M Midway College
Let's try to find a transitive dependency having to do with people.
- name->sex: No, there are males and females named "Kim Stack".
- name->university: No.
- name, sex->university: No, two females named "Kim Stack" go to different colleges.
- sex, university->name: No, there are two different females at Rivier College.
- name, university->sex: No, there's a male and female student named "Kim Stack" at Midway College.
(You can keep going.)
This table has no transitive dependencies, and "id" is the only candidate key: this table is already in 5NF.
As useful as it might be, replacing a name with an id number—or replacing {name, sex, university} with an id number—has nothing to do with normalization.
Your design violates the database normalization principle that: Non-key fields should depend solely on the key column
Your data can be looked up using either the id column or the some_unique_field column; in effect you have two key columns.
In database design, if natural data is available to use as a key, you should use it (e.g., a passport number). In your case, there is no need to have the id column. However, if you must have it (for example, for performance reasons you want a small key instead of a very long value, so that more index entries fit per I/O page and index lookups are faster), then do so.
Your "normalization" attempt actually doesn't address the fundamental problem of having "two keys" - your new table still has this problem.
You should always try to normalize. The only case in which you wouldn't is when you gain something significant by not doing so. Here, you only need a lookup to check whether the data is already in the database, and since those fields would likely be indexed, it would be quite a cheap query. Also, in MySQL (I don't know about other databases at the moment), you can use INSERT IGNORE to avoid that lookup.
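A sketch of that approach (it assumes a UNIQUE index over the three columns, since INSERT IGNORE only skips rows that violate a unique constraint; table and column names are the ones from the question):

```sql
-- A unique constraint is required for INSERT IGNORE to deduplicate:
ALTER TABLE people ADD UNIQUE KEY uq_person (name, sex, university);

-- A duplicate triple is silently skipped instead of raising an error.
INSERT IGNORE INTO people (name, sex, university)
VALUES ('Kim Stack', 'M', 'Rivier College');
```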
About the orphaned entries... you could use a TRIGGER to ensure that you don't leave any orphans, unless rows from the People table are commonly deleted.
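Whether it runs from a cron job or from a trigger, the cleanup itself can be a single multi-table DELETE (a sketch, assuming the table and column names from the question):

```sql
-- Remove people that no donation references any more.
DELETE p FROM people p
LEFT JOIN donations d ON d.people_id = p.id
WHERE d.id IS NULL;
```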
You could create a UNIQUE index on (name, sex, university). That would prevent duplicate entries from getting into the database in the first place.
There's a separate problem, however, which may be of concern to you: detect spelling variations (e.g., "My U." vs. "My Univ." vs. "My University").