Please criticize my proposed database design [closed]
I'm looking to get some feedback on my proposed database schema. Here's the business scenario. We need to maintain a list of a few million households identified by some anonymous ID. We also need to maintain a list of demographics and other attributes that describe those households. The list of attributes will grow over time by appending new attributes from external data files. For example, imagine marketing buys a list of dog lovers and asks for it to be loaded as a new attribute called "Likes Dogs".
Here's my proposed schema. I have separate tables for the households and for the related attributes. In the household table there's an identity column and a hashed identifier that will come from our data suppliers. The hashed value is there for privacy reasons. We're not allowed to know who the households are (e.g. we won't have name, address, etc). There will also be attribute columns for each attribute that indicate in a binary way whether the household has that attribute. The actual definition of the attributes I've separated out into another table.
HOUSEHOLD TABLE
ID | External ID | Attribute 1 | Attribute 2 | Attribute N
0 | hash val | 1 | 0 | 0
1 | hash val | 1 | 0 | 1
2 | hash val | 0 | 0 | 0
3 | hash val | 1 | 0 | 0
4 | hash val | 0 | 1 | 1
5 | hash val | 1 | 0 | 0
N | hash val | 1 | 0 | 0
HOUSEHOLD ATTRIBUTE TABLE
A开发者_开发问答ttributeID | Attribute Name | Attribute Description
0 | Rich | Has income over $x
1 | SUV | Drives SUV
2 | XBox | Owns an XBox
N | Urban | Lives in urban area
When we load new attributes and/or households we're likely receive flat files that indicate the hash val for the household the and the binary values for the attributes. The name and descriptions for the attributes will, I anticipate, be in a meta data document.
That's about all the background I have to go on at this point. I'd appreciate answers that critique my proposed schema in terms of performance (we'll be running a lot of counts against the database from a web gui) and in terms of maintainability (e.g. ease of loading new data & updating existing data). Is my proposed schema how you would approach the problem? Why/Why not? What would your design look like if you would do it differently?
Assume for the moment that we do not need to maintain versioning on these tables (though that may be a future requirement).
EDIT: To make this a little less open-ended I'd like to rephrase the question as follows. Does my database design conform to best practices for the given business scenario I outlined? If not, what should I change?
The HOUSEHOLD TABLE is pretty poor from a maintenance perspective, you have to update the table schema to add a new column each time a new HOUSEHOLD ATTRIBUTE is added, and also holds lots of redundant information.
I would add another table that links households to attributes:
HOUSEHOLD_ATTRIBUTE_RELATIONSHIP
HouseholdID | AttributeID
0 | 0
3 | 2
etc ...
Each row in this table indicates that a specific household has a specific attribute. This way you can add a new attribute without having to update your database schema.
Also, make sure to use foreign key constraints for the relationships between these tables.
You should only start to de-normalize and add redundant data if you start hitting serious performance issues.
A better design would be
Household TABLE (HouseholdID, External ID)
Attribute TABLE (HouseholdID, AttributeID)
lkp_Attribute TABLE (AttributeID, Attribute Name, Attribute Description)
This way the Household table does not have to be changed each time a new attribute is added.
I Think I would just add a column to the table when the new attributes are not that frequent, so in your case just add the column LikesDogs. That keeps things simple.
Another approach would be to normalize the data one step further, so instead of adding columns, you can add rows like this:
Create table houseHoldData as (
HouseHold_Id int, -- points to household row
Attribute_Id int, -- this points to a row in your attribute table
Value double
)
Then you can turn that into more readable format with the PIVOT operator. But I think just adding the columns would be the easiest approach. Don't overdo it if it;s not needed.
GJ
For what it's worth (as JNK says, this may be closed as there is not a "right" answer), I would really discourage having a separate column for each attribute. What if you decide to add a few attributes later? What if you want to remove some? Do you rename columns?
It would be better (in my opinion), to keep your available attributes in a separate table of House/Attribute rows, using foreign keys to point back to your House and Attribute tables.
So in your case you'd have
HouseID | AttributeID
---------------------
0 1
1 1
1 2...N
etc.
精彩评论