What is the best practice to represent multi dimensional data in tables?
I have survey statistics - for each answer (there is an arbitrary number of answers) I need to know the age group and the income group. E.g., answer "a" was answered by 23 people of ages 30 to 35, by 11 people with an income level of 10,000 to 15,000, and by 7 people of ages 30 to 35 with an income level of 10,000 to 15,000 (this can be inferred from the first two requirements). Those groups are predefined.
This is actually a "cube" With: 1st dimension - answer id, 2nd dimension - age group, 3rd dimension - income group. How should I represent it in the DB (using tables)? Can you make a generalization for cases where I need more than 3 dimensions of data? Thank you all in advance.The example has five dimensions: User
, Question
, Answer
, Survey
, and Demographic
. The Demographic
table is here to freeze the income level and age group at the time of the survey -- the user can change income and age over time. The CurrentDemographicID
in the User
table is pre-calculated before the survey, and is later copied to the DemographicKey
of the UserQA
table.
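A minimal DDL sketch of those tables follows; only the table names, CurrentDemographicID, DemographicKey, AgeGroup, and IncomeLevel come from the description above, while the remaining column names and types are assumptions.

create table Survey (
    SurveyID        int primary key,
    SurveyFullName  varchar(200) not null
);

create table Question (
    QuestionID        int primary key,
    QuestionFullName  varchar(200) not null
);

create table Answer (
    AnswerID        int primary key,
    AnswerFullName  varchar(200) not null
);

-- Freezes age group and income level as of the survey date.
create table Demographic (
    DemographicID  int primary key,
    AgeGroup       varchar(20) not null,  -- e.g. '30-35'
    IncomeLevel    varchar(20) not null   -- e.g. '10,000-15,000'
);

-- "User" is a reserved word in many DBMSs, hence the quoting.
create table "User" (
    UserID                int primary key,
    CurrentDemographicID  int not null references Demographic (DemographicID)
);

-- Fact table: one row per answer a user gives in a survey.
create table UserQA (
    SurveyID        int not null references Survey (SurveyID),
    UserID          int not null references "User" (UserID),
    QuestionID      int not null references Question (QuestionID),
    AnswerID        int not null references Answer (AnswerID),
    DemographicKey  int not null references Demographic (DemographicID),
    primary key (SurveyID, UserID, QuestionID)
);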
-- Count respondents per (age group, income level) for one specific
-- survey, question, and answer; the join goes through the frozen
-- DemographicKey that was copied into UserQA at survey time.
select
    AgeGroup
  , IncomeLevel
  , count(1) as NumberOfPeople
from UserQA as f
join Question    as q on q.QuestionID    = f.QuestionID
join Answer      as a on a.AnswerID      = f.AnswerID
join Demographic as d on d.DemographicID = f.DemographicKey
join Survey      as s on s.SurveyID      = f.SurveyID
where QuestionFullName = 'specific question here'
  and AnswerFullName   = 'specific answer here'
  and SurveyFullName   = 'specific survey here'
group by AgeGroup, IncomeLevel
order by AgeGroup, IncomeLevel ;
Also, take a look at this question/answer.
If each answer will always have both dimensions to it, then you can have one table with three columns, and those three columns are your three dimensions.
In general, an n-column table is the relational way to represent n distinct dimensions of information.
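As a hedged illustration of that point (the table and column names here are mine, not the answerer's):

-- Hypothetical three-dimension fact table: one row per observed
-- (answer, age group, income group) combination, plus a measure.
create table AnswerStats (
    AnswerID        int         not null,  -- dimension 1
    AgeGroup        varchar(20) not null,  -- dimension 2
    IncomeGroup     varchar(20) not null,  -- dimension 3
    NumberOfPeople  int         not null,  -- the measure
    primary key (AnswerID, AgeGroup, IncomeGroup)
);

A fourth dimension would simply be a fourth key column (say, Region), and so on for n dimensions.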
EAV might be what you are after.
It is highly probable that whoever voted down devnull's answer has insufficient experience in industrial data modelling, or has experience only in simplistic exercises.
As a generalisation, for multi-dimensional analysis to be effective you need two levels of data modelling:
- A substrate detail capture
- An abstract cube
The answer is indeed EAV as the substrate detail capture. People can be lazy and skip the detail substrate design, but every data-mining tool I have been forced (by management) to evaluate has been inadequate because of its inability to perform time-profiling.
Ultimately, we had to write our own data-mining application, because the amount of kludging needed to make the vendors' software work, albeit ineffectively, equalled the effort of creating one that works effectively. Ours even performed spectacularly when we fed slices of the abstract cube into SAS/Insight. While the vendors' software took an hour, and sometimes 24 hours, to build a cube, we needed our cubes built within 5 minutes and frequently within 15 seconds - effective substrate data modelling is what enables efficient slicing of the data to build the cube.
In industrial practice, we might need to flexibly move and resize the time window of an analysis to match conditions that occurred 6 months ago, or to review the performance of processes over the last financial year.
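A minimal sketch of what that buys you, assuming a hypothetical EAV substrate table Observation (EntityID, Attribute, Value, ObservedAt) - none of these names come from the answer above. One slice of the abstract cube is just a pivot of the substrate over a freely chosen time window:

-- Slicing the cube = pivoting two attributes of the EAV substrate
-- over a chosen time window.
select
    age.Value                    as AgeGroup
  , inc.Value                    as IncomeLevel
  , count(distinct age.EntityID) as NumberOfPeople
from Observation as age
join Observation as inc
  on  inc.EntityID   = age.EntityID
  and inc.Attribute  = 'IncomeLevel'
  and inc.ObservedAt >= '2010-01-01'
  and inc.ObservedAt <  '2010-07-01'
where age.Attribute  = 'AgeGroup'
  and age.ObservedAt >= '2010-01-01'   -- the analysis window can be moved
  and age.ObservedAt <  '2010-07-01'   -- and resized without schema changes
group by age.Value, inc.Value;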
We need to have correct time-phase correlation. Let's say a church has per-night statistics for a particular evangelistic rally: the number of people "saved", the number of "rededications", the number of contributions, the total contribution$, and the attendance - over the 10 nights of the rally.
The church also has statistics on the number of tracts distributed by each volunteer and the location each volunteer was assigned, as well as the radio and TV advertisements and Google ads placed. So the church statistician would need to collect info from each convert and $contributor: location of residence, when and where a tract was received, and when and where an advertisement was encountered. Ideally they would collect the same information from everyone who attended: who told them about the rally, and when and where that person had received a tract or encountered an advertisement. With such info at hand, the statistician would then be able to build a more truthfully matched multi-dimensional cube to let church leaders decide how to organise their future pre-rally outreach effectively.
Data model optimization may not be possible for churches, but it is very possible for a factory full of equipment, robots, and trained operators. Phase-correlated data is also extraordinarily helpful when treating an epidemic.
The EAV model is necessary because the number of parameters collected, and the number of attributes, vary from equipment to equipment, from process to process, and from disease to disease. We cannot afford a separate table for each set of parameters, and frequently a process or piece of equipment collects different sets of parameters depending on the product being processed.
Sometimes the number of parameters collected can run to a thousand. We cannot have a table with a thousand columns, can we? We sometimes even violate data-normalization principles, storing data in a single row or in a blob, because data access in a highly normalized table is inefficient.
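A minimal EAV sketch of that situation; every name here is an illustrative assumption, not taken from any particular system:

-- Header: one row per process run, whatever the equipment or product.
create table ProcessRun (
    RunID        int primary key,
    EquipmentID  int not null,
    ProductID    int not null,
    StartedAt    timestamp not null
);

-- One row per parameter actually collected for that run; a run may
-- carry 5 parameters or a thousand without any schema change.
create table RunParameter (
    RunID      int not null references ProcessRun (RunID),
    Parameter  varchar(100) not null,
    Value      varchar(400) not null,
    primary key (RunID, Parameter)
);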
Further, we also need to version our data sets. Let us say we designed an experiment in 2004. In 2006, we discovered that we needed to include new dimensions and discard some useless ones, so we created a new version of the experiment. Then, when we analyse the performance of the experiment between 2002 and 2008, we need to give proper treatment to the change in the experiment's sets of dimensions. In biological experiments and social-behaviour surveys, version changes would be even more frequent.
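One possible way to capture that versioning (an assumed sketch, not a prescription): attach the set of dimensions to an experiment version rather than to the experiment itself, so an analysis spanning 2002 to 2008 can treat each period with the dimension set that was actually in force.

create table ExperimentVersion (
    ExperimentID   int not null,
    Version        int not null,    -- e.g. 1 for the 2004 design, 2 for 2006
    EffectiveFrom  date not null,
    primary key (ExperimentID, Version)
);

create table ExperimentDimension (
    ExperimentID  int not null,
    Version       int not null,
    Dimension     varchar(100) not null,
    primary key (ExperimentID, Version, Dimension),
    foreign key (ExperimentID, Version)
        references ExperimentVersion (ExperimentID, Version)
);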
I have a relational model of variable dimensionality for attributes and parameters here: http://appdesign.blessedgeek.com/discrete-flow-resource-management. It is not exactly EAV, but it gives an idea of what industrial multi-dimensional data modelling entails.