Data warehouse - Dimension Modeling
I am new to BI/Datawarehousing, and after building some easy samples, I have the need to build a more complex structure. My project initially involved product licenses, and I was measuring how many sold, by month/year and by program, and just counting the number of licenses.
Now the requirement is to introduce jump offs from those metri开发者_如何学Gocs. As in, when you come to a certain group of licenses, they want to see a whole different metrics of those. Such as, if 100 licenses were sold in mar 2011, how many of them installed, activated and cancelled the product. (we track that info, but not in the DW). So, I am looking for the best way to do this...I assume the first thing I have to do is add three dimensions for installed, activated and cancelled - and have three fact tables? Or have one fact table with each license, and have a row for cancelled, installed or activated? (so one license may be repeated). Or have one fact table, with different fields for installed, cancelled, activated? Also, how do you relate one fact table to another? Is it through dimensions, or they can related in some other ways?
Any help would be much appreciated!
EDIT:
Thanks for the post... I was also thinking the second option is probably the correct one. But in this implementation, I have a unique problem. So, one of the facts that is measured is the number of licenses that are sold - by date of course. Lets say I add a row for installed, cancelled, activated. The requirement is for them to be able to see a connected fact. For example, if I add individual rows, given a timeframe, I can tell how many were sold, and how many were installed.
But they want to see given a timeframe, how many were bought, and out of them, how many installed. e.g., if the timeframe is march, and 100 were sold in march, out of those 100, how many were installed - even though they could have installed much later than march, and therefore the row date would be not in the timeframe they are looking at....is this a common problem? how is it solved?
I assume the first thing I have to do is add three dimensions for installed, activated and cancelled - and have three fact tables?
Not really. A license sale is a fact. It has a price.
A license sale has has dimensions like date, product, customer and program.
An "installation" or "activitation" is a state-change event of a license. You have "events" for each license (sale, install, activate, etc.)
So a license has a "sale" fact, an "installation" fact and an "activation" fact. Each of which is (minimally) a relationship with time.
Or have one fact table with each license, and have a row for cancelled, installed or activated? (so one license may be repeated).
This gives the most flexibility, because each event can be rich with multiple dimensions. A sequence of events can be then be organized to provide the history of a license.
This works out very well.
You will often want to create summary tables for simple counts and sums to save having to traverse all events for the most common dashboard metrics.
The requirement is for them to be able to see a connected fact.
Right. You're joining several rows from the fact table together. A row where the event was sold, outer joined with a row where the event was installed outer joined with row where the event was activated, etc. It's just outer joins among the facts.
So. Count of sales in March is easy. Event = "Sale". Time is all the rows where time.month = "march". Easy.
Count of sales in march which became installs. Same "march sales" where clause outer joined with all "install" events for those licenses. Count of "sales" is the same as count(*). Count of installs may be smaller because the outer join puts in some nulls.
Count of sales in march which became activations. The "march sales" where clause outer joined with all "activation" events. Note that the activation has no date constraint.
Or have one fact table, with different fields for installed, cancelled, activated?
This doesn't work out as well because the table's columns dictate a business process. That business process might change and you'll be endlessly tweaking the columns in the fact table.
Having said it doesn't work out "as well" means it doesn't give ultimate flexibility. In some cases, you don't need ultimate flexibility. In some cases, the industry (or regulations) may define a structure that's quite fixed.
Also, how do you relate one fact table to another? Is it through dimensions, or they can related in some other ways?
Dimensions by definition. A fact table only has two things -- measurements and FK's to dimensions.
Some dimensions (like "license instance") are degenerate because the dimension may have almost no usable attributes other than a PK.
So you have an "sold" fact that ties to a license, a optional "installed" fact that ties to a license and an optional "activate" fact that ties to a license. The license is an object ID (the database surrogate key) and -- perhaps -- the license identifier itself (maybe a license serial number or something outside the database).
Please by Ralph Kimball's Data Warehouse Toolkit before doing anything more.
精彩评论