Start and end period in each fact within a data warehouse
I have been asked to add a new table to our data warehouse. Currently, we separate our facts into monthly, quarterly, and yearly tables, with time dimensions for each. Each fact record has one time value. The data is generated in the source system by start and end period, and the end date becomes the time dimension value of the fact record. The flow of the fact into either the month, quarter, or year fact table tells one how to understand the dates in the records and how to use them.
I've been asked to have the new table contain start and end dates in each record. I have been told that this violates a data warehousing principle, but it better represents the way the data is generated and allows for more flexible querying of the data, e.g. for rolling periods, etc.
I'm no data warehouse expert. I understand that a single time dimension per fact is a principle. My question is, what are the ramifications of breaking that principle? In other words, 开发者_StackOverflow中文版what are the arguments against doing so? What problems might I face in the future in doing so? It seems to me that having the start and end periods for each fact does better represent the data, but I admit that I don't know enough to fully evaluate the implications of this design choice. Can anyone perhaps provide some prespective?
Edit: I appreciate these answers. They at least tell me it isn't as bad a practice as I was led to believe. I will clarify one thing about the dates: They don't represent a period of validity, but rather a period of aggregation. Thus, a fact record may represent an average of pounds used of a certain ingredient as calculated for an arbitrary period of months. Don't know if this makes any difference, but there it is.
It might be time to grab a good datawarehousing book, I recommend something from the Kimball Group, Ralph Kimball is pretty much the goto for quickly getting started in datawarehousing. I can elaborate further if it is helpful but I'll start with two points that may help get you turned around and making progress.
It is VERY common to have multiple time dimensions per fact. Someone gave you incorrect information when telling you that violated accepted normal practices. As an example for an "order" fact you would commonly have an order date, a ship date, a delivery date, Period, etc.
If you are using begin and end dates it often indicates you are working with what is known as a type 2 dimension, or a slowly changing dimension. That may not be the case but make sure you understand slowly changing dimensions before you make your decision.
Recording both start and end dates has the advantage that you can represent non-uniform periods of time more easily. It means you can more easily join, aggregate and compare data recorded with different granularity. From your description there doesn't seem to be anything fundamentally "wrong" with what you are proposing. I've implemented similar things before.
I find that the best model for time periods in a table is to use half open intervals. I.e.: the interval is the period represented by StartDate >= x < EndDate. Half open intervals make joins and comparisons easier.
Each fact table has a grain. The grain of a fact table specifies what each row of the table represents -- one transaction or some kind of an aggregate (daily, weekly, monthly..).
I suppose your current tables are aggregates and -- as common in these cases -- each record in the aggregate table has a foreign key to the date dimension that points to the end of the period. So, for example, each record in a weekly aggregate table has one row per week and points to the last day of the week (Saturday or Sunday). Note that having another key for the start of this period would simply be redundant.
Now, if you are looking to allow for flexibility of period-reporting, then you should consider the table grain of one transaction, in other words one row in a table should be one transaction and any date/time FK points to the time of the actual transaction.
The wrong approach would be to mix grains in the same table. Consider the following
FromDateKey ToDateKey Amount
20110327 20110402 700.0
20110329 20110330 200.0
Any sum()
which would include both rows, would double-count the second entry which is already included in the first one.
To summarize, if your monthly, quarterly and yearly aggregations are not fine enough, simply introduce a fact table with a finer grain -- one day aggregate or a single transaction.
Alright. This is the way I handle (will be) the same requirements. I simulate adjustments into my fact table with a new date field that records the event date.
For example, from above
EventDateKey Amount RecordType
20110327 700.0 Source
20110329 -500.0 DW Adjustment
So if you need to aggregate (Sum the amount) your data can use the EventDateKey and work with any period through the same Date dimension. This is complicate because you are simulating a adjustment on your fact table but it gives all the flexiblity you looking for without loosing the grain of the information.
精彩评论