What data structures and algorithms are applied within data warehouse cubes?
I understand that cubes are optimized data structures for aggregating and "slicing" large amounts of data. I just don't know how they are implemented.
开发者_如何学编程I can imagine a lot of this technology is proprietary, but are there any resources that I could use to start implementing my own cube technology?
Set theory and lots of math are probably involved (and welcome as suggestions!), but I'm primarily interested in implementations: the data structures and query algorithms.
Thanks!
There is a fantastic book that describes many internal details of SSAS implementation, including storage and query mechanism details:
http://www.amazon.com/Microsoft-Server-Analysis-Services-Unleashed/dp/0672330016
In a star-schema database, facts are usually acquired and stored at the finest grain.
So let's take the SalesFact example from Figure 10 in http://www.ciobriefings.com/Publications/WhitePapers/DesigningtheStarSchemaDatabase/tabid/101/Default.aspx
Right now, the grain is Product, Time (at a day granularity), Store.
Let's say you want that rolled up by month, pre-aggregated (this particular example is very unlikely to need pre-aggregation, but if the sales were detailed by customer, by minute, pre-aggregation might be necessary).
Then you would have a SalesFactMonthly (or add a grain discrimination to the existing fact table since the dimensions are the same - sometimes in aggregation, you may actually lose dimensions just like you can lose grain, for instance if you only wanted by store and not by product).
ProductID
TimeID (only linking to DayOfMonth = 1)
StoredID
SalesDollars
And you would get this by doing:
INSERT INTO SalesFactMonthly (ProductID, TimeID, StoreID, SalesDollars)
SELECT sf.ProductID
,(SELECT TimeID FROM TimeDimension WHERE Year = td.Year AND Month = td.Month AND DayOfMonth = 1) -- One way to find the single month dimension row
,sf.StoreID
,SUM(sf.SalesDollars)
FROM SalesFact AS sf
INNER JOIN TimeDimension AS td
ON td.TimeID = sf.TimeID
GROUP BY td.Year, td.Month
What happens in cubes is you basically have fine-grain stars and pre-aggregates together - but every implementation is proprietary - sometimes you might not even have the finest-grain data in the cube, so it can't be reported on. But every way you might want to slice the data needs to be stored at that grain, otherwise you can't produce analysis that way.
Generally, a data warehouse uses a relational database, but the tables aren't normalized like an operational relational database.
A data warehouse is subject oriented. Data warehouse subject tables usually have the following characteristics:
Many indexes.
No joins, except to look up tables.
Duplicated data, the subject table is highly denormalized.
Contains derived and aggregated information.
The database tables in a data warehouse are arranged in a star schema. A star schema is basically one subject table with an array of look up tables. The keys of the look up tables are foreign keys in the subject table. If you draw an entity relationship diagram of the subject table, the look up tables would surround the subject table like star points.
As far as the queries, that depends on the subject tables and the number of rows. Generally, expect queries to take a long time (many minutes, sometimes hours).
Here's a general article to get you started: Developing a Data Warehouse Architecture
Here's a high level overview of the design of a star schema: Designing the Star Schema Database
精彩评论