What data structures and algorithms are applied within data warehouse cubes?

2022-12-31 02:32 问答作者：

I understand that cubes are optimized data structures for aggregating and "slicing" large amounts of data. I just don't know how they are implemented.

开发者_如何学编程

I can imagine a lot of this technology is proprietary, but are there any resources that I could use to start implementing my own cube technology?

Set theory and lots of math are probably involved (and welcome as suggestions!), but I'm primarily interested in implementations: the data structures and query algorithms.

Thanks!

There is a fantastic book that describes many internal details of SSAS implementation, including storage and query mechanism details:

http://www.amazon.com/Microsoft-Server-Analysis-Services-Unleashed/dp/0672330016

In a star-schema database, facts are usually acquired and stored at the finest grain.

So let's take the SalesFact example from Figure 10 in http://www.ciobriefings.com/Publications/WhitePapers/DesigningtheStarSchemaDatabase/tabid/101/Default.aspx

Right now, the grain is Product, Time (at a day granularity), Store.

Let's say you want that rolled up by month, pre-aggregated (this particular example is very unlikely to need pre-aggregation, but if the sales were detailed by customer, by minute, pre-aggregation might be necessary).

Then you would have a SalesFactMonthly (or add a grain discrimination to the existing fact table since the dimensions are the same - sometimes in aggregation, you may actually lose dimensions just like you can lose grain, for instance if you only wanted by store and not by product).

ProductID
TimeID (only linking to DayOfMonth = 1)
StoredID
SalesDollars

And you would get this by doing:

INSERT INTO SalesFactMonthly (ProductID, TimeID, StoreID, SalesDollars)
SELECT sf.ProductID
    ,(SELECT TimeID FROM TimeDimension WHERE Year = td.Year AND Month = td.Month AND DayOfMonth = 1) -- One way to find the single month dimension row
    ,sf.StoreID
    ,SUM(sf.SalesDollars)
FROM SalesFact AS sf
INNER JOIN TimeDimension AS td
    ON td.TimeID = sf.TimeID
GROUP BY td.Year, td.Month

What happens in cubes is you basically have fine-grain stars and pre-aggregates together - but every implementation is proprietary - sometimes you might not even have the finest-grain data in the cube, so it can't be reported on. But every way you might want to slice the data needs to be stored at that grain, otherwise you can't produce analysis that way.

Generally, a data warehouse uses a relational database, but the tables aren't normalized like an operational relational database.

A data warehouse is subject oriented. Data warehouse subject tables usually have the following characteristics:

Many indexes.
No joins, except to look up tables.
Duplicated data, the subject table is highly denormalized.
Contains derived and aggregated information.

The database tables in a data warehouse are arranged in a star schema. A star schema is basically one subject table with an array of look up tables. The keys of the look up tables are foreign keys in the subject table. If you draw an entity relationship diagram of the subject table, the look up tables would surround the subject table like star points.

As far as the queries, that depends on the subject tables and the number of rows. Generally, expect queries to take a long time (many minutes, sometimes hours).

Here's a general article to get you started: Developing a Data Warehouse Architecture

Here's a high level overview of the design of a star schema: Designing the Star Schema Database

继续阅读：data-warehouse sql ssas

What data structures and algorithms are applied within data warehouse cubes?

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？