
Efficient way to keep track of an aggregate amount across many columns in SQL Server

This is a long question that I do not know how to summarize...

I have a table that I need to read from that has financial data with almost a billion records of detailed data in it. I cannot change the structure of this table; I am merely a consumer of it. This table has columns such as a transaction date, a bunch of attribute columns with Int data in them describing the transaction (not actually named Attribute1-20, just named that way for simplicity below), and then an Amount column.

TABLE: FinancialData
COLUMNS:
  Id (BigInt IDENTITY)
  TransactionId (Int FK)
  TransactionDate (DateTime)
  Attribute1 (Int)
  Attribute2 (Int)
  .
  .
  Attribute20 (Int)
  Amount (Decimal)

I have a process that needs to summarize this FinancialData table into 2 database tables (one a header table and another a detail table with aggregated amounts) for a user-defined time-line so that a snapshot of the data can be used by other processes. The header table contains one record per user-defined time-line (snapshot) and the detail table contains aggregated amount records across all attributes of the FinancialData table.

TABLE: FinancialHeader 
COLUMNS:
  Id (Int IDENTITY)
  BeginTransactionDate (DateTime)
  EndTransactionDate (DateTime)


TABLE: FinancialDetail
COLUMNS:
 Id (Int IDENTITY)
 FinancialHeaderId (Int FK)
 Attribute1 (Int)
 Attribute2 (Int)
 .
 .
 Attribute20 (Int)
 Amount (Decimal)

To give an example of the process, say there are 20 million records in the FinancialData table with a TransactionDate between 1/1/2010 and 6/30/2010, many of them with redundant attribute values (though with different TransactionId values). If I were to summarize this data into the FinancialHeader and FinancialDetail tables above, I would create one FinancialHeader record with a BeginTransactionDate of 1/1/2010 and an EndTransactionDate of 6/30/2010, and then multiple FinancialDetail records that are child records to the header. The FinancialDetail table aggregates the 20 million records from FinancialData: it basically contains one row per unique combination of Attribute1 - Attribute20 along with a SUM(Amount) to track the total amount for those attributes. Typically 20 million records in the FinancialData table contain about 10,000 unique combinations of attributes, which would then yield 10,000 records in the FinancialDetail table with an aggregated amount. So in my example there would be 1 FinancialHeader record and roughly 10,000 FinancialDetail records created during the process.
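
To make the process concrete, the summarization basically boils down to something like the following (just a rough sketch; Attribute3 through Attribute19 are elided, and @BeginDate/@EndDate stand in for however the user-defined range is supplied):

-- Rough sketch only: one header row for the range, then one detail row per
-- unique attribute combination (@BeginDate/@EndDate assumed to arrive as parameters).
DECLARE @HeaderId int;

INSERT INTO FinancialHeader (BeginTransactionDate, EndTransactionDate)
VALUES (@BeginDate, @EndDate);

SET @HeaderId = SCOPE_IDENTITY();

INSERT INTO FinancialDetail
    (FinancialHeaderId, Attribute1, Attribute2, /* ... */ Attribute20, Amount)
SELECT @HeaderId, Attribute1, Attribute2, /* ... */ Attribute20, SUM(Amount)
FROM FinancialData
WHERE TransactionDate >= @BeginDate
  AND TransactionDate <= @EndDate
GROUP BY Attribute1, Attribute2, /* ... */ Attribute20;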

The question I have relates to storing 20 columns' worth of unique combinations of attribute data... this "snapshot" process I am talking about can be run over and over again by the user, any number of times, for various date ranges, essentially to store amounts for that period in time. So the FinancialDetail table tends to accumulate a lot of data even though it is aggregate data. What I don't like is the fact that there are 20 columns in the FinancialDetail table I created, which I feel may be wasting space. What I was thinking may be a better approach is to store each unique combination of attributes as a row in yet another table, say called FinancialAttribute, that contains an Id column which can then be used as a lookup mechanism for the FinancialDetail table. So the FinancialAttribute table would look like this:

TABLE: FinancialAttribute
COLUMNS: 
  Id (Int IDENTITY)
  Attribute1 (Int)
  Attribute2 (Int)
  .
  .
  Attribute20 (Int)

And the FinancialDetail table would be modified into this:

TABLE: FinancialDetail
COLUMNS:
 Id (Int IDENTITY)
 FinancialHeaderId (Int FK)
 FinancialAttributeId (Int FK)
 Amount (Decimal)   
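
Roughly, the detail insert would then resolve each combination through the lookup table instead of repeating the attribute columns, something like this (a rough sketch; Attribute3 through Attribute19 elided, @HeaderId and the date parameters are placeholders, and it assumes FinancialAttribute already holds every combination in the range):

INSERT INTO FinancialDetail (FinancialHeaderId, FinancialAttributeId, Amount)
SELECT @HeaderId, fa.Id, SUM(fd.Amount)
FROM FinancialData fd
JOIN FinancialAttribute fa
    ON  fa.Attribute1 = fd.Attribute1
    AND fa.Attribute2 = fd.Attribute2
    -- ... AND fa.Attribute20 = fd.Attribute20
WHERE fd.TransactionDate >= @BeginDate
  AND fd.TransactionDate <= @EndDate
GROUP BY fa.Id;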

Is this a pretty common pattern for dealing with aggregation across multiple columns/attributes? Or am I thinking about this in completely the wrong way? I do need to store the data from the FinancialData table in my own local copy somehow, though, because there are several downstream processes that need to process or report on these user-defined time-line snapshots of this financial information.


This is a reasonably common approach to reduce the amount of storage occupied by reporting data - a form of it is at the core of the star-schema pattern used for data-warehouse design, in which (to simplify) measures (the financial amounts in your scenario) are stored in fact tables, and the context-providing static data (your attribute columns) are stored in dimension tables.

An extra requirement will be to maintain the FinancialAttribute table; assuming that new combinations of attributes will appear in the data in future, you will need to add them to FinancialAttribute so that they can be linked to FinancialDetail. Depending on the implementation of your current reporting process, this may be a simple change.
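
For example, picking up combinations that are not yet in FinancialAttribute before each snapshot might look something like this (a sketch against the schemas above; Attribute3 through Attribute19 elided, date parameters assumed):

INSERT INTO FinancialAttribute (Attribute1, Attribute2, /* ... */ Attribute20)
SELECT DISTINCT fd.Attribute1, fd.Attribute2, /* ... */ fd.Attribute20
FROM FinancialData fd
WHERE fd.TransactionDate >= @BeginDate
  AND fd.TransactionDate <= @EndDate
  AND NOT EXISTS
      (SELECT 1
       FROM FinancialAttribute fa
       WHERE fa.Attribute1 = fd.Attribute1
         AND fa.Attribute2 = fd.Attribute2
         /* ... AND fa.Attribute20 = fd.Attribute20 */);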


Wordy question, wordy answer! ;)

I'm no data warehousing expert, so I'm not familiar with the patterns (and anti-patterns) in that area. I'm speaking as just a DB dev who has maybe done something similar.

In my case, we take snapshots from large source tables of prescription drug info. The snapshots are used for downstream analysis and reporting. The users specify snapshot criteria, like date and drug type, which usually affects 2m (vs your 20m) records. This typically compiles down to 120k (vs. your 10k) records. Snapshots are kept indefinitely, as source tables change over time and are NOT historical. I share your concerns about snapshots pulling and storing redundant information.

Your question - are you doing something dumb? Is there a better way?

Conceptually speaking, it's pretty apparent that your factoring is "safe". By this, I mean it's a straightforward transformation that's obviously correct, and it's pretty obvious how to map the factored version back into the original with little pain. From that perspective (conceptual ease), I think it has merits.

As for impact, I'd consider the expected table sizes. My assumptions are:

  • Avg snapshot records = Total unique Attribute combos per snapshot = 10k
  • Total num snapshots over time = 10k
  • Total unique Attribute combos over time = 100k
  • Amount column precision is < 20 digits

So:

FinancialDetail (orig)
Column   | Type     | Avg Size
-------------------------------
ID       | int      | 4
HeaderID | int      | 4
Amount   | decimal  | 9
A1 - A20 | int x 20 | 80
-------------------------------
Total:                97
Expected num rows:    100m
Total expected size:  9GB


FinancialDetail (new)
Column   | Type     | Avg Size
-------------------------------
ID       | int      | 4
HeaderID | int      | 4
AttribID | int      | 4
Amount   | decimal  | 9
-------------------------------
Total:                21
Expected num rows:    100m
Total expected size:  2GB

FinancialAttribute (new)
Column   | Type     | Avg Size
-------------------------------
ID       | int      | 4
A1 - A20 | int x 20 | 80
-------------------------------
Total:                84
Expected num rows:    100k
Total expected size:  8MB

If my assumptions are in the ball park (and my math right), you could be saving 78% on space (21 bytes per row instead of 97, with the new 8MB FinancialAttribute table being negligible next to that). This doesn't include space for indexes or table fill slack, so the actual table sizes will be higher.

Does saving 7GB matter?

  • 9GB is easily managed by a modern disk, and is nothing compared to your master table.
  • The data is (probably) put back together in memory on a query anyway, so maybe no savings over the wire or in caching.
  • Query IO should be a little better, if searching by Attributes
  • Backup / restore time should be better
  • Time spent reindexing, updating statistics, etc. should be better
  • (Covering) indexes based on Attributes are much smaller
  • The query time to insert into your factored tables will go up

You can make your own call on this but it seems to me that your factoring could be worthwhile if space is the #1 concern, even if it's not technically the mostest optimalest solution.

Speaking of efficiency...

If you somehow managed to optimize your Attribute space down to 0 bytes, you'd only save another 0.09% off the original. So I wouldn't further optimize for space there.

On the other hand, simply dropping FinancialDetail.ID and using a PK on (HeaderID, AttributeID) would save you 4.1% off the original. (Assumption: You don't have FKs pointing to this table.)
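
A sketch of what that narrower table could look like (the decimal precision here is just an assumption consistent with the 9-byte estimate above):

-- FinancialDetail without the surrogate ID, keyed by header + attribute.
-- Assumes nothing references the old FinancialDetail.ID column.
CREATE TABLE FinancialDetail
(
    FinancialHeaderId    int            NOT NULL,
    FinancialAttributeId int            NOT NULL,
    Amount               decimal(19, 4) NOT NULL,
    CONSTRAINT PK_FinancialDetail
        PRIMARY KEY CLUSTERED (FinancialHeaderId, FinancialAttributeId)
);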

As far as if there's a better way - I don't know. It would depend on how many snapshots you get, how your snapshots are used, and how fast it needs to be.


Your proposed FinancialAttribute table should not contain a field for each attribute, because

  • It's inefficient storage (many of the attribute values will be NULL)
  • It's hard to find data for an attribute that is a parameter: SELECT ... WHERE attr1=@parm OR attr2=@parm OR attr3=@parm ...

You would do better to normalise further and create:

TABLE: FinancialAttribute
COLUMNS: 
  Id (Int IDENTITY)

TABLE: FinancialAttributeValue
COLUMNS: 
  Id (Int IDENTITY)
  FinancialAttributeID int
  Attribute Int

and if Attribute is anything more than just an identifier:

TABLE: Attribute
COLUMNS: 
  Id (Int IDENTITY)
  AttributeName varchar(50)
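
With that shape, the attribute-as-a-parameter search becomes a simple join instead of a chain of ORs, e.g. (a sketch, assuming FinancialDetail carries a FinancialAttributeId as proposed earlier):

SELECT d.FinancialHeaderId, d.FinancialAttributeId, d.Amount
FROM FinancialDetail d
JOIN FinancialAttributeValue av
    ON av.FinancialAttributeID = d.FinancialAttributeId
WHERE av.Attribute = @parm;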