SQL: Joins vs Denormalization (lots of data)

I know variations of this question have been asked before, but my case may be a little different :-)

So, I am building a site that tracks events. Each event has an id and a value. It is also performed by a user, who has an id, age, gender, city, country and rank. (These attributes are all integers, if it matters.)

I need to be able to quickly get answers to two queries (sketched in SQL after this list):

  • get the number of events from users with a certain profile (for example, males aged 18-25 from Moscow, Russia)
  • get the sum (and maybe the average) of event values from users with a certain profile
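
For concreteness, the two queries might look roughly like this (a sketch only; the table and column names and the integer codes for gender/city/country are assumptions, and the profile columns are written as if they were directly available to the query):

    -- Query 1: number of events from users matching a profile.
    SELECT COUNT(*)
    FROM   events
    WHERE  gender  = 1                    -- e.g. male
      AND  age     BETWEEN 18 AND 25
      AND  city    = 1234                 -- e.g. Moscow
      AND  country = 7;                   -- e.g. Russia

    -- Query 2: sum (and average) of event values for the same profile.
    SELECT SUM(value), AVG(value)
    FROM   events
    WHERE  gender  = 1
      AND  age     BETWEEN 18 AND 25
      AND  city    = 1234
      AND  country = 7;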

Also, data is generated by multiple customers, which, in turn, can have multiple source_ids.

Access pattern: data will mostly be written by collector processes, but when queried (infrequently, via the web UI) the system has to respond quickly.

I expect LOTS of data, certainly more than one table or single server can handle.

I am thinking about grouping events into separate tables per day (e.g. 'events_20111011'). I also want to prefix the table name with the customer id and source id, so that data is isolated, can be trivially discarded (to purge old data), and can be moved around relatively easily (to distribute load to other machines). This way, every such table will have a limited number of rows, say 10M tops.
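
As a rough sketch of that scheme (all names here are illustrative, not part of the actual design):

    -- One table per customer / source / day; purging a day of one
    -- customer's data is then just a DROP TABLE.
    CREATE TABLE cust42_src7_events_20111011 (
        event_id BIGINT NOT NULL PRIMARY KEY,
        user_id  INT    NOT NULL,
        value    INT    NOT NULL
    );

    DROP TABLE cust42_src7_events_20111010;   -- discard an old day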

So, the question is: what to do with the users' attributes?

Option 1, normalized: store them in a separate table and reference it from the event tables (a join sketch follows this list).

  • (pro) No repetition of data.
  • (con) joins, which are expensive (or so I've heard)
  • (con) this requires the user table and the event tables to be on the same server
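
A sketch of what Option 1 could look like (table and column names are assumptions):

    -- Profile attributes live once per user; event queries join to them.
    CREATE TABLE users (
        user_id   INT PRIMARY KEY,
        age       INT,
        gender    INT,
        city      INT,
        country   INT,
        user_rank INT      -- "rank" itself is a reserved word in some engines
    );

    SELECT COUNT(*), SUM(e.value)
    FROM   events_20111011 AS e
    JOIN   users AS u ON u.user_id = e.user_id
    WHERE  u.gender = 1 AND u.age BETWEEN 18 AND 25
      AND  u.city = 1234 AND u.country = 7;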

Option 2, redundant: store the user attributes in the event tables and index them (a schema sketch follows this list).

  • (pro) easier load balancing (self-contained tables can be moved around)
  • (pro) simpler (faster?) queries
  • (con) lots of disk space and memory used for the repeated user attributes and the corresponding indexes
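
And a sketch of Option 2 under the same assumptions: the profile columns are copied into every event row and indexed, so the query needs no join:

    -- Denormalized event table: user attributes repeated per event.
    CREATE TABLE events_20111011 (
        event_id BIGINT NOT NULL PRIMARY KEY,
        user_id  INT    NOT NULL,
        value    INT    NOT NULL,
        age      INT    NOT NULL,
        gender   INT    NOT NULL,
        city     INT    NOT NULL,
        country  INT    NOT NULL
    );

    CREATE INDEX ix_profile ON events_20111011 (country, city, gender, age);

    SELECT COUNT(*), SUM(value)
    FROM   events_20111011
    WHERE  gender = 1 AND age BETWEEN 18 AND 25
      AND  city = 1234 AND country = 7;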


Your design should be normalized; your physical schema may end up denormalized for performance reasons.

Is it possible to do both? There is a reason why SQL Server ships with Analysis Services. Even if you are not in the Microsoft realm, it is a common design to have a transactional system for data entry and day-to-day processing, while a reporting system is available for the kinds of queries that would put a heavy load on the transactional system.

Doing this means you get the best of both worlds: a normalized system for daily operations and a denormalized system for rollup queries.

In most cases nightly updates are fine for reporting systems, but what works best depends on your hours of operation and other factors. I find most 8-to-5 businesses have more than enough time in the evening to update a reporting system.
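
For example, a nightly refresh could be as simple as rolling the previous day's events up into a reporting table (a sketch only; the table names and the join to a users table are assumptions):

    -- Roll yesterday's events up by profile into the reporting system.
    INSERT INTO rpt_events_daily
            (event_date, country, city, gender, age, event_count, value_sum)
    SELECT  '2011-10-11', u.country, u.city, u.gender, u.age,
            COUNT(*), SUM(e.value)
    FROM    events_20111011 AS e
    JOIN    users           AS u ON u.user_id = e.user_id
    GROUP BY u.country, u.city, u.gender, u.age;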


Use an OLAP/data warehousing approach. That is, store your data in the standard normalized way, but also store aggregated versions of the frequently queried data in separate fact tables. The user queries won't be on real-time data, but it is usually worth the performance trade-off.
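
For instance, a pre-aggregated fact table like the hypothetical rpt_events_daily above can answer the profile queries without touching the raw events (a sketch; names are assumptions):

    -- Counts and sums come straight from the rollup; the average is derived.
    SELECT SUM(event_count)                        AS events,
           SUM(value_sum)                          AS total_value,
           1.0 * SUM(value_sum) / SUM(event_count) AS avg_value   -- 1.0 * avoids integer division
    FROM   rpt_events_daily
    WHERE  gender = 1 AND age BETWEEN 18 AND 25
      AND  city = 1234 AND country = 7
      AND  event_date BETWEEN '2011-10-01' AND '2011-10-11';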

Also, if you are using SQL Server Enterprise, I wouldn't roll your own horizontal partitioning scheme (breaking the data into per-day tables). There is partitioning support built into SQL Server that does this for you.
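
For reference, native SQL Server partitioning by day looks roughly like this (a sketch with assumed names and boundary values):

    -- A partition function/scheme splits one events table by date; old
    -- partitions can be switched out or truncated instead of dropping tables.
    CREATE PARTITION FUNCTION pf_events_day (date)
        AS RANGE RIGHT FOR VALUES ('2011-10-10', '2011-10-11', '2011-10-12');

    CREATE PARTITION SCHEME ps_events_day
        AS PARTITION pf_events_day ALL TO ([PRIMARY]);

    CREATE TABLE events (
        event_id   BIGINT NOT NULL,
        event_date DATE   NOT NULL,
        user_id    INT    NOT NULL,
        value      INT    NOT NULL
    ) ON ps_events_day (event_date);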


Please normalize.

Use partitioning and indexing to balance the load.
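
One concrete way to do that, assuming MySQL-style partitioning syntax (a sketch; names are assumptions, and note that MySQL requires the partitioning column to be part of any primary or unique key on the table):

    -- Range-partition a single events table by day and index the
    -- profile columns the reporting queries filter on.
    ALTER TABLE events
        PARTITION BY RANGE (TO_DAYS(event_date)) (
            PARTITION p20111010 VALUES LESS THAN (TO_DAYS('2011-10-11')),
            PARTITION p20111011 VALUES LESS THAN (TO_DAYS('2011-10-12')),
            PARTITION pmax      VALUES LESS THAN MAXVALUE
        );

    CREATE INDEX ix_events_profile ON events (country, city, gender, age);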
