SQL: Joins vs Denormalization (lots of data)

I know variations of this question have been asked before, but my case may be a little different :-)

So, I am building a site that tracks events. Each event has an id and a value. It is also performed by a user, who has an id, age, gender, city, country and rank. (These attributes are all integers, if it matters.)

I need to be able to quickly get answers to two queries (sketched in SQL after this list):

  • get the number of events from users with a certain profile (for example, males aged 18-25 from Moscow, Russia)
  • get the sum (and maybe the average) of event values from users with a certain profile
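
For concreteness, the two queries might look roughly like this (a sketch only; the table and column names and the integer codes for gender/city/country are assumptions, and the profile columns are written as if they were directly available to the query):

    -- Query 1: number of events from users matching a profile.
    SELECT COUNT(*)
    FROM   events
    WHERE  gender  = 1                    -- e.g. male
      AND  age     BETWEEN 18 AND 25
      AND  city    = 1234                 -- e.g. Moscow
      AND  country = 7;                   -- e.g. Russia

    -- Query 2: sum (and average) of event values for the same profile.
    SELECT SUM(value), AVG(value)
    FROM   events
    WHERE  gender  = 1
      AND  age     BETWEEN 18 AND 25
      AND  city    = 1234
      AND  country = 7;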

Also, data is generated by multiple customers, which, in turn, can have multiple source_ids.

Access pattern: data will mostly be written by collector processes, but when queried (infrequently, via the web UI) the system has to respond quickly.

I expect LOTS of data, certainly more than one table or single server can handle.

I am thinking about grouping events into separate tables per day (e.g. 'events_20111011'). I also want to prefix the table name with the customer id and source id, so that data is isolated, can be trivially discarded (to purge old data), and can be moved around relatively easily (to distribute load to other machines). This way, every such table will have a limited number of rows, say 10M tops.
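
As a rough sketch of that scheme (all names here are illustrative, not part of the actual design):

    -- One table per customer / source / day; purging a day of one
    -- customer's data is then just a DROP TABLE.
    CREATE TABLE cust42_src7_events_20111011 (
        event_id BIGINT NOT NULL PRIMARY KEY,
        user_id  INT    NOT NULL,
        value    INT    NOT NULL
    );

    DROP TABLE cust42_src7_events_20111010;   -- discard an old day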

So, the question is: what to do with the users' attributes?

Option 1, normalized: store them in a separate table and reference it from the event tables (a join sketch follows this list).

  • (pro) No repetition of data.
  • (con) joins, which are expensive (or so I've heard)
  • (con) this requires the user table and the event tables to be on the same server
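
A sketch of what Option 1 could look like (table and column names are assumptions):

    -- Profile attributes live once per user; event queries join to them.
    CREATE TABLE users (
        user_id   INT PRIMARY KEY,
        age       INT,
        gender    INT,
        city      INT,
        country   INT,
        user_rank INT      -- "rank" itself is a reserved word in some engines
    );

    SELECT COUNT(*), SUM(e.value)
    FROM   events_20111011 AS e
    JOIN   users AS u ON u.user_id = e.user_id
    WHERE  u.gender = 1 AND u.age BETWEEN 18 AND 25
      AND  u.city = 1234 AND u.country = 7;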

Option 2, redundant: store the user attributes in the event tables and index them (a schema sketch follows this list).

  • (pro) easier load balancing (self-contained tables can be moved around)
  • (pro) simpler (faster?) queries
  • (con) lots of disk space and memory used for the repeated user attributes and the corresponding indexes
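
And a sketch of Option 2 under the same assumptions: the profile columns are copied into every event row and indexed, so the query needs no join:

    -- Denormalized event table: user attributes repeated per event.
    CREATE TABLE events_20111011 (
        event_id BIGINT NOT NULL PRIMARY KEY,
        user_id  INT    NOT NULL,
        value    INT    NOT NULL,
        age      INT    NOT NULL,
        gender   INT    NOT NULL,
        city     INT    NOT NULL,
        country  INT    NOT NULL
    );

    CREATE INDEX ix_profile ON events_20111011 (country, city, gender, age);

    SELECT COUNT(*), SUM(value)
    FROM   events_20111011
    WHERE  gender = 1 AND age BETWEEN 18 AND 25
      AND  city = 1234 AND country = 7;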


Your design should be normalized; your physical schema may end up denormalized for performance reasons.

Is it possible to do both? There is a reason why SQL Server ships with Analysis Services. Even if you are not in the Microsoft realm, it is a common design to have a transactional system for data entry and day-to-day processing, while a reporting system is available for the kinds of queries that would put a heavy load on the transactional system.

Doing this means you get the best of both worlds: a normalized system for daily operations and a denormalized system for rollup queries.

In most cases nightly updates are fine for reporting systems, but what works best depends on your hours of operation and other factors. I find most 8-to-5 businesses have more than enough time in the evening to update a reporting system.
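
For example, a nightly refresh could be as simple as rolling the previous day's events up into a reporting table (a sketch only; the table names and the join to a users table are assumptions):

    -- Roll yesterday's events up by profile into the reporting system.
    INSERT INTO rpt_events_daily
            (event_date, country, city, gender, age, event_count, value_sum)
    SELECT  '2011-10-11', u.country, u.city, u.gender, u.age,
            COUNT(*), SUM(e.value)
    FROM    events_20111011 AS e
    JOIN    users           AS u ON u.user_id = e.user_id
    GROUP BY u.country, u.city, u.gender, u.age;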


Use an OLAP/data warehousing approach. That is, store your data in the standard normalized way, but also store aggregated versions of the frequently queried data in separate fact tables. The user queries won't be on real-time data, but it is usually worth the performance trade-off.
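
For instance, a pre-aggregated fact table like the hypothetical rpt_events_daily above can answer the profile queries without touching the raw events (a sketch; names are assumptions):

    -- Counts and sums come straight from the rollup; the average is derived.
    SELECT SUM(event_count)                        AS events,
           SUM(value_sum)                          AS total_value,
           1.0 * SUM(value_sum) / SUM(event_count) AS avg_value   -- 1.0 * avoids integer division
    FROM   rpt_events_daily
    WHERE  gender = 1 AND age BETWEEN 18 AND 25
      AND  city = 1234 AND country = 7
      AND  event_date BETWEEN '2011-10-01' AND '2011-10-11';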

Also, if you are using SQL Server Enterprise, I wouldn't roll your own horizontal partitioning scheme (breaking the data into per-day tables). There is partitioning support built into SQL Server that does this for you.
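
For reference, native SQL Server partitioning by day looks roughly like this (a sketch with assumed names and boundary values):

    -- A partition function/scheme splits one events table by date; old
    -- partitions can be switched out or truncated instead of dropping tables.
    CREATE PARTITION FUNCTION pf_events_day (date)
        AS RANGE RIGHT FOR VALUES ('2011-10-10', '2011-10-11', '2011-10-12');

    CREATE PARTITION SCHEME ps_events_day
        AS PARTITION pf_events_day ALL TO ([PRIMARY]);

    CREATE TABLE events (
        event_id   BIGINT NOT NULL,
        event_date DATE   NOT NULL,
        user_id    INT    NOT NULL,
        value      INT    NOT NULL
    ) ON ps_events_day (event_date);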


Please normalize.

Use partitioning and indexing to balance the load.
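
One concrete way to do that, assuming MySQL-style partitioning syntax (a sketch; names are assumptions, and note that MySQL requires the partitioning column to be part of any primary or unique key on the table):

    -- Range-partition a single events table by day and index the
    -- profile columns the reporting queries filter on.
    ALTER TABLE events
        PARTITION BY RANGE (TO_DAYS(event_date)) (
            PARTITION p20111010 VALUES LESS THAN (TO_DAYS('2011-10-11')),
            PARTITION p20111011 VALUES LESS THAN (TO_DAYS('2011-10-12')),
            PARTITION pmax      VALUES LESS THAN MAXVALUE
        );

    CREATE INDEX ix_events_profile ON events (country, city, gender, age);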
