Allowing redundant data in the DB for the sake of performance
Let's say you're designing the DB schema for the next stack overflow and more specifically the part of the schema 开发者_开发技巧that handles question ratings.
I assume you'd use a table like:
ratings(question_id, user_id, rating) ... that will both record ratings and make sure no user votes twice on the same question.That table alone could handle rating data but it might result in slow queries.
Taking performance into consideration, would you consider storing the sum of ratings for each question in the questions table, even though this data would be redundant since it's derivative from the data in the ratings table?
I would generally first start with a normalized model, not de-normalizing the sum of ratings in the question table.
Then, when the application is working well enough, I would do some performance testings, to determine whether the application handles load good enough -- compared to the load I expect to have in production.
If it doesn't handle load well enough, I would check for bottlenecks -- and correct the most important ones, until the application does well.
Once the application is in production, if the website has lots opf users, it'll be time to make some additionnal optimizations.
To make things simple :
- Don't over-optimize
- Get your application working
- Once it works, benchmark it
- If / when needed, optimize
In the end, yes, maybe, de-normalizing the sum of ratings to the questions table might help ; but do you need to do it ?
That is the real question ;-)
If you're planning to pre-aggregate tables it would be worth looking at materialised views (indexed views in T-SQL).
In general - it is valid approach to store aggregate values, if you know that data is read much more frequently then written. In this specific case I would also consider making phisical design of the answers table in the way, which makes aggregation cheap. To do so I would make clustered index defined on query_id, answer_id. As a result only several DB pages will be read from the disk to get all answers for the specific query.
精彩评论