DB table optimization: join vs. repeated columns
This is more a matter of preference, but I was wondering what people think would be the optimal way to go. I have a Question, an Answer, and a Point_Answer table (the points table exists because I need to track which user awarded each point).
Table Dump
Question:
id
title
Answer:
id
question_id
user_id
response
Point_Answer:
id
answer_id
user_id
points
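For reference, this layout in rough MySQL-style DDL (the column types are just placeholders; the keys are what matter here):
CREATE TABLE Question (
    id    INT PRIMARY KEY AUTO_INCREMENT,
    title VARCHAR(255) NOT NULL
);

CREATE TABLE Answer (
    id          INT PRIMARY KEY AUTO_INCREMENT,
    question_id INT NOT NULL,   -- FK to Question.id
    user_id     INT NOT NULL,
    response    TEXT
);

CREATE TABLE Point_Answer (
    id        INT PRIMARY KEY AUTO_INCREMENT,
    answer_id INT NOT NULL,     -- FK to Answer.id
    user_id   INT NOT NULL,     -- the user who awarded the point
    points    INT NOT NULL
);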
So with this layout, getting the top answer requires a fairly complicated join:
SELECT t2.id, t2.user_id, t2.response, t2.points
FROM Question AS t1
JOIN (
    SELECT qa.*, SUM(pa.points) AS points
    FROM Answer AS qa
    JOIN Point_Answer AS pa ON pa.answer_id = qa.id
    GROUP BY qa.id
) AS t2 ON t2.question_id = t1.id
WHERE t1.id = %s
ORDER BY t2.points DESC
LIMIT 1
Whereas if I changed it to this:
Question:
id
title
Answer:
id
question_id
user_id
response
points
Point_Answer:
id
answer_id
user_id
points
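The only structural difference from the first layout is the running total on Answer; if I were migrating the existing tables, it would be roughly:
-- assuming the DDL above, add a denormalized running total to Answer
ALTER TABLE Answer ADD COLUMN points INT NOT NULL DEFAULT 0;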
The query would be far less burdensome:
SELECT A.id, A.user_id, A.response, A.points
FROM Question AS Q
JOIN Answer AS A ON A.question_id = Q.id
WHERE Q.id = %s
ORDER BY A.points DESC
LIMIT 1
It would also mean I would have to make sure that whenever a Point_Answer row is added, Answer.points gets updated as well, so basically an extra UPDATE. It boils down to integrity vs. redundancy, with a bit of optimization thrown in. Which would be the better way to go?
It would depend on how slow the first option is, not on the complexity of the join. It would be an extremely poor idea to do this solely because you don't want to write (one time) a more complex query. Performance is the only real reason to do something of this nature.
If the first is unacceptably slow, then a table or field summing the points can be an acceptable denormalization, if and ONLY if you keep the field updated through a trigger rather than from the application (the only way to ensure the accuracy of the denormalized number). You would need to test the solution, including the extra update time, to determine whether you have indeed saved any processing time. This may depend on how often the numbers change. For instance, if you add a second to the update time and save ten seconds on the select, but you have 10,000 updates for every select, this is not a good optimization. However, if you make a report go from an hour to milliseconds and only add a millisecond to the insert or update, it might be acceptable.
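As a minimal sketch, assuming MySQL and the denormalized Answer.points column from the question (the trigger name is just illustrative), such a trigger could look like:
-- keep the denormalized running total in sync whenever a point is recorded
CREATE TRIGGER trg_point_answer_after_insert
AFTER INSERT ON Point_Answer
FOR EACH ROW
UPDATE Answer
SET points = points + NEW.points
WHERE id = NEW.answer_id;
You would need corresponding AFTER UPDATE and AFTER DELETE triggers if points can be changed or revoked.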
There is no way to answer this without actually coding and testing both solutions with a production-level workload and data.
It depends on many factors, most of which are specific to your setup.
The two most important factors are:
- How often you are running the query. Keep in mind that the second solution not only uses more disk space (which could theoretically decrease performance) but also requires you to take care of the denormalized structure when adding records. Although that could be automated using a trigger (depending on the RDBMS), it is still a performance overhead.
- The RDBMS you are using. Your first query may be ugly (I've seen far worse), but are you sure it is slow? The only way to get a definitive answer to that question is to run the query and check with EXPLAIN [query] which query plan your RDBMS uses; see the sketch after this list.
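For example, in MySQL you could prefix the first query with EXPLAIN (42 stands in for the bound question id; this assumes the cleaned-up form of the query above):
EXPLAIN
SELECT t2.id, t2.user_id, t2.response, t2.points
FROM Question AS t1
JOIN (
    SELECT qa.*, SUM(pa.points) AS points
    FROM Answer AS qa
    JOIN Point_Answer AS pa ON pa.answer_id = qa.id
    GROUP BY qa.id
) AS t2 ON t2.question_id = t1.id
WHERE t1.id = 42
ORDER BY t2.points DESC
LIMIT 1;
The plan will show whether the grouping and join are actually hitting indexes or scanning whole tables.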
So basically, I would stick with the first solution. Not having a fully normalized schema is sometimes a good thing, but you should only denormalize your structure if you are sure it will give you a performance boost and you have identified the bottleneck in your application in a production-like environment.
If the query performs reasonably well, I would leave it as is. An ugly, well-performing query beats redundancy in my book.
With the redundancy option, you need to wrap your update statements in a transaction so that everything gets updated together; otherwise, you run the risk of your data getting out of sync.
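A rough sketch of what that looks like, assuming the denormalized schema (the literal ids and point value are placeholders):
START TRANSACTION;

-- record which user awarded the point
INSERT INTO Point_Answer (answer_id, user_id, points)
VALUES (7, 3, 1);

-- keep the running total on Answer in step with the new row
UPDATE Answer
SET points = points + 1
WHERE id = 7;

COMMIT;
If either statement fails, roll back so the total and the detail rows never disagree.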
I've worked with some legacy apps that went the redundancy route without transactions, and when one table didn't get updated for whatever reason, it got messy.