DB table optimization: join vs. repeated columns

This is more a matter of preference, but I was wondering what people think would be the optimal approach. I have a Question, an Answer, and a Point table (the Point table exists because I need to track which user awarded the point).

Table Dump

Question:
  id
  title

Answer:
  id
  question_id
  user_id
  response

Point_Answer:
  id
  answer_id
  user_id
  points

So with this layout, getting the top answer requires a fairly complicated join:

-- total points per answer, then pick the top-scoring answer for this question
SELECT t2.id, t2.user_id, t2.response, t2.points
FROM Question AS t1
JOIN (
    SELECT qa.*, SUM(pa.points) AS points
    FROM Answer AS qa
    JOIN Point_Answer AS pa ON pa.answer_id = qa.id
    GROUP BY qa.id
) AS t2 ON t2.question_id = t1.id
WHERE t1.id = %s
ORDER BY t2.points DESC
LIMIT 1

Whereas if I changed it to this:

Question:
  id
  title

Answer:
  id
  question_id
  user_id
  response
  points

Point_Answer:
  id
  answer_id
  user_id
  points

the query would be much less burdensome:

SELECT A.id, A.user_id, A.response, A.points
FROM Question AS Q
JOIN Answer AS A ON A.question_id = Q.id
WHERE Q.id = %s
ORDER BY A.points DESC
LIMIT 1

This would also mean I'd have to make sure that whenever a Point_Answer row is added, Answer.points gets incremented as well, so basically an extra UPDATE (sketched below). It comes down to "integrity vs. redundancy" plus a bit of optimization. Which would be the better way to go?
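For illustration, the extra write I'm talking about would look roughly like this (a minimal sketch; the %s markers are just query parameters, and the exact statements depend on the application):

-- record the point itself
INSERT INTO Point_Answer (answer_id, user_id, points)
VALUES (%s, %s, %s);

-- the extra UPDATE to keep the cached total in sync
UPDATE Answer
SET points = points + %s
WHERE id = %s;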


It would depend on how slow the first one is, not on the complexity of the join. It would be an extremely poor idea to do this solely because you don't want to write (one time) a more complex query. Performance is the only real reason to do something of this nature.

If the first is unacceptably slow, then a table or field summing the points can be an acceptable denormalization, if and ONLY if you keep the field updated through a trigger rather than from the application (the only way to ensure the accuracy of the denormalized number). You would need to test the solution, including the extra update time, to determine whether you have actually saved any processing time. This may depend on how often the numbers change. For instance, if you add a second to the update time and save ten seconds on the select, but you do 10,000 updates for every select, this is not a good optimization. However, if you make a report go from an hour to milliseconds and only add a millisecond to the insert or update, it might be acceptable.
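As a rough sketch of what such a trigger might look like in MySQL (the trigger name is arbitrary, syntax varies by RDBMS, and in practice you would also want matching UPDATE and DELETE triggers):

DELIMITER //
CREATE TRIGGER trg_point_answer_insert
AFTER INSERT ON Point_Answer
FOR EACH ROW
BEGIN
    -- fold the new points into the denormalized total on the answer
    UPDATE Answer
    SET points = points + NEW.points
    WHERE id = NEW.answer_id;
END //
DELIMITER ;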

There is no way to answer this without actually coding and testing both solutions with production level workload and data.


It depends on many factors, most of which depend on your setup.

The two most important factors are:

  • How often you are running the query. Keep in mind that the second solution not only uses more disk space (which could theoretically decrease performance) but also requires you to take care of the denormalized structure when adding records. Although that could be automated using a trigger (depending on the RDBMS), it is still a performance overhead.
  • The RDBMS you are using. Your first query may be ugly (I've seen much worse), but are you sure it is slow? The only way to get a definitive answer to that question is to run the query and check with EXPLAIN which query plan your RDBMS actually uses (see the example after this list).
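For example, with MySQL you could prefix the query with EXPLAIN and inspect the reported keys and row estimates (42 here is just a stand-in for a real question id):

EXPLAIN
SELECT t2.id, t2.user_id, t2.response, t2.points
FROM Question AS t1
JOIN (
    SELECT qa.*, SUM(pa.points) AS points
    FROM Answer AS qa
    JOIN Point_Answer AS pa ON pa.answer_id = qa.id
    GROUP BY qa.id
) AS t2 ON t2.question_id = t1.id
WHERE t1.id = 42
ORDER BY t2.points DESC
LIMIT 1;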

So basically, I would stick to the first solution. Moving away from a normalized relational schema is sometimes a good thing, but you should only denormalize your structure if you are sure it will give you a performance boost and if you've identified the bottleneck in your application in a production-like environment.


If the query performs reasonably well, I would leave it as is. An ugly, well-performing query beats redundancy in my book.

With the redundancy option, you need to wrap your write statements in a transaction so that everything gets updated together; otherwise, you run the risk of your data getting out of sync.
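Something along these lines, for instance (a sketch only, with placeholder parameters; the exact syntax depends on your database and client library):

START TRANSACTION;

INSERT INTO Point_Answer (answer_id, user_id, points)
VALUES (%s, %s, %s);

UPDATE Answer
SET points = points + %s
WHERE id = %s;

-- the application should ROLLBACK instead if either statement fails
COMMIT;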

I've worked with some legacy apps that went the Redundancy route without transactions, and when one table doesn't get updated for whatever reason, it gets messy.
