开发者

COUNT(*) WHERE vs. SELECT(*) WHERE performance

I am building a forum and I am trying to count all of the posts submitted by each user. Should I use COUNT(*) WHERE user_id = $user_id, or would it be faster if I kept a record of how many posts each user has each time he made a 开发者_如何转开发post and used a SELECT query to find it?

How much of a performance difference would this make? Would there be any difference between using InnoDB and MyISAM storage engines for this?


If you keep a record of how many post a user made, it will definitely be faster.

If you have an index on user field of posts table, you will get decent query speeds also. But it will hurt your database when your posts table is big enough. If you are planning to scale, then I would definitely recommend keeping record of users posts on a specific field.


Storing precalculated values is a common and simple, but very efficient sort of optimization.

So just add the column with amount of comments user has posted and maintain it with triggers or by your application.

The performance difference is:

  • With COUNT(*) you always will have index lookup + counting of results
  • With additional field you'll have index lookup + returning of a number (that already has an answer).

And there will be no significant difference between myisam and innodb in this case


Store the post count. It seems that this is a scalability question, regardless of the storage engine. Would you recalculate the count each time the user submitted a post, or would you run a job to take care of this load somewhere outside of the webserver sphere? What is your post volume? What kind of load can your server(s) handle? I really don't think the storage engine will be the point of failure. I say store the value.


If you have the proper index on user_id, then COUNT(user_id) is trivial.

It's also the correct approach, semantically.


this is really one of those 'trade off' questions.

Realistically, if your 'Posts' table has an index on the 'UserID' column and you are truly only wanting to return the number of posts pers user then using a query based on this column should perform perfectly well.

If you had another table 'UserPosts' for e'g., yes it would be quicker to query that table, but the real question would be 'is your 'Posts' table really so large that you cant just query it for this count. The trade off on both approaches is obviously this:

1) having a separate audit table, then there is an overhead when adding, updating a post 2) not having a separate audit table, then overhead in querying the table directly

My gut instinct is always to design a system to record the data in a sensibly normalised fashion. I NEVER make tables based on the fact that it might be quicker to GET some data for reporting purposes. I would only create them, if the need arised and it was essential to incoroporate them then, i would incorporate it.

At the end of the day, i think unless your 'posts' table is ridiculously large (i.e. more than a few millions of records, then there should be no problem in querying it for a distinct user count, presuming it is indexed correctly, i.e. an index placed on the 'UserID' column.

If you're using this information purely for display purposes (i.e. user jonny has posted 73 times), then it's easy enough to get the info out from the DB once, cache it, and then update it (the cache), when or if a change detection occurs.


Performance on post or performance on performance on count? From a data purist perspective a recorded count is not the same as an actual count. You can watch the front door to an auditorium and add the people that come in and subtract those the leave but what if some sneak in the back door? What if you bulk delete a problem topic? If you record the count then the a post is slowed down to calculate and record the count. For me data integrity is everything and I will count(star) every time. I just did a test on a table with 31 million row for a count(star) on an indexed column where the value had 424,887 rows - 1.4 seconds (on my P4 2 GB development machine as I intentionally under power my development server so I get punished for slow queries - on the production 8 core 16 GB server that count is less than 0.1 second). You can never guard your data from unexpected changes or errors in your program logic. Count(star) is the count and it is fast. If count(star) is slow you are going to have performance issues in other queries. I did star as the symbol caused a format change.


there are a whole pile of trade-offs, so no-one can give you the right answer. but here's an approach no-one else has mentioned:

you could use the "select where" query, but cache the result in a higher layer (memcache for example). so you code would look like:

count = memcache.get('article-count-' + user_id)
if count is None:
    count = database.execute('select ..... where user_id = ' + user_id)
    memcache.put('article-count-' + user_id, count)

and you would also need, when a user makes a new post

memcache.delete('article-count-' + user_id)

this will work best when the article count is used often, but updated rarely. it combines the advantage of efficient caching with the advantage of a normalized database. but it is not a good solution if the article count is needed only rarely (in which case, is optimisation necessary at all?). another unsuitable case is when someone's article count is needed often, but it is almost always a different person.

a further advantage of an approach like this is that you don't need to add the caching now. you can use the simplest database design and, if it turns out to be important to cache this data, add the caching later (without needing to change your schema).

more generally: you don't need to cache in your database. you could also put a cache "around" your database. something i have done with java is to use caching at the ibatis level, for example.

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜