SQL: count many-to-many values, or have it counted every time a new row is added?
I am using MySQL (MyISAM) 5.0.41 and I have this query:
SELECT `x`.`items`.id, `x`.`items`.name, COUNT(*) AS count
FROM `x`.`items` INNER JOIN `x`.`user_items`
ON `x`.`items`.id = `x`.`user_items`.item_id
GROUP BY name HAVING count > 2 ORDER BY count DESC
I have about 36,000 users, 175,000 user_items and 60,000 items, which are constantly added to, so this query is getting a bit slow...
Is it better to:
- have a count field in items and update it periodically (say, each time a user adds an item), or
- run the query like this (slowly)?
Or is there any SQL that will populate the count field for me?
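Something like this one-shot backfill is what I have in mind (a sketch only; it assumes a count INT column added to items, which isn't in my schema yet):

UPDATE items
SET    `count` =
       (
       SELECT COUNT(*)          -- recount this item's rows from scratch
       FROM   user_items ui
       WHERE  ui.item_id = items.id
       );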
Thanks
You can use an intermediate solution:

1. Add a ts DATETIME column to the user_items table, recording the time the user added the item.
2. Add a ts DATETIME column to the users table, marking the point of actuality, along with cnt, the cached count column.
3. Periodically update the users table with the new count and timestamp:

INSERT INTO users (id, ts, cnt)
SELECT  *
FROM    (
        SELECT  user_id, NOW() AS nts, COUNT(*) AS ncnt
        FROM    user_items ui
        WHERE   ui.ts <= NOW()
        GROUP BY
                user_id
        ) q
ON DUPLICATE KEY
UPDATE  ts = nts, cnt = ncnt

4. Invalidate the user's timestamp when a user_items entry is deleted (a trigger sketch follows below).
5. Issue this query to count the items:

SELECT  u.id,
        u.cnt +
        (
        SELECT  COUNT(*)
        FROM    user_items ui
        WHERE   ui.ts > u.ts
                AND ui.user_id = u.id
        ) AS cnt
FROM    users u
This way, only the newly added items need to be counted from the user_items table, which is much faster, and you won't have concurrency issues from updating the cached records too often.
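A minimal sketch of the invalidation in step 4, assuming deletes go through the database and MySQL 5.0's triggers are used (the trigger name and sentinel date are illustrative):

CREATE TRIGGER user_items_ad
AFTER DELETE ON user_items
FOR EACH ROW
  UPDATE users
  SET    ts  = '1970-01-01',   -- push the snapshot point back to the epoch
         cnt = 0               -- so the delta subquery recounts everything
  WHERE  id = OLD.user_id;

With cnt reset to 0 and ts at the epoch, the counting query above falls back to a full recount for that user until the next periodic refresh catches up.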
You should start by indexing user_items.item_id and grouping on it instead of name. Strings are much slower to group by (try it out for yourself), and the index should speed things up a bit more. If that is still too slow, you could run the GROUP BY query first as a derived table and then join it to the items table, in case your DBMS's execution plan isn't doing that by default.
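A sketch of both suggestions, with an assumed index name (adjust identifiers to your schema):

CREATE INDEX idx_user_items_item_id ON user_items (item_id);

-- Group on the integer key first, then join the names back in.
SELECT  i.id, i.name, c.cnt
FROM    (
        SELECT  item_id, COUNT(*) AS cnt
        FROM    user_items
        GROUP BY item_id
        HAVING  COUNT(*) > 2
        ) c
INNER JOIN items i ON i.id = c.item_id
ORDER BY c.cnt DESC;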
That query is pretty much doing a full table scan every time. There is no way around that. Indexes will speed things up by speeding up the join, but the query will just get slower and slower as your data grows.
Storing summary data, like the "count" with the "items", would be the way to go. You can do this with stored procedures or through code. As a double check, you can periodically (e.g. once per day) update all counts so you know they are accurate.
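A sketch of that approach, with a MySQL 5.0 trigger standing in for the "stored procedures or code", and the same assumed `count` column as in the question (names are illustrative):

CREATE TRIGGER user_items_ai
AFTER INSERT ON user_items
FOR EACH ROW
  UPDATE items
  SET    `count` = `count` + 1   -- atomic increment inside the database
  WHERE  id = NEW.item_id;

The daily double check can then be the same full-recount UPDATE sketched in the question.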
My impulse would be to leave the data in something like normal form (in other words, do not increment a "count" field), and then cache the result of the slow query at the application level.
If caching is ineffective, because many people are doing the query and few of them do it twice, then, yes, you can set up a stored procedure that automatically updates some row in some table. The details vary depending on the DB vendor; PostgreSQL, for example, supports this with triggers and stored procedures. This is the only safe way to do it (i.e., within the DB, and not from the application layer) due to race conditions.
Are you really getting all 36,000 users every time you run your query? If you're looking for the source of a performance issue, that could be it right there.
Depending on your RDBMS you could look at things like indexed or materialized views. Including the count as part of the table and trying to maintain it will almost certainly be a mistake, especially with the small size of your database.