Collapsing multiple subqueries into one in Postgres

I have two tables:

CREATE TABLE items
(
  root_id integer NOT NULL,
  id serial NOT NULL,
  -- Other fields...

  CONSTRAINT items_pkey PRIMARY KEY (root_id, id)
)

CREATE TABLE votes
(
  root_id integer NOT NULL,
  item_id integer NOT NULL,
  user_id integer NOT NULL,
  type smallint NOT NULL,
  direction smallint,

  CONSTRAINT votes_pkey PRIMARY KEY (root_id, item_id, user_id, type),
  CONSTRAINT votes_root_id_fkey FOREIGN KEY (root_id, item_id)
      REFERENCES items (root_id, id) MATCH SIMPLE
      ON UPDATE CASCADE ON DELETE CASCADE,
  -- Other constraints...
)

I'm trying to, in a single query, pull out all items of a particular root_id along with a few arrays of user_ids of the users who voted in particular ways. The following query does what I need:

SELECT *,
  ARRAY(SELECT user_id from votes where root_id = i.root_id AND item_id = i.id AND type = 0 AND direction = 1) as upvoters,
  ARRAY(SELECT user_id from votes where root_id = i.root_id AND item_id = i.id AND type = 0 AND direction = -1) as downvoters,
  ARRAY(SELECT user_id from votes where root_id = i.root_id AND item_id = i.id AND type = 1) as favoriters
FROM items i
WHERE root_id = 1
ORDER BY id

The problem is that I'm using three subqueries to get the information I need when it seems like I should be able to do the same in one. I thought that Postgres (I'm using 8.4) might be smart enough to collapse them all into a single query for me, but looking at the explain output in pgAdmin it looks like that's not happening - it's running multiple primary key lookups on the votes table instead. I feel like I could rework this query to be more efficient, but I'm not sure how.

Any pointers?

EDIT: An update to explain where I am now. At the advice of the pgsql-general mailing list, I tried changing the query to use a CTE:

WITH v AS (
  SELECT item_id, type, direction, array_agg(user_id) as user_ids
  FROM votes
  WHERE root_id = 5305
  GROUP BY type, direction, item_id
  ORDER BY type, direction, item_id
)
SELECT *,
  (SELECT user_ids from v where item_id = i.id AND type = 0 AND direction = 1) as upvoters,
  (SELECT user_ids from v where item_id = i.id AND type = 0 AND direction = -1) as downvoters,
  (SELECT user_ids from v where item_id = i.id AND type = 1) as favoriters
FROM items i
WHERE root_id = 5305
ORDER BY id

Benchmarking each of these from my application (I set up each as a prepared statement to avoid spending time on query planning, and then ran each one several thousand times with a variety of root_ids) my initial approach averages 15 milliseconds and the CTE approach averages 17 milliseconds. I was able to repeat this result over a few runs.

When I have some time I'm going to play with jkebinger's and Dragontamer5788's approaches with my test data and see how they work, but I'm also starting a bounty to see if I can get more suggestions.

I should also mention that I'm open to changing my schema (the system isn't in production yet, and won't be for a couple months) if it can speed up this query. I designed my votes table this way to take advantage of the primary key's uniqueness constraint - a given user can both favorite and upvote an item, for example, but not upvote it AND downvote it - but I can relax/work around that constraint if representing these options in a different way makes more sense.

EDIT #2: I've benchmarked all four solutions. Amazingly, Sequel is flexible enough that I was able to write all four without dropping to SQL once (not even for the CASE expressions). Like before, I ran them all as prepared statements, so that query planning time wouldn't be an issue, and ran each one several thousand times. I tested under two scenarios: a worst case with a lot of rows (265 items and 4911 votes), where the relevant rows would be in the cache pretty quickly so CPU usage should be the deciding factor, and a more realistic case where a random root_id was chosen for each run. I wound up with:

Original query  - Typical: ~10.5 ms, Worst case: ~26 ms
CTE query       - Typical: ~16.5 ms, Worst case: ~70 ms
Dragontamer5788 - Typical: ~15 ms,   Worst case: ~36 ms
jkebinger       - Typical: ~42 ms,   Worst case: ~180 ms

I suppose the lesson to take from this right now is that Postgres' query planner is very smart and is probably doing something clever under the surface. I don't think I'm going to spend any more time trying to work around it. If anyone would like to submit another query attempt I'd be happy to benchmark it, but otherwise I think Dragontamer is the winner of the bounty and correct (or closest to correct) answer. Unless someone else can shed some light on what Postgres is doing - that would be pretty cool. :)


There are two questions being asked:

  1. A syntax to collapse multiple subqueries into one.
  2. Optimization.

For #1, I can't get the "complete" thing into a single Common Table Expression, because you're using a correlated subquery on each item. Still, you might have some benefits if you used a common table expression. Obviously, this will depend on the data, so please benchmark to see if it would help.

For #2, because there are three commonly accessed "classes" of items in your table, I expect partial indexes to increase the speed of your query, regardless of whether or not you were able to increase the speed due to #1.

First, the easy stuff then. To add a partial index to this table, I'd do:

CREATE INDEX upvote_vote_index ON votes (type, direction)
WHERE (type = 0 AND direction = 1);

CREATE INDEX downvote_vote_index ON votes (type, direction)
WHERE (type = 0 AND direction = -1);

CREATE INDEX favoriters_vote_index ON votes (type)
WHERE (type = 1);

The smaller these indexes, the more efficient your queries will be. Unfortunately, in my tests, they didn't seem to help :-( Still, you may find a use for them; it depends greatly on your data.
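One possible reason the indexes above did not help: inside each partial index every row has the same type and direction, so those indexed columns cannot narrow the search any further. A variant that may be worth benchmarking, offered as an untested sketch, keys each partial index on the columns the subqueries actually look up (root_id, item_id) and leaves the vote class to the index predicate. The index names are made up for illustration.

```sql
-- Hypothetical index names; the predicates mirror the three subqueries in the question.
CREATE INDEX votes_upvote_item_idx ON votes (root_id, item_id)
WHERE type = 0 AND direction = 1;

CREATE INDEX votes_downvote_item_idx ON votes (root_id, item_id)
WHERE type = 0 AND direction = -1;

CREATE INDEX votes_favorite_item_idx ON votes (root_id, item_id)
WHERE type = 1;
```

The primary key on votes already starts with (root_id, item_id), so any gain may be small. Also note that the planner only uses a partial index when the query's WHERE clause implies the index predicate, so in a prepared statement the type and direction values would need to be literals rather than parameters.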


As for an overall optimization, I'd approach the problem differently. I'd "unroll" the query into this form (using an inner join and using conditional expressions to "split up" the three types of votes), and then use "Group By" and the "array" aggregate operator to combine them. IMO, I'd rather change my application code to accept it in the "unrolled" form, but if you can't change the application code, then the "group by"+aggregate function ought to work.

SELECT array_agg(v.user_id), -- array_agg(anything else you needed), 
    i.root_id, i.id, -- I presume you needed the primary key?
CASE
    WHEN v.type = 0 AND v.direction = 1
        THEN 'upvoter'
    WHEN v.type = 0 AND v.direction = -1
        THEN 'downvoter'
    WHEN v.type = 1
        THEN 'favoriter'
END as vote_type
FROM items i 
    JOIN votes v ON i.root_id = v.root_id AND i.id = v.item_id
WHERE i.root_id = 1 
  AND ((type=0 AND (direction=1 OR direction=-1)) 
       OR type=1)
GROUP BY i.root_id, i.id, vote_type
ORDER BY id

It's still "one step unrolled" compared to your code (vote_type is vertical, while in your case it's horizontal, across the columns). But this seems to be more efficient.


Just a guess, but maybe it could be worth trying:

Maybe Postgres can optimize the query if you create a VIEW of

SELECT user_id from votes where root_id = i.root_id AND item_id = i.id

and then select 3 times from there with the different where-clauses about type and direction.

If that's not helping either, maybe you could fetch the 3 types as additional boolean columns and then only work with one query?
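A rough sketch of how the two ideas might be combined, under some assumptions: a view cannot refer to the outer alias i, so it has to expose root_id and item_id and be filtered by the caller. The view name vote_flags and the is_* column names are hypothetical.

```sql
-- Hypothetical view and column names, for illustration only.
-- Each flag is NULL rather than false when type = 0 and direction is NULL.
CREATE VIEW vote_flags AS
SELECT root_id, item_id, user_id,
       (type = 0 AND direction = 1)  AS is_upvote,
       (type = 0 AND direction = -1) AS is_downvote,
       (type = 1)                    AS is_favorite
FROM votes;

-- One pass over the votes for a root; the application groups rows by item_id.
SELECT item_id, user_id, is_upvote, is_downvote, is_favorite
FROM vote_flags
WHERE root_id = 1
ORDER BY item_id, user_id;
```

Postgres expands a plain view into the calling query at plan time, so the view is mainly a readability aid here; any speedup would come from reading votes once instead of three times.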

I'd be interested to hear if you find a solution. Good luck.


Here's another approach. It has the (possibly) undesirable result of including NULL values in the arrays, but it works in one pass, rather than three. I find it helpful to think of some SQL queries in a map-reduce manner, and case statements are great for that.

select
v.root_id, v.item_id,
array_agg(case when type = 0 AND direction = 1 then user_id else NULL end) as upvoters,
array_agg(case when type = 0 AND direction = -1 then user_id else NULL end) as downvoters,
array_agg(case when type = 1 then user_id else NULL end) as favoriters
from items i
join votes v on i.root_id = v.root_id AND i.id = v.item_id
group by 1, 2

With some sample data, I get this result set:

 root_id | item_id |    upvoters    |    downvoters    |    favoriters    
---------+---------+----------------+------------------+------------------
       1 |       2 | {100,NULL,102} | {NULL,101,NULL}  | {NULL,NULL,NULL}
       2 |       4 | {100,NULL,101} | {NULL,NULL,NULL} | {NULL,100,NULL}

I believe you need Postgres 8.4 to get array_agg, but there has been a recipe for an array_accum function for earlier versions.

There's a discussion on the pgsql-hackers list about how to build a NULL-removing version of array_agg if you're interested.
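For readers on a newer server, the NULL padding can be avoided with the aggregate FILTER clause. This is a sketch that assumes PostgreSQL 9.4 or later, so it will not run on the 8.4 installation from the question. It also swaps in a LEFT JOIN so that items with no votes still appear, and uses COALESCE to return an empty array where the original ARRAY(subquery) would.

```sql
-- Requires PostgreSQL 9.4+ (aggregate FILTER clause); not available on 8.4.
SELECT
  i.root_id,
  i.id AS item_id,
  COALESCE(array_agg(v.user_id) FILTER (WHERE v.type = 0 AND v.direction = 1),
           '{}'::integer[]) AS upvoters,
  COALESCE(array_agg(v.user_id) FILTER (WHERE v.type = 0 AND v.direction = -1),
           '{}'::integer[]) AS downvoters,
  COALESCE(array_agg(v.user_id) FILTER (WHERE v.type = 1),
           '{}'::integer[]) AS favoriters
FROM items i
LEFT JOIN votes v ON v.root_id = i.root_id AND v.item_id = i.id
WHERE i.root_id = 1
GROUP BY i.root_id, i.id
ORDER BY i.id;
```

On 9.3, which has array_remove but not FILTER, wrapping each CASE-based aggregate as array_remove(array_agg(...), NULL) should give a similar result.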
