Elegant PostgreSQL Group by for Ruby on Rails / ActiveRecord
Trying to retrieve an array of ActiveRecord Objects grouped by date with PostgreSQL.
More specifically I'm trying to translate the fol开发者_如何学编程lowing MySQL query:
@posts = Post.all(:group => "date(date)",
:conditions => ["location_id = ? and published = ?", @location.id, true],
:order => "created_at DESC")
I am aware that PostgreSQL interpretation of the SQL standard is stricter than MySQL and that consequently this type of query won't work...and have read a number of posts on StackOverflow and elsewhere on the subject - but none of them seem to be the definitive answer on this subject
I've tried various combinations of queries with group by and distinct clauses without much joy - and for the moment I have a rather inelegant hack which although works makes me blush when I look at it.
What is the proper way to make such a query with Rails and PostgreSQL ? (Ignoring the fact that surely this should be abstracted away at the ActiveRecord Level)
The PostgreSQL feature you want to use here is DISTINCT ON
. There are two basic ways to go about making this query via ActiveRecord.
The first method is to just specify the :select
and :order
options. This works great when you have a fairly simple query with no :joins
or :include
.
Post.all(
:select => 'DISTINCT ON (date::date) *',
:order => 'date::date DESC, created_at DESC'
)
If you have a more complex query where ActiveRecord generates its own SELECT
clause, you can use a subquery to select the target records.
Post.all(
:joins => 'INNER JOIN (SELECT DISTINCT ON (date::date) id FROM posts ORDER BY date::date DESC, created_at DESC) x ON x.id = posts.id'
)
Note that this could be a fair bit slower than the first method depending on your data. I would only use this method if required. Be sure to benchmark with production-like data.
My solution:
def self.columns_list
column_names.collect { |c| "#{table_name}.#{c}" }.join(",")
end
scope :selling, joins(:products).group(columns_list)
Simple and repeatable.
While SQL is quite simple when it comes to answering questions like "when was the most recent post for each day?" it is NOT very straight forward when you ask "which was the most recent post for each day?"
You can't retrieve the latest Post for each day without using a sub SELECT (or multiple SQL statements). This might work for you (use Post.find_by_sql or similar):
SELECT P.*, M.just_day, M.max_created_at
FROM posts P
JOIN (
SELECT date(P2.date) AS just_day, MAX(P2.created_at) AS max_created_at
FROM posts P2
P.location_id='12345' AND P.published=true
GROUP BY date(P2.date)
) AS M
ON AND M.max_created_at = P.created_at
WHERE P.location_id='12345' AND P.published=true
The above SQL statement should be enough if you can be certain that two posts will not have the same value in the created_at column. If you can't guarantee uniqueness in the created at column, then you either need to filter out the duplicates in Ruby (that shouldn't be too inefficient because presumably you will be looping over the list anyway) or you will need to do N+1 SQL statements. (Actually you could do per-row selects, but AFAIK that is just as inefficient as N+1 SQL statements.)
Here is how you could remove duplicates while looping:
last_post = nil
posts.each do |post|
unless post.just_day == last_past.try(:just_day)
# Do stuff
last_post = post
end
end
That said, you could write it nicely with just Ruby/ActiveRecord, if you have few enough days that a SELECT for each day isn't too bad:
days = Post.group("date(date)")
posts = days.each { |day| Post.order('created DESC').where("date(day) = ?", day) }
If you are using pagination (say 10 items per page), then this will require 11 SQL statements for each page. Not ideas, but the simplicity might be worth the inefficiency.
Honestly, if you expect this query to be both run frequently and with a reasonably large data set, then I suggest you add a boolean column called most_recent. The last post from past days won't change. You only need to worry about the posts from today. Just set up a cron job to run a few minutes after the end of the day to update the value for the last day. If you want something more up-to-date, you could have the cron job run every 5 minutes. Or if you need real-time, then add an after_save callback to set most_recent to false for all today's posts that aren't the current post.
This question is similar: MySQL: Getting highest score for a user
精彩评论