Grouping items that were created within a certain time of each other
I have a bunch of products (500k or so) in a database that were created over the last several years, and I'd like to group them together (Rails 2.3.14).
Ideally, they would be considered the same group if:
- They were created by the same company_id
- They were created within 10 minutes of each other
A rough pass at what I'm trying to accomplish:
def self.package_products
Company.each do |company|
package = Package.new
products = Product.find(:all, :conditions => [:company_id = company && created_around_similar_times])
package.contents = first_few_product_descriptions
package.save!
products.update_all(:package_id => package.id)
end
end
To me it smells bad though. I don't like looping through the companies and can't help but think there's a better way to do it. Does anyone have any sql-fu that can group similar items? Basically looking to find products from the same company that were created within 10 minutes of each other and assign them the same package_id.
This is hard to do in pure SQL. I would resort to a plpgsql procedure.
Say, your table looks like this:
(Next time, be so nice as to post a table definition. Worth more than a thousand words.)
create table p (
id serial primary key -- or whatever your primary key is!
, company_id int4 NOT NULL
, create_time timestamp NOT NULL
, for_sale bool NOT NULL
);
Use a plpgsql function like this:
CREATE OR REPLACE FUNCTION f_p_group()
RETURNS void AS
$BODY$
DECLARE
g_id integer := 1;
last_time timestamp;
last_company_id integer;
r p%ROWTYPE;
BEGIN
-- If the table is huge, special settings for these parameters will help
SET temp_buffers = '100MB'; -- more RAM for temp table, adjust to actual size of p
SET work_mem = '100MB'; -- more RAM for sorting
-- create temp table just like original.
CREATE TEMP TABLE tmp_p ON COMMIT DROP AS
SELECT * FROM p LIMIT 0; -- no rows yet
-- add group_id.
ALTER TABLE tmp_p ADD column group_id integer;
-- loop through table, write row + group_id to temp table
FOR r IN
SELECT * -- get the whole row!
FROM p
-- WHERE for_sale -- commented out, after it vanished from the question
ORDER BY company_id, create_time -- group by company_id first, there could be several groups intertwined
LOOP
IF r.company_id <> last_company_id OR (r.create_time - last_time) > interval '10 min' THEN
g_id := g_id + 1;
END IF;
INSERT INTO tmp_p SELECT r.*, g_id;
last_time := r.create_time;
last_company_id := r.company_id;
END LOOP;
TRUNCATE p;
ALTER TABLE p ADD column group_id integer; -- add group_id now
INSERT INTO p
SELECT * FROM tmp_p; -- ORDER BY something?
ANALYZE p; -- table has been rewritten, no VACUUM is needed.
END;
$BODY$
LANGUAGE plpgsql;
Call once, then discard:
SELECT f_p_group();
DROP FUNCTION f_p_group();
Now, all members of a group as per your definition share a group_id.
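To sanity-check the grouping rule on a small sample outside the database, here is a minimal plain-Ruby sketch of the same sweep: sort by company and creation time, then start a new group whenever the company changes or the gap to the previous row exceeds 10 minutes. Row, MAX_GAP and assign_group_ids are hypothetical names made up for this illustration; they are not part of the question's models or of the function above. Note that gaps are chained, exactly as in the plpgsql loop: rows created 9 minutes apart keep extending the same group, even if the first and last row are more than 10 minutes apart.

```ruby
# Hypothetical stand-in for a products row (not the real ActiveRecord model).
Row = Struct.new(:id, :company_id, :created_at)

MAX_GAP = 10 * 60 # 10 minutes, in seconds

# Returns a hash mapping product id => group id.
# Mirrors the loop in the plpgsql function: rows are ordered by
# company_id and creation time, and the counter is bumped on a
# company change or a gap larger than max_gap.
def assign_group_ids(rows, max_gap = MAX_GAP)
  group_id = 1
  last = nil
  groups = {}
  rows.sort_by { |r| [r.company_id, r.created_at] }.each do |row|
    if last && (row.company_id != last.company_id ||
                row.created_at - last.created_at > max_gap)
      group_id += 1
    end
    groups[row.id] = group_id
    last = row
  end
  groups
end

sample = [
  Row.new(1, 1, Time.utc(2011, 1, 1, 12, 0)),
  Row.new(2, 1, Time.utc(2011, 1, 1, 12, 9)),  # 9 min after row 1
  Row.new(3, 1, Time.utc(2011, 1, 1, 12, 18)), # 9 min after row 2
  Row.new(4, 1, Time.utc(2011, 1, 1, 12, 40)), # 22 min gap
  Row.new(5, 2, Time.utc(2011, 1, 1, 12, 41))  # different company
]
p assign_group_ids(sample)
```

If chained gaps are not what you want (say, a group should never span more than 10 minutes in total), the comparison would have to be made against the first row of the current group instead of the previous row.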
Edit after question edit
I put in a couple more things:
- Read the table into a temporary table (ordering in the process), do all the updates there, truncate the original table, add group_id and write the updated rows from the temp table in one go. Should be much faster, and no vacuum is needed afterwards. But you need some RAM for that.
- for_sale is ignored in the query, since it is no longer in the question.
- Read about %ROWTYPE.
- Read about work_mem and temp_buffers.
- TRUNCATE, ANALYZE, TEMP TABLE, ALTER TABLE, ... all in the fine manual
- I tested it with PostgreSQL 9.0. It should work in 8.4 through 9.0, and probably in older versions too.