Grouping items that were created within a certain time of each other

2023-04-10 11:45 Q&A author:

I have a bunch of products (500k or so) in a database that were created over the last several years, and I'd like to group them together (Rails 2.3.14).

Ideally, they would be considered the same group if:

They were created by the same company_id
They were created within 10 minutes of each other

A rough pass at what I'm trying to accomplish:

def self.package_products
  Company.each do |company|
    package = Package.new
    products = Product.find(:all, :conditions => [:company_id = company && created_around_similar_times])
    package.contents = first_few_product_descriptions
    package.save!
    products.update_all(:package_id => package.id)
  end
end

To me it smells bad though. I don't like looping through the companies and can't help but think there's a better way to do it. Does anyone have any sql-fu that can group similar items? Basically looking to find products from the same company that were created within 10 minutes of each other and assign them the same package_id.

This is hard to do in pure SQL. I would resort to a plpgsql procedure.
Say, your table looks like this:
(Next time, be so nice as to post a table definition. It is worth more than a thousand words.)

create table p (
  id serial primary key     -- or whatever your primary key is!
, company_id int4 NOT NULL
, create_time timestamp NOT NULL
, for_sale bool NOT NULL
);

Use a plpgsql function like this:

CREATE OR REPLACE FUNCTION f_p_group()
  RETURNS void AS
$BODY$
DECLARE
    g_id             integer := 1;
    last_time        timestamp;
    last_company_id  integer;
    r                p%ROWTYPE;
BEGIN

-- If the table is huge, special settings for these parameters will help
SET temp_buffers = '100MB';   -- more RAM for temp table, adjust to actual size of p
SET work_mem = '100MB';       -- more RAM for sorting

-- create temp table just like original.
CREATE TEMP TABLE tmp_p ON COMMIT DROP AS
SELECT * FROM p LIMIT 0;      -- no rows yet

-- add group_id.
ALTER TABLE tmp_p ADD column group_id integer;

-- loop through table, write row + group_id to temp table
FOR r IN
    SELECT *                  -- get the whole row!
      FROM p
--   WHERE for_sale       -- commented out, after it vanished from the question
     ORDER BY company_id, create_time -- group by company_id first, there could be several groups intertwined

LOOP
    IF r.company_id <> last_company_id OR (r.create_time - last_time) > interval '10 min' THEN
        g_id := g_id + 1;
    END IF;

    INSERT INTO tmp_p SELECT r.*, g_id;

    last_time       := r.create_time;
    last_company_id := r.company_id;
END LOOP;

TRUNCATE p;
ALTER TABLE p ADD column group_id integer; -- add group_id now

INSERT INTO p
SELECT * FROM tmp_p;          -- ORDER BY something?

ANALYZE p;                    -- table has been rewritten, no VACUUM is needed.

END;
$BODY$
  LANGUAGE plpgsql;

Call once, then discard:

SELECT f_p_group();

DROP FUNCTION f_p_group();

Now, all members of a group as per your definition share a group_id.
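To make the grouping rule easier to see, here is a minimal plain-Ruby sketch of the same loop, run over in-memory rows instead of the table. The Row struct and the assign_group_ids helper are hypothetical names for illustration only. They are not part of the answer's function or of ActiveRecord.

```ruby
# Hypothetical in-memory stand-in for rows of table p.
Row = Struct.new(:id, :company_id, :create_time)

# Returns a hash mapping row id => group_id, using the same rule as the loop
# in f_p_group(): a new group starts when the company changes or when the
# gap to the previous row (in sorted order) exceeds max_gap seconds.
def assign_group_ids(rows, max_gap = 600)
  group_id = 1
  previous = nil
  result = {}

  # Same ordering as the SQL: company first, then creation time.
  rows.sort_by { |r| [r.company_id, r.create_time] }.each do |row|
    if previous &&
       (row.company_id != previous.company_id ||
        row.create_time - previous.create_time > max_gap)
      group_id += 1
    end
    result[row.id] = group_id
    previous = row
  end

  result
end
```

Note that the rule chains: two rows of the same company created 14 minutes apart still land in the same group if a third row sits between them with gaps of at most 10 minutes on either side. That is how the plpgsql loop behaves as well, since it only compares each row with the one before it.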

Edit after question edit

I put in a couple more things:

Read the table into a temporary table (ordering in the process), do all the updates there, truncate the original table, add group_id, and write the updated rows back from the temp table in one go. This should be much faster, and no vacuum is needed afterwards. But you need some RAM for that.
for_sale is ignored in the query, since it is no longer in the question.
Read about %ROWTYPE.
Read about work_mem and temp_buffers.
TRUNCATE, ANALYZE, TEMP TABLE, ALTER TABLE, ... are all in the fine manual.
I tested it with pg 9.0. It should work in 8.4 - 9.0 and probably in older versions too.
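Once every row carries a group_id, the packaging step from the question (one package per group, described by its first few products) could look roughly like the sketch below. It is plain Ruby over in-memory objects. GroupedProduct, its description field and package_contents are hypothetical names, and a real Rails 2.3 version would load and save records through the models instead.

```ruby
# Hypothetical stand-in for a product row that already has a group_id.
GroupedProduct = Struct.new(:id, :group_id, :description)

# Builds one summary string per group from the first `limit` descriptions,
# loosely mirroring `package.contents = first_few_product_descriptions`
# from the question.
def package_contents(products, limit = 3)
  products.group_by(&:group_id).each_with_object({}) do |(group_id, members), out|
    out[group_id] = members.first(limit).map(&:description).join(", ")
  end
end
```

Each key of the returned hash corresponds to one package, so the company loop from the question is no longer needed: the group_id already separates companies.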
