SQLite: how to merge rows that match in alternating columns?

2023-03-30 16:10 问答作者：

So let's suppose that we're working with a sqlite table that looks (roughly) like this:

   id       date1     date2
+-------+----------+----------+
|  foo  |10/01/2010|01/01/2011|
+-------+----------+----------+
|  bar  |07/01/2010|10/01/2010|
+-------+----------+----------+
...      ...        ...

etc... I'm trying to somehow merge those rows with the same id and a combination of date1 and date2 values which happen to specify a range that would otherwise be continuous, were it not scattered over multiple rows. In other words, this:

   id       date1     date2
+-------+----------+----------+
|  foo  |07/01/2010|10/01/2010|
+-------+----------+----------+
|  foo  |10/01/2010|01/01/2011|
+-------+----------+----------+

would become:开发者_JS百科

   id       date1     date2
+-------+----------+----------+
|  foo  |07/01/2010|01/01/2011|
+-------+----------+----------+

and so on for cases where you have 3 (or more) bar's each mapped to three (or more) different, although altogether continuous, ranges. What'd such a query look like? I haven't been able to come up with any reasonable solutions so far, though I'm not much of a SQLista myself.

I realize sqlite does not support analytic functions, but... Here is a potential sql solution that uses analytic functions. I ran this in Postgresql.

CREATE TABLE test(id VARCHAR(16), date1 DATE, date2 DATE);

INSERT INTO test VALUES('foo', '2011-01-01', '2011-01-15');
INSERT INTO test VALUES('bar', '2011-01-02', '2011-01-04');
INSERT INTO test VALUES('bar', '2011-01-05', '2011-01-10'); -- not contiguous
INSERT INTO test VALUES('foo', '2011-01-25', '2011-01-30');
INSERT INTO test VALUES('foo', '2011-01-15', '2011-01-18'); -- contiguous
INSERT INTO test VALUES('foo', '2011-01-28', '2011-01-31'); -- overlap
INSERT INTO test VALUES('bar', '2011-01-07', '2011-01-08'); -- subset chopped

postgres=# SELECT * FROM test ORDER BY id, date1;
 id  |   date1    |   date2
-----+------------+------------
 bar | 2011-01-02 | 2011-01-04
 bar | 2011-01-05 | 2011-01-10
 bar | 2011-01-07 | 2011-01-08
 foo | 2011-01-01 | 2011-01-15
 foo | 2011-01-15 | 2011-01-18
 foo | 2011-01-25 | 2011-01-30
 foo | 2011-01-28 | 2011-01-31
(7 rows)

SELECT id
      ,MIN(date1) AS date1
      ,MAX(date2) AS date2
  FROM ( SELECT id, date1, date2, previous_date1, previous_date2
               ,SUM( CASE WHEN date1 > previous_date2 THEN 1 ELSE 0 END ) OVER(PARTITION BY id ORDER BY id, date1) AS group_id
           FROM ( SELECT id, date1, date2
                        ,COALESCE( LAG(date1) OVER (PARTITION BY id ORDER BY id, date1), date1 )  previous_date1
                        ,COALESCE( LAG(date2) OVER (PARTITION BY id ORDER BY id, date1), date2 )  previous_date2
                    FROM test
                    ORDER BY id, date1, date2
                ) AS x
       ) AS y
  GROUP BY id, group_id
  ORDER BY 1,2;

 id  |   date1    |   date2
-----+------------+------------
 bar | 2011-01-02 | 2011-01-04
 bar | 2011-01-05 | 2011-01-10
 foo | 2011-01-01 | 2011-01-18
 foo | 2011-01-25 | 2011-01-31
(4 rows)

EXPLANATION

Working from the inside out, first sort the rows by id and date, and add two extra columns to each row to indicate the previous row's date1 and date2 values.

 id  |   date1    |   date2    | previous_date1 | previous_date2
-----+------------+------------+----------------+----------------
 bar | 2011-01-02 | 2011-01-04 | 2011-01-02     | 2011-01-04
 bar | 2011-01-05 | 2011-01-10 | 2011-01-02     | 2011-01-04
 bar | 2011-01-07 | 2011-01-08 | 2011-01-05     | 2011-01-10
 foo | 2011-01-01 | 2011-01-15 | 2011-01-01     | 2011-01-15
 foo | 2011-01-15 | 2011-01-18 | 2011-01-01     | 2011-01-15
 foo | 2011-01-25 | 2011-01-30 | 2011-01-15     | 2011-01-18
 foo | 2011-01-28 | 2011-01-31 | 2011-01-25     | 2011-01-30
(7 rows)

Then flag every row that has an overlap (between date1 and previous_date1), summing these flags within the "id" grouping gives us a sub-grouping of the ids.

 id  |   date1    |   date2    | previous_date1 | previous_date2 | flag | group_id
-----+------------+------------+----------------+----------------+------+----------
 bar | 2011-01-02 | 2011-01-04 | 2011-01-02     | 2011-01-04     |    0 |        0
 bar | 2011-01-05 | 2011-01-10 | 2011-01-02     | 2011-01-04     |    1 |        1
 bar | 2011-01-07 | 2011-01-08 | 2011-01-05     | 2011-01-10     |    0 |        1
 foo | 2011-01-01 | 2011-01-15 | 2011-01-01     | 2011-01-15     |    0 |        0
 foo | 2011-01-15 | 2011-01-18 | 2011-01-01     | 2011-01-15     |    0 |        0
 foo | 2011-01-25 | 2011-01-30 | 2011-01-15     | 2011-01-18     |    1 |        1
 foo | 2011-01-28 | 2011-01-31 | 2011-01-25     | 2011-01-30     |    0 |        1
(7 rows)

Now we can group by id and the generated "group_id".

Maybe kind of crazy. I'm not sure that I would actually want to use this kind of solution because it could be hard to test, verify, document, and especially maintain a few years down the road. But I still think it is neat the things that can be done with sql.

Do you specifically need to do this using a (single) SQL query? If not, my advice is to take your language of choice and write a one-off script to perform this transformation of your data.

继续阅读：sql

SQLite: how to merge rows that match in alternating columns?

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？