开发者

SQLite: how to merge rows that match in alternating columns?

So let's suppose that we're working with a sqlite table that looks (roughly) like this:

   id       date1     date2
+-------+----------+----------+
|  foo  |10/01/2010|01/01/2011|
+-------+----------+----------+
|  bar  |07/01/2010|10/01/2010|
+-------+----------+----------+
...      ...        ...

etc... I'm trying to somehow merge those rows with the same id and a combination of date1 and date2 values which happen to specify a range that would otherwise be continuous, were it not scattered over multiple rows. In other words, this:

   id       date1     date2
+-------+----------+----------+
|  foo  |07/01/2010|10/01/2010|
+-------+----------+----------+
|  foo  |10/01/2010|01/01/2011|
+-------+----------+----------+

would become:开发者_JS百科

   id       date1     date2
+-------+----------+----------+
|  foo  |07/01/2010|01/01/2011|
+-------+----------+----------+

and so on for cases where you have 3 (or more) bar's each mapped to three (or more) different, although altogether continuous, ranges. What'd such a query look like? I haven't been able to come up with any reasonable solutions so far, though I'm not much of a SQLista myself.


I realize sqlite does not support analytic functions, but... Here is a potential sql solution that uses analytic functions. I ran this in Postgresql.

CREATE TABLE test(id VARCHAR(16), date1 DATE, date2 DATE);

INSERT INTO test VALUES('foo', '2011-01-01', '2011-01-15');
INSERT INTO test VALUES('bar', '2011-01-02', '2011-01-04');
INSERT INTO test VALUES('bar', '2011-01-05', '2011-01-10'); -- not contiguous
INSERT INTO test VALUES('foo', '2011-01-25', '2011-01-30');
INSERT INTO test VALUES('foo', '2011-01-15', '2011-01-18'); -- contiguous
INSERT INTO test VALUES('foo', '2011-01-28', '2011-01-31'); -- overlap
INSERT INTO test VALUES('bar', '2011-01-07', '2011-01-08'); -- subset chopped

postgres=# SELECT * FROM test ORDER BY id, date1;
 id  |   date1    |   date2
-----+------------+------------
 bar | 2011-01-02 | 2011-01-04
 bar | 2011-01-05 | 2011-01-10
 bar | 2011-01-07 | 2011-01-08
 foo | 2011-01-01 | 2011-01-15
 foo | 2011-01-15 | 2011-01-18
 foo | 2011-01-25 | 2011-01-30
 foo | 2011-01-28 | 2011-01-31
(7 rows)

SELECT id
      ,MIN(date1) AS date1
      ,MAX(date2) AS date2
  FROM ( SELECT id, date1, date2, previous_date1, previous_date2
               ,SUM( CASE WHEN date1 > previous_date2 THEN 1 ELSE 0 END ) OVER(PARTITION BY id ORDER BY id, date1) AS group_id
           FROM ( SELECT id, date1, date2
                        ,COALESCE( LAG(date1) OVER (PARTITION BY id ORDER BY id, date1), date1 )  previous_date1
                        ,COALESCE( LAG(date2) OVER (PARTITION BY id ORDER BY id, date1), date2 )  previous_date2
                    FROM test
                    ORDER BY id, date1, date2
                ) AS x
       ) AS y
  GROUP BY id, group_id
  ORDER BY 1,2;

 id  |   date1    |   date2
-----+------------+------------
 bar | 2011-01-02 | 2011-01-04
 bar | 2011-01-05 | 2011-01-10
 foo | 2011-01-01 | 2011-01-18
 foo | 2011-01-25 | 2011-01-31
(4 rows)

EXPLANATION

Working from the inside out, first sort the rows by id and date, and add two extra columns to each row to indicate the previous row's date1 and date2 values.

 id  |   date1    |   date2    | previous_date1 | previous_date2
-----+------------+------------+----------------+----------------
 bar | 2011-01-02 | 2011-01-04 | 2011-01-02     | 2011-01-04
 bar | 2011-01-05 | 2011-01-10 | 2011-01-02     | 2011-01-04
 bar | 2011-01-07 | 2011-01-08 | 2011-01-05     | 2011-01-10
 foo | 2011-01-01 | 2011-01-15 | 2011-01-01     | 2011-01-15
 foo | 2011-01-15 | 2011-01-18 | 2011-01-01     | 2011-01-15
 foo | 2011-01-25 | 2011-01-30 | 2011-01-15     | 2011-01-18
 foo | 2011-01-28 | 2011-01-31 | 2011-01-25     | 2011-01-30
(7 rows)

Then flag every row that has an overlap (between date1 and previous_date1), summing these flags within the "id" grouping gives us a sub-grouping of the ids.

 id  |   date1    |   date2    | previous_date1 | previous_date2 | flag | group_id
-----+------------+------------+----------------+----------------+------+----------
 bar | 2011-01-02 | 2011-01-04 | 2011-01-02     | 2011-01-04     |    0 |        0
 bar | 2011-01-05 | 2011-01-10 | 2011-01-02     | 2011-01-04     |    1 |        1
 bar | 2011-01-07 | 2011-01-08 | 2011-01-05     | 2011-01-10     |    0 |        1
 foo | 2011-01-01 | 2011-01-15 | 2011-01-01     | 2011-01-15     |    0 |        0
 foo | 2011-01-15 | 2011-01-18 | 2011-01-01     | 2011-01-15     |    0 |        0
 foo | 2011-01-25 | 2011-01-30 | 2011-01-15     | 2011-01-18     |    1 |        1
 foo | 2011-01-28 | 2011-01-31 | 2011-01-25     | 2011-01-30     |    0 |        1
(7 rows)

Now we can group by id and the generated "group_id".

Maybe kind of crazy. I'm not sure that I would actually want to use this kind of solution because it could be hard to test, verify, document, and especially maintain a few years down the road. But I still think it is neat the things that can be done with sql.


Do you specifically need to do this using a (single) SQL query? If not, my advice is to take your language of choice and write a one-off script to perform this transformation of your data.

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜