SQLite: how to merge rows that match in alternating columns?
So let's suppose that we're working with a sqlite table that looks (roughly) like this:
id date1 date2
+-------+----------+----------+
| foo |10/01/2010|01/01/2011|
+-------+----------+----------+
| bar |07/01/2010|10/01/2010|
+-------+----------+----------+
... ... ...
etc... I'm trying to somehow merge those rows with the same id and a combination of date1 and date2 values which happen to specify a range that would otherwise be continuous, were it not scattered over multiple rows. In other words, this:
id date1 date2
+-------+----------+----------+
| foo |07/01/2010|10/01/2010|
+-------+----------+----------+
| foo |10/01/2010|01/01/2011|
+-------+----------+----------+
would become:开发者_JS百科
id date1 date2
+-------+----------+----------+
| foo |07/01/2010|01/01/2011|
+-------+----------+----------+
and so on for cases where you have 3 (or more) bar's each mapped to three (or more) different, although altogether continuous, ranges. What'd such a query look like? I haven't been able to come up with any reasonable solutions so far, though I'm not much of a SQLista myself.
I realize sqlite does not support analytic functions, but... Here is a potential sql solution that uses analytic functions. I ran this in Postgresql.
CREATE TABLE test(id VARCHAR(16), date1 DATE, date2 DATE);
INSERT INTO test VALUES('foo', '2011-01-01', '2011-01-15');
INSERT INTO test VALUES('bar', '2011-01-02', '2011-01-04');
INSERT INTO test VALUES('bar', '2011-01-05', '2011-01-10'); -- not contiguous
INSERT INTO test VALUES('foo', '2011-01-25', '2011-01-30');
INSERT INTO test VALUES('foo', '2011-01-15', '2011-01-18'); -- contiguous
INSERT INTO test VALUES('foo', '2011-01-28', '2011-01-31'); -- overlap
INSERT INTO test VALUES('bar', '2011-01-07', '2011-01-08'); -- subset chopped
postgres=# SELECT * FROM test ORDER BY id, date1;
id | date1 | date2
-----+------------+------------
bar | 2011-01-02 | 2011-01-04
bar | 2011-01-05 | 2011-01-10
bar | 2011-01-07 | 2011-01-08
foo | 2011-01-01 | 2011-01-15
foo | 2011-01-15 | 2011-01-18
foo | 2011-01-25 | 2011-01-30
foo | 2011-01-28 | 2011-01-31
(7 rows)
SELECT id
,MIN(date1) AS date1
,MAX(date2) AS date2
FROM ( SELECT id, date1, date2, previous_date1, previous_date2
,SUM( CASE WHEN date1 > previous_date2 THEN 1 ELSE 0 END ) OVER(PARTITION BY id ORDER BY id, date1) AS group_id
FROM ( SELECT id, date1, date2
,COALESCE( LAG(date1) OVER (PARTITION BY id ORDER BY id, date1), date1 ) previous_date1
,COALESCE( LAG(date2) OVER (PARTITION BY id ORDER BY id, date1), date2 ) previous_date2
FROM test
ORDER BY id, date1, date2
) AS x
) AS y
GROUP BY id, group_id
ORDER BY 1,2;
id | date1 | date2
-----+------------+------------
bar | 2011-01-02 | 2011-01-04
bar | 2011-01-05 | 2011-01-10
foo | 2011-01-01 | 2011-01-18
foo | 2011-01-25 | 2011-01-31
(4 rows)
EXPLANATION
Working from the inside out, first sort the rows by id and date, and add two extra columns to each row to indicate the previous row's date1 and date2 values.
id | date1 | date2 | previous_date1 | previous_date2
-----+------------+------------+----------------+----------------
bar | 2011-01-02 | 2011-01-04 | 2011-01-02 | 2011-01-04
bar | 2011-01-05 | 2011-01-10 | 2011-01-02 | 2011-01-04
bar | 2011-01-07 | 2011-01-08 | 2011-01-05 | 2011-01-10
foo | 2011-01-01 | 2011-01-15 | 2011-01-01 | 2011-01-15
foo | 2011-01-15 | 2011-01-18 | 2011-01-01 | 2011-01-15
foo | 2011-01-25 | 2011-01-30 | 2011-01-15 | 2011-01-18
foo | 2011-01-28 | 2011-01-31 | 2011-01-25 | 2011-01-30
(7 rows)
Then flag every row that has an overlap (between date1 and previous_date1), summing these flags within the "id" grouping gives us a sub-grouping of the ids.
id | date1 | date2 | previous_date1 | previous_date2 | flag | group_id
-----+------------+------------+----------------+----------------+------+----------
bar | 2011-01-02 | 2011-01-04 | 2011-01-02 | 2011-01-04 | 0 | 0
bar | 2011-01-05 | 2011-01-10 | 2011-01-02 | 2011-01-04 | 1 | 1
bar | 2011-01-07 | 2011-01-08 | 2011-01-05 | 2011-01-10 | 0 | 1
foo | 2011-01-01 | 2011-01-15 | 2011-01-01 | 2011-01-15 | 0 | 0
foo | 2011-01-15 | 2011-01-18 | 2011-01-01 | 2011-01-15 | 0 | 0
foo | 2011-01-25 | 2011-01-30 | 2011-01-15 | 2011-01-18 | 1 | 1
foo | 2011-01-28 | 2011-01-31 | 2011-01-25 | 2011-01-30 | 0 | 1
(7 rows)
Now we can group by id and the generated "group_id".
Maybe kind of crazy. I'm not sure that I would actually want to use this kind of solution because it could be hard to test, verify, document, and especially maintain a few years down the road. But I still think it is neat the things that can be done with sql.
Do you specifically need to do this using a (single) SQL query? If not, my advice is to take your language of choice and write a one-off script to perform this transformation of your data.
精彩评论