Deleting redundant entries from MySQL table because of overlapping dates
I have a MyISAM table of affiliations between organizations and individuals. Each record has a start and an end date. These records are added while processing large text files, so I don't do a lot of processing and cleaning as they are added in order to speed up the text parsing. However, some of the records are redundant or potentially redundant because they contain date ranges that overlap.
For instance, I could have the following:
aff_id aff_e1_id aff_e1_type aff_e2_id aff_e2_type aff_start aff_end
------ --------- ----------- --------- ----------- ----------- ----------
01 172 org 131 indiv 1997-01-22 1998-03-31
02 172 org 131 indiv 1997-01-22 1999-04-03
03 100 org 127 indiv 1995-01-02 2000-01-05
04 100 org 127 indiv 1994-01-24 1999-03-04
What I would like to do is combine the records which are redundant relationships and modify the date range to include any overlaps. For example, the first two and the last two records, respectively, could be combined and the dates modified to include both dates.
Is there a way to do this entirely within MySQL?
Edited: In response to comments below, the 2, 3, 4, 5 columns need t开发者_JAVA技巧o be identical, and then to check if the dates overlap (if they don't overlap at all, can just leave them alone).
A stored procedure would be great but is there a faster way than using a cursor to cycle through all the records and compare them one-on-one?
You can solve it with a series of delete/update statements:
- Delete all ranges that are completely within another range
- Update any ranges that have an end-date >= another range's start-date
- Repeat (assuming you could have a series of rows that overlap for the same id) until your update statement doesn't update any rows
I think you could just keep doing the update over and do the delete once at the end, but depending on how much data and how many overlaps, that may not be ideal anyway.
Delete Statement:
DELETE sub
FROM tab AS sub
INNER JOIN tab AS sup
ON sub.aff_e1_type = sup.aff_e1_type
AND sub.aff_e2_type = sup.aff_e2_type
AND sub.aff_e1_id = sup.aff_e1_id
AND sub.aff_e2_id = sup.aff_e2_id
AND ( ( sub.aff_start = sup.aff_start
AND sub.aff_end = sup.aff_end
AND sub.aff_id < sup.aff_id)
OR ( sub.aff_start > sup.aff_start
AND sub.aff_end <= sup.aff_end
AND sub.aff_id <> sup.aff_id)
OR ( sub.aff_start >= sup.aff_start
AND sub.aff_end < sup.aff_end
AND sub.aff_id <> sup.aff_id)
)
Update Statement:
UPDATE tab AS row1
INNER JOIN tab AS row2
ON row1.aff_e1_type = row2.aff_e1_type
AND row1.aff_e2_type = row2.aff_e2_type
AND row1.aff_e1_id = row2.aff_e1_id
AND row1.aff_e2_id = row2.aff_e2_id
AND row1.aff_end >= row2.aff_start
AND row1.aff_start < row2.aff_start
AND row1.aff_id <> row2.aff_id
SET row1.aff_end = row2.aff_end
One way to do this is to create a new copy of the table, copying the data over with the new groupings you want, then rename the tables to replace the old table with the new table. If the table is very large you may be better off dumping the data to disk using SELECT ... INTO OUTFILE
and then loading it into the new table using LOAD DATA INFILE
.
Here's an example of the first approach I described:
CREATE TABLE your_table_new LIKE your_table;
INSERT INTO your_table_new(aff_id, aff_e1_id, aff_e1_type, aff_e2_id, aff_e2_type,
aff_start, aff_end)
SELECT NULL as aff_id, aff_e1_id, aff_e1_type, aff_e2_id, aff_e2_type,
MIN(aff_start), MAX(aff_end)
FROM your_table
GROUP BY aff_e1_id, aff_e1_type, aff_e2_id, aff_e2_type;
RENAME TABLE your_table TO your_table_old,
your_table_new TO your_table;
精彩评论