Deleting redundant entries from MySQL table because of overlapping dates

I have a MyISAM table of affiliations between organizations and individuals. Each record has a start and an end date. These records are added while processing large text files, so I don't do a lot of processing and cleaning as they are added in order to speed up the text parsing. However, some of the records are redundant or potentially redundant because they contain date ranges that overlap.

For instance, I could have the following:

aff_id  aff_e1_id  aff_e1_type  aff_e2_id  aff_e2_type  aff_start    aff_end
------  ---------  -----------  ---------  -----------  -----------  ----------
01       172        org            131       indiv      1997-01-22   1998-03-31
02       172        org            131       indiv      1997-01-22   1999-04-03
03       100        org            127       indiv      1995-01-02   2000-01-05
04       100        org            127       indiv      1994-01-24   1999-03-04

What I would like to do is combine the records which are redundant relationships and modify the date range to include any overlaps. For example, the first two and the last two records, respectively, could be combined and the dates modified to include both dates.

Is there a way to do this entirely within MySQL?

Edited: In response to comments below, columns 2, 3, 4, and 5 need to be identical, and then the dates need to be checked for overlap (if they don't overlap at all, the records can be left alone).

A stored procedure would be great but is there a faster way than using a cursor to cycle through all the records and compare them one-on-one?


You can solve it with a series of delete/update statements:

  • Delete all ranges that are completely within another range
  • Extend any range that starts before another range and has an end-date >= that range's start-date, setting its end-date to the other range's end-date
  • Repeat (assuming you could have a series of rows that overlap for the same id) until your update statement doesn't update any rows

I think you could instead repeat the update until it changes nothing and run the delete once at the end, but depending on how much data and how many overlaps you have, that may not be ideal anyway.

Delete Statement:

DELETE sub
FROM tab AS sub
INNER JOIN tab AS sup
  -- same organization/individual pairing
  ON  sub.aff_e1_type = sup.aff_e1_type
  AND sub.aff_e2_type = sup.aff_e2_type
  AND sub.aff_e1_id = sup.aff_e1_id
  AND sub.aff_e2_id = sup.aff_e2_id
  -- exact duplicates: keep only the row with the highest aff_id
  AND ( ( sub.aff_start = sup.aff_start
     AND  sub.aff_end = sup.aff_end
     AND  sub.aff_id < sup.aff_id)
     -- sub lies inside sup and starts later
     OR ( sub.aff_start > sup.aff_start
     AND  sub.aff_end <= sup.aff_end
     AND  sub.aff_id <> sup.aff_id)
     -- sub lies inside sup and ends earlier
     OR ( sub.aff_start >= sup.aff_start
     AND  sub.aff_end < sup.aff_end
     AND  sub.aff_id <> sup.aff_id)
   );

Update Statement:

UPDATE tab AS row1
INNER JOIN tab AS row2
  ON  row1.aff_e1_type = row2.aff_e1_type
  AND row1.aff_e2_type = row2.aff_e2_type
  AND row1.aff_e1_id = row2.aff_e1_id
  AND row1.aff_e2_id = row2.aff_e2_id
  AND row1.aff_end >= row2.aff_start
  AND row1.aff_start < row2.aff_start
  -- only ever extend a range, never shrink it
  AND row1.aff_end < row2.aff_end
  AND row1.aff_id <> row2.aff_id
SET    row1.aff_end = row2.aff_end;
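
To check what the delete/update loop should leave behind, it can help to compute the expected ranges independently. The sketch below is not the SQL approach above: it is a plain sort-and-sweep merge in Python over the four sample rows from the question, meant only as a cross-check. The names `ROWS` and `merge_overlaps` are made up for this illustration.

```python
from collections import defaultdict
from datetime import date

# Sample rows from the question:
# (aff_id, aff_e1_id, aff_e1_type, aff_e2_id, aff_e2_type, aff_start, aff_end)
ROWS = [
    (1, 172, "org", 131, "indiv", date(1997, 1, 22), date(1998, 3, 31)),
    (2, 172, "org", 131, "indiv", date(1997, 1, 22), date(1999, 4, 3)),
    (3, 100, "org", 127, "indiv", date(1995, 1, 2), date(2000, 1, 5)),
    (4, 100, "org", 127, "indiv", date(1994, 1, 24), date(1999, 3, 4)),
]


def merge_overlaps(rows):
    """Map each (e1_id, e1_type, e2_id, e2_type) key to its merged date ranges."""
    groups = defaultdict(list)
    for _aff_id, e1_id, e1_type, e2_id, e2_type, start, end in rows:
        groups[(e1_id, e1_type, e2_id, e2_type)].append((start, end))

    merged = {}
    for key, ranges in groups.items():
        ranges.sort()  # order by start date, then end date
        out = [ranges[0]]
        for start, end in ranges[1:]:
            last_start, last_end = out[-1]
            if start <= last_end:
                # Overlaps (or touches) the previous range: extend it.
                out[-1] = (last_start, max(last_end, end))
            else:
                # Gap between ranges: keep them separate.
                out.append((start, end))
        merged[key] = out
    return merged


if __name__ == "__main__":
    for key, ranges in merge_overlaps(ROWS).items():
        for start, end in ranges:
            print(key, start, end)
```

For the sample data this yields one range per pairing: 1997-01-22 to 1999-04-03 for 172/131, and 1994-01-24 to 2000-01-05 for 100/127. Ranges separated by a gap stay as separate entries.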


One way to do this is to create a new copy of the table, copying the data over with the new groupings you want, then rename the tables to replace the old table with the new table. If the table is very large you may be better off dumping the data to disk using SELECT ... INTO OUTFILE and then loading it into the new table using LOAD DATA INFILE.

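Here's an example of the first approach I described. It assumes aff_id is an AUTO_INCREMENT column, so inserting NULL assigns fresh ids. Note that grouping this way collapses every row for the same pairing into a single range, whether or not the dates actually overlap: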

CREATE TABLE your_table_new LIKE your_table;

INSERT INTO your_table_new(aff_id, aff_e1_id, aff_e1_type, aff_e2_id, aff_e2_type, 
  aff_start, aff_end)
SELECT NULL as aff_id, aff_e1_id, aff_e1_type, aff_e2_id, aff_e2_type, 
  MIN(aff_start), MAX(aff_end)
FROM your_table
GROUP BY aff_e1_id, aff_e1_type, aff_e2_id, aff_e2_type;

RENAME TABLE your_table TO your_table_old, 
  your_table_new TO your_table;
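
A quick way to see what the grouped SELECT returns is to run it against a throwaway table. The sketch below uses Python's built-in sqlite3 module with an in-memory database as a stand-in for MySQL, and stores dates as ISO strings so MIN and MAX compare correctly. The fifth row and the function name `grouped_ranges` are hypothetical, added only to show how a non-overlapping range is handled.

```python
import sqlite3


def grouped_ranges():
    """Run the GROUP BY query from above against sample data in SQLite."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE your_table (
               aff_id INTEGER PRIMARY KEY,
               aff_e1_id INTEGER, aff_e1_type TEXT,
               aff_e2_id INTEGER, aff_e2_type TEXT,
               aff_start TEXT, aff_end TEXT)"""
    )
    rows = [
        (1, 172, "org", 131, "indiv", "1997-01-22", "1998-03-31"),
        (2, 172, "org", 131, "indiv", "1997-01-22", "1999-04-03"),
        (3, 100, "org", 127, "indiv", "1995-01-02", "2000-01-05"),
        (4, 100, "org", 127, "indiv", "1994-01-24", "1999-03-04"),
        # Hypothetical extra row: same pairing as rows 1-2, no date overlap.
        (5, 172, "org", 131, "indiv", "2005-01-01", "2006-01-01"),
    ]
    conn.executemany("INSERT INTO your_table VALUES (?, ?, ?, ?, ?, ?, ?)", rows)
    result = conn.execute(
        """SELECT aff_e1_id, aff_e1_type, aff_e2_id, aff_e2_type,
                  MIN(aff_start), MAX(aff_end)
           FROM your_table
           GROUP BY aff_e1_id, aff_e1_type, aff_e2_id, aff_e2_type
           ORDER BY aff_e1_id"""
    ).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for row in grouped_ranges():
        print(row)
```

In this run the 172/131 pairing comes back as a single range from 1997-01-22 to 2006-01-01, so the years between 1999 and 2005 are silently included. If non-overlapping ranges must stay separate, the grouping needs an extra step to split each pairing into overlap groups first.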