Compacting tables after normalisation
I recently increased the level of normalisation in my database, going from something like this:
+--------------------------------------+
|            state_changes             |
+----+-------+-----------+------+------+
| ID | Name  | Timestamp | Val1 | Val2 |
+----+-------+-----------+------+------+
| 0  | John  | 17:19:01  | A    | X    |
| 1  | Bob   | 17:19:02  | E    | W    |
| 2  | John  | 17:19:05  | E    | Y    |
| 3  | John  | 17:19:06  | B    | Y    |
| 4  | John  | 17:19:12  | C    | Z    |
| 5  | John  | 17:19:15  | A    | Z    |
+----+-------+-----------+------+------+
To something more like this:
+-------------------------------+    +-------------------------------+
|        state_changes_1        |    |        state_changes_2        |
+----+-------+-----------+------+    +----+-------+-----------+------+
| ID | Name  | Timestamp | Val1 |    | ID | Name  | Timestamp | Val2 |
+----+-------+-----------+------+    +----+-------+-----------+------+
| 0  | John  | 17:19:01  | A    |    | 0  | John  | 17:19:01  | X    |
| 1  | Bob   | 17:19:02  | E    |    | 1  | Bob   | 17:19:02  | W    |
| 2  | John  | 17:19:05  | E    |    | 2  | John  | 17:19:05  | Y    |
| 3  | John  | 17:19:06  | B    |    | 3  | John  | 17:19:06  | Y    |
| 4  | John  | 17:19:12  | C    |    | 4  | John  | 17:19:12  | Z    |
| 5  | John  | 17:19:15  | A    |    | 5  | John  | 17:19:15  | Z    |
+----+-------+-----------+------+    +----+-------+-----------+------+
How could I now write a query to "compact" the two resulting tables where values are duplicated?
- I want to ignore the ID field when considering row uniqueness;
- I want to ignore the Timestamp when considering row uniqueness;
- but rows must be consecutive (under a Name, Timestamp ordering) to be considered duplicates.
The result, in this example, should be:
+-------------------------------+    +-------------------------------+
|        state_changes_1        |    |        state_changes_2        |
+----+-------+-----------+------+    +----+-------+-----------+------+
| ID | Name  | Timestamp | Val1 |    | ID | Name  | Timestamp | Val2 |
+----+-------+-----------+------+    +----+-------+-----------+------+
| 0  | John  | 17:19:01  | A    |    | 0  | John  | 17:19:01  | X    |
| 1  | Bob   | 17:19:02  | E    |    | 1  | Bob   | 17:19:02  | W    |
| 3  | John  | 17:19:06  | B    |    | 2  | John  | 17:19:05  | Y    |
| 4  | John  | 17:19:12  | C    |    | 4  | John  | 17:19:12  | Z    |
| 5  | John  | 17:19:15  | A    |    +----+-------+-----------+------+
+----+-------+-----------+------+
My tables have several billion rows, so I'm looking for something that takes efficiency into consideration; that said, I'm a realistic sort of person, so I'm happy for the query to take an hour or two to run (including index rebuilds) if need be.
I tried this on MySQL 5.1.58 and it seems to work with your test data.
-- Track the previous row's Name and Val1 as the table is scanned in order.
SET @name = NULL;
SET @val1 = NULL;

-- Blank out Val1 on rows that repeat the previous row's (Name, Val1),
-- scanning in Name, Timestamp order (requires Val1 to be nullable).
UPDATE state_changes_1
SET Val1 = IF(Name=@name AND Val1=@val1, NULL, (@val1:=Val1)),
    Name = (@name:=Name)
ORDER BY Name, `Timestamp`;

-- Remove the marked duplicates.
DELETE FROM state_changes_1 WHERE Val1 IS NULL;
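The same trick should carry over to the other table by swapping Val1 for Val2; a minimal sketch, assuming state_changes_2 has the same layout and a nullable Val2 column:

SET @name = NULL;
SET @val2 = NULL;

UPDATE state_changes_2
SET Val2 = IF(Name=@name AND Val2=@val2, NULL, (@val2:=Val2)),
    Name = (@name:=Name)
ORDER BY Name, `Timestamp`;

DELETE FROM state_changes_2 WHERE Val2 IS NULL;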
Your problem is that your concept of a 'sequential' or consecutive duplicate doesn't exist in relational algebra, so you won't be able to express it directly in SQL. You can easily get the latest timestamp of each state by doing
SELECT id, name, MAX(timestamp) ts, state
FROM states
GROUP BY id, name, state
ORDER BY ts
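Mapped onto the question's state_changes_1 columns, and dropping ID from the grouping since it is to be ignored, that idea would presumably look something like:

SELECT Name, MAX(`Timestamp`) AS ts, Val1
FROM state_changes_1
GROUP BY Name, Val1
ORDER BY ts;

Note that this collapses every repeated (Name, Val1) pair, not just consecutive ones, which is why the scripting approach below is suggested for the exact requirement.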
However, you could do what you want by dumping your table into a text file and writing a simple script in whichever language you are comfortable with: Perl, Ruby, Python, etc. Even on a million-row table that could be done quite quickly.