
Compacting tables after normalisation

I recently increased the level of normalisation in my database, going from something like this:

+--------------------------------------+
| state_changes                        |
+----+-------+-----------+------+------+
| ID | Name  | Timestamp | Val1 | Val2 |
+----+-------+-----------+------+------+
| 0  | John  | 17:19:01  |  A   |  X   |
| 1  | Bob   | 17:19:02  |  E   |  W   |
| 2  | John  | 17:19:05  |  E   |  Y   |
| 3  | John  | 17:19:06  |  B   |  Y   |
| 4  | John  | 17:19:12  |  C   |  Z   |
| 5  | John  | 17:19:15  |  A   |  Z   |
+----+-------+-----------+------+------+

To something more like this:

+-------------------------------+   +-------------------------------+
| state_changes_1               |   | state_changes_2               |
+----+-------+-----------+------+   +----+-------+-----------+------+
| ID | Name  | Timestamp | Val1 |   | ID | Name  | Timestamp | Val2 |
+----+-------+-----------+------+   +----+-------+-----------+------+
| 0  | John  | 17:19:01  |  A   |   | 0  | John  | 17:19:01  |  X   |
| 1  | Bob   | 17:19:02  |  E   |   | 1  | Bob   | 17:19:02  |  W   |
| 2  | John  | 17:19:05  |  E   |   | 2  | John  | 17:19:05  |  Y   |
| 3  | John  | 17:19:06  |  B   |   | 3  | John  | 17:19:06  |  Y   |
| 4  | John  | 17:19:12  |  C   |   | 4  | John  | 17:19:12  |  Z   |
| 5  | John  | 17:19:15  |  A   |   | 5  | John  | 17:19:15  |  Z   |
+----+-------+-----------+------+   +----+-------+-----------+------+

How could I now write a query to "compact" the two resulting tables where values are duplicated?

  • I want to ignore the ID field when considering row uniqueness;
  • I want to ignore the Timestamp when considering row uniqueness;
  • But fields must be sequential (under a Name,Timestamp ordering) to be considered duplicates.

The result, in this example, should be:

+-------------------------------+   +-------------------------------+
| state_changes_1               |   | state_changes_2               |
+----+-------+-----------+------+   +----+-------+-----------+------+
| ID | Name  | Timestamp | Val1 |   | ID | Name  | Timestamp | Val2 |
+----+-------+-----------+------+   +----+-------+-----------+------+
| 0  | John  | 17:19:01  |  A   |   | 0  | John  | 17:19:01  |  X   |
| 1  | Bob   | 17:19:02  |  E   |   | 1  | Bob   | 17:19:02  |  W   |
| 3  | John  | 17:19:06  |  B   |   | 2  | John  | 17:19:05  |  Y   |
| 4  | John  | 17:19:12  |  C   |   | 4  | John  | 17:19:12  |  Z   |
| 5  | John  | 17:19:15  |  A   |   +----+-------+-----------+------+
+----+-------+-----------+------+

My tables have several billion rows, so I'm looking for something that takes efficiency into consideration; that said, I'm a realistic sort of person, so I'm happy for the query to take an hour or two to run (including index rebuilds) if need be.


I tried this on MySQL 5.1.58 and it seems to work with your test data.

-- Reset the user variables that track the previous row's Name and Val1
SET @name = NULL;
SET @val1 = NULL;

-- Walk the table in (Name, Timestamp) order; whenever a row repeats the
-- previous row's (Name, Val1) pair, blank out its Val1 as a deletion marker
-- (this assumes Val1 is never legitimately NULL)
UPDATE state_changes_1
SET Val1 = IF(Name=@name AND Val1=@val1, NULL, (@val1:=Val1)),
    Name = (@name:=Name)
ORDER BY Name, `Timestamp`;

-- Remove the rows that were marked as consecutive duplicates
DELETE FROM state_changes_1 WHERE Val1 IS NULL;
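
The same pair of statements compacts state_changes_2; just substitute Val2 (and a second user variable) for Val1. The trick relies on MySQL generally evaluating single-table UPDATE assignments left to right, so that Val1 is compared against the previous row's values before @name is overwritten; it's worth verifying the result on a copy of the table first.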


The problem is that your concept of a 'sequential' or consecutive duplicate doesn't exist in relational algebra, so you won't be able to express it in plain SQL. You can easily get the latest timestamp of each state by doing

-- latest timestamp at which each (name, state) combination was seen
SELECT name, state, MAX(timestamp) AS ts
FROM states
GROUP BY name, state
ORDER BY ts;

However, you could do what you want by dumping the table into a text file and running a simple script in whichever language you are comfortable with: Perl, Ruby, Python, etc. Even on a million-row table that could be done quite quickly.
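
As a sketch of that approach, here is a minimal Python script (Python being one of the languages mentioned above). It assumes a hypothetical tab-separated dump with columns ID, Name, Timestamp, Val1, already sorted by Name, Timestamp (for example produced with SELECT ... INTO OUTFILE); it streams the file and prints the IDs of rows that repeat the previous row's (Name, Val1) pair:

import csv
import sys

def duplicate_ids(path):
    # Yield the IDs of rows whose (Name, Val1) repeats the previous row's.
    prev_key = None
    with open(path, newline="") as dump:
        for row_id, name, _timestamp, val1 in csv.reader(dump, delimiter="\t"):
            key = (name, val1)
            if key == prev_key:
                yield row_id      # consecutive duplicate: candidate for deletion
            else:
                prev_key = key    # value changed: this row stays

if __name__ == "__main__":
    for dup in duplicate_ids(sys.argv[1]):
        print(dup)                # feed these IDs into DELETE ... WHERE ID IN (...)

The printed IDs can then be deleted in batches rather than in one enormous statement.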

