Compacting tables after normalisation
I recently increased the level of normalisation in my database, going from something like this:
+--------------------------------------+
|            state_changes             |
+----+-------+-----------+------+------+
| ID | Name  | Timestamp | Val1 | Val2 |
+----+-------+-----------+------+------+
| 0  | John  | 17:19:01  | A    | X    |
| 1  | Bob   | 17:19:02  | E    | W    |
| 2  | John  | 17:19:05  | E    | Y    |
| 3  | John  | 17:19:06  | B    | Y    |
| 4  | John  | 17:19:12  | C    | Z    |
| 5  | John  | 17:19:15  | A    | Z    |
+----+-------+-----------+------+------+
To something more like this:
+-------------------------------+    +-------------------------------+
|        state_changes_1        |    |        state_changes_2        |
+----+-------+-----------+------+    +----+-------+-----------+------+
| ID | Name  | Timestamp | Val1 |    | ID | Name  | Timestamp | Val2 |
+----+-------+-----------+------+    +----+-------+-----------+------+
| 0  | John  | 17:19:01  | A    |    | 0  | John  | 17:19:01  | X    |
| 1  | Bob   | 17:19:02  | E    |    | 1  | Bob   | 17:19:02  | W    |
| 2  | John  | 17:19:05  | E    |    | 2  | John  | 17:19:05  | Y    |
| 3  | John  | 17:19:06  | B    |    | 3  | John  | 17:19:06  | Y    |
| 4  | John  | 17:19:12  | C    |    | 4  | John  | 17:19:12  | Z    |
| 5  | John  | 17:19:15  | A    |    | 5  | John  | 17:19:15  | Z    |
+----+-------+-----------+------+    +----+-------+-----------+------+
How could I now write a query to "compact" the two resulting tables where values are duplicated?
- I want to ignore the ID field when considering row uniqueness;
- I want to ignore the Timestamp when considering row uniqueness;
- but rows must be consecutive (under a Name, Timestamp ordering) to be considered duplicates.
The result, in this example, should be:
+-------------------------------+    +-------------------------------+
|        state_changes_1        |    |        state_changes_2        |
+----+-------+-----------+------+    +----+-------+-----------+------+
| ID | Name  | Timestamp | Val1 |    | ID | Name  | Timestamp | Val2 |
+----+-------+-----------+------+    +----+-------+-----------+------+
| 0  | John  | 17:19:01  | A    |    | 0  | John  | 17:19:01  | X    |
| 1  | Bob   | 17:19:02  | E    |    | 1  | Bob   | 17:19:02  | W    |
| 3  | John  | 17:19:06  | B    |    | 2  | John  | 17:19:05  | Y    |
| 4  | John  | 17:19:12  | C    |    | 4  | John  | 17:19:12  | Z    |
| 5  | John  | 17:19:15  | A    |    +----+-------+-----------+------+
+----+-------+-----------+------+
My tables have several billion rows, so I'm looking for something that takes efficiency into consideration; that said, I'm a realistic sort of person, so I'm happy for the query to take an hour or two to run (including index rebuilds) if need be.
I tried this on MySQL 5.1.58 and it seems to work with your test data.
-- Track the previous row's Name and Val1 as the table is scanned in order.
SET @name = NULL;
SET @val1 = NULL;

-- Blank out Val1 on rows that repeat the previous row's (Name, Val1),
-- scanning in Name, Timestamp order (requires Val1 to be nullable).
UPDATE state_changes_1
SET Val1 = IF(Name=@name AND Val1=@val1, NULL, (@val1:=Val1)),
    Name = (@name:=Name)
ORDER BY Name, `Timestamp`;

-- Remove the marked duplicates.
DELETE FROM state_changes_1 WHERE Val1 IS NULL;
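The same trick should carry over to the other table by swapping Val1 for Val2; a minimal sketch, assuming state_changes_2 has the same layout and a nullable Val2 column:

SET @name = NULL;
SET @val2 = NULL;

UPDATE state_changes_2
SET Val2 = IF(Name=@name AND Val2=@val2, NULL, (@val2:=Val2)),
    Name = (@name:=Name)
ORDER BY Name, `Timestamp`;

DELETE FROM state_changes_2 WHERE Val2 IS NULL;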
Your problem is that your concept of a 'sequential' or consecutive duplicate doesn't exist in relational algebra, so you won't be able to express it directly in SQL. You can easily get the latest timestamp of each state by doing
SELECT id, name, MAX(timestamp) ts, state
FROM states
GROUP BY id, name, state
ORDER BY ts
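Mapped onto the question's state_changes_1 columns, and dropping ID from the grouping since it is to be ignored, that idea would presumably look something like:

SELECT Name, MAX(`Timestamp`) AS ts, Val1
FROM state_changes_1
GROUP BY Name, Val1
ORDER BY ts;

Note that this collapses every repeated (Name, Val1) pair, not just consecutive ones, which is why the scripting approach below is suggested for the exact requirement.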
However, you could do what you want by dumping your table into a text file and writing a simple script in whichever language you are comfortable with: Perl, Ruby, Python, etc. Even on a million-row table that could be done quite quickly.