Is this a good solution to remove duplicate MySQL rows?

I saw the solution that creates a temporary MySQL table containing only the unique rows, but I didn't like that idea: my tables are very large, so moving them would be a hassle (and would cause huge problems if errors occurred during the move).

I did, however, find the following. What do you think of this (where the column to check for duplicates is "field_name")?

DELETE FROM table1
USING table1, table1 as vtable
WHERE (NOT table1.ID=vtable.ID)
AND (table1.field_name=vtable.field_name)

Somebody said this should work, but I'm not quite sure. What do you think? Also, will having indexes at all alter the performance of this command, say, having an index on "field_name"?

EDIT: Would there be any way to test the query before running it? As far as I know, MySQL doesn't support "explain" on DELETE queries.


Note that the query you show will delete both duplicates. I would assume you want to keep one or the other.

Here's how I would write this query:

DELETE t1 FROM table1 AS t1 JOIN table1 AS t2 
  ON t1.id > t2.id AND t1.field_name = t2.field_name;

By using greater-than instead of not-equals-to, you only delete one row (the later one), instead of both.
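
If you want to see which rows that DELETE would remove before running it, one option is to turn it into a SELECT with the same join. This is only a sketch, using the table and column names from the question:

```sql
-- Rows the DELETE above would remove: every row that has an
-- earlier row (smaller id) with the same field_name.
-- DISTINCT is needed because a row can match several earlier rows.
SELECT DISTINCT t1.id, t1.field_name
FROM table1 AS t1
JOIN table1 AS t2
  ON t1.id > t2.id AND t1.field_name = t2.field_name
ORDER BY t1.field_name, t1.id;
```

The row with the lowest id for each field_name should never appear in this list, since that is the one the DELETE keeps.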

A compound index over (id, field_name) may help. You should confirm this with MySQL's EXPLAIN to get the optimization report. EXPLAIN only supports SELECT queries in older versions of MySQL (support for DELETE arrived in 5.6.3), so if it rejects your DELETE, run an equivalent SELECT to confirm the optimization:

EXPLAIN SELECT * FROM table1 AS t1 JOIN table1 AS t2 
  ON t1.id > t2.id AND t1.field_name = t2.field_name;
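
As a rough sketch of how you might compare index choices, you could add an index and re-run the EXPLAIN. The index name below is hypothetical, and putting field_name first (the equality column before the range column) is my assumption rather than something measured; check the EXPLAIN output on your own data. Keep in mind that building an index on a very large table takes time in itself.

```sql
-- Hypothetical index name; the column order is an assumption to test.
ALTER TABLE table1 ADD INDEX idx_field_name_id (field_name, id);

-- Re-check the plan; compare the key and rows columns with the earlier run.
EXPLAIN SELECT t1.id FROM table1 AS t1 JOIN table1 AS t2
  ON t1.id > t2.id AND t1.field_name = t2.field_name;
```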

You also asked about testing. I'd recommend copying a sample of rows containing duplicates to a table in your test database:

CREATE TABLE test.table1test SELECT * FROM realdb.table1 LIMIT 10000;

Now you can perform experiments on your sample data until you're satisfied the DELETE solution is correct.

USE test;
SET autocommit = 0;
DELETE ... 
ROLLBACK;

I'd recommend naming your scratch table in the test database something distinct from your real table in your real database. Just in case you run an experimental DELETE while you are accidentally still using your real database as the default database!
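
Put together, one experiment on the scratch table might look like the sketch below. It reuses the DELETE from earlier in this answer against test.table1test. The ALTER TABLE line is an assumption on my part: ROLLBACK only undoes changes on a transactional storage engine such as InnoDB, and your scratch table may have been created with a different default engine.

```sql
USE test;

-- ROLLBACK has no effect on non-transactional engines such as MyISAM,
-- so make sure the scratch table is InnoDB first.
ALTER TABLE table1test ENGINE = InnoDB;

SET autocommit = 0;

SELECT COUNT(*) FROM table1test;   -- row count before the experiment

DELETE t1 FROM table1test AS t1 JOIN table1test AS t2
  ON t1.id > t2.id AND t1.field_name = t2.field_name;

SELECT COUNT(*) FROM table1test;   -- row count after the DELETE

ROLLBACK;                          -- discard the DELETE

SELECT COUNT(*) FROM table1test;   -- should match the first count again
```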


Re your comments:

USE test is a mysql client builtin command. It sets the test database as the default database. This will be the default database when you name tables in your queries without qualifying them with a database name. See http://dev.mysql.com/doc/refman/5.1/en/use.html

SET autocommit = 0 turns off the default behavior of committing a transaction for each query implicitly. So you must explicitly give the COMMIT or ROLLBACK command to finish a transaction. See http://dev.mysql.com/doc/refman/5.1/en/commit.html

It's worthwhile to use ROLLBACK when you're experimenting because it discards the changes made in that transaction. It's a quick way to return to the initial state of your test data so you can try another experiment.

DELETE t1 is not a typo. DELETE deletes rows, not whole tables. t1 is an alias to each row that satisfies the conditions of the statement (although it is possible that the conditions include every row in the table). See description of multi-table delete at http://dev.mysql.com/doc/refman/5.1/en/delete.html

Sort of like when you run a loop in PHP and you use a variable to iterate over the loop: for ($i=0; $i<100; ++$i) ... The variable $i takes on a series of values, and each time through the loop it has a different value.

Here's a demo showing how my solution deletes multiple duplicates. I ran this in my test database and I'm pasting the result directly from my command window:

mysql> create table table1 (id serial primary key, field_name varchar(10));
Query OK, 0 rows affected (0.45 sec)

mysql> insert into table1 (field_name) 
       values (42), (42), (42), (42), (42), (42);
Query OK, 6 rows affected (0.00 sec)
Records: 6  Duplicates: 0  Warnings: 0

mysql> select * from table1;
+----+------------+
| id | field_name |
+----+------------+
|  1 | 42         | 
|  2 | 42         | 
|  3 | 42         | 
|  4 | 42         | 
|  5 | 42         | 
|  6 | 42         | 
+----+------------+
6 rows in set (0.00 sec)

mysql> delete t1 from table1 t1 join table1 t2 
       on t1.id > t2.id and t1.field_name = t2.field_name;
Query OK, 5 rows affected (0.00 sec)

mysql> select * from table1;
+----+------------+
| id | field_name |
+----+------------+
|  1 | 42         | 
+----+------------+
1 row in set (0.00 sec)
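
On a real table, where you can't eyeball the result the way you can with six rows, a grouped count is one way to confirm nothing was missed. A minimal sketch, again using the names from the question:

```sql
-- Lists each field_name value that still occurs more than once.
-- An empty result means no duplicates remain.
SELECT field_name, COUNT(*) AS copies
FROM table1
GROUP BY field_name
HAVING COUNT(*) > 1;
```

Running the same query before the DELETE also tells you how many distinct values are duplicated to begin with.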


That query will run, though as written it deletes every copy of a duplicated value rather than keeping one of them. Having indexes will alter the performance, but how much really depends on the size of the table.

As for testing this out, I would copy a subset of the data to a temporary table and run the command on the temp table before you run it on your real table.

Remember to always back up your tables before performing any major batch job, so that you can restore them if something goes wrong.
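
If a full dump isn't practical, one in-database option is to copy the table first. This is only a sketch; table1_backup is a hypothetical name, and on a very large table the copy will take noticeable time and disk space.

```sql
-- Create an empty table with the same columns and indexes as table1.
CREATE TABLE table1_backup LIKE table1;

-- Copy every row across.
INSERT INTO table1_backup SELECT * FROM table1;
```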


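The method that I use avoids the self-join and may be faster, though that is worth measuring on your own data:

DELETE FROM table1 WHERE id NOT IN (SELECT keep_id FROM (SELECT MIN(x.id) AS keep_id FROM table1 AS x GROUP BY x.field_name) AS keepers);

The inner subselect gathers the list of ids that you want to keep, one for each field_name, and the DELETE statement then removes all the extra duplicate rows. The extra derived-table wrapper (keepers) is there because MySQL rejects a subquery that reads directly from the table being deleted from (error 1093, "You can't specify target table ... for update in FROM clause").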
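
To preview what this approach would keep and remove, you can run the grouping on its own first. A minimal sketch; keep_id and extra_rows are just illustrative column aliases:

```sql
-- For each duplicated field_name: the id that would be kept
-- and how many extra rows would be deleted.
SELECT field_name, MIN(id) AS keep_id, COUNT(*) - 1 AS extra_rows
FROM table1
GROUP BY field_name
HAVING COUNT(*) > 1;
```

One difference from the join-based DELETE is worth checking if field_name can be NULL: GROUP BY puts all NULL values into a single group, so this method keeps only one NULL row, whereas a join on t1.field_name = t2.field_name never matches NULLs and leaves them all in place.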

Also, yes, the index on the field_name field will improve performance in your query.
