What is the best way to merge 2 MySQL data dumps?
We have built an application with MySQL as the database. Every week we export the data dump from the database, and d开发者_如何学编程elete all the data. Now we want to merge all these dumps together for some data-analysis tasks.
The problem we are facing is that the "id" field for all the tables is Auto-Increment, so it starts with 1 in all the data dumps, which causes duplicate IDs in the table. I am sure there must be better ways to do it since it should be a pretty common task in MySQL administration.
What would be the best way to go about it?
If you can easily identify your foreign key fields (like they take the form *_id) then you can use the scripting language of your choice to modify the primary and foreign keys in the dump files by adding an "id space offset".
For example let's say you have two dump files and you know their primary key range does not exceed 1,000,000, you increment the primary and foreign keys in the second dump file by 1,000,000.
This is not entirely trivial to implement, as you will have to detect the position of the foreign key fields in the statements and then modify values at the same column position elsewhere in the statement.
If your foreign keys are not easily identifiable by a common naming convention then you must keep separate information per table about how to find their positions based on column position.
Good luck.
The best way would be that you have another database that acts as data warehouse into which you copy the contents of your app's database. After that, you don't truncate all the tables, you simply use DELETE FROM tablename - that way, your auto_increments won't get reset.
It's an ugly solution to have something exported, then truncate the database, then expect an import will proceed properly. Even if you go around the problem of clashing auto increments (there's ON DUPLICATE KEY statement that allows you to do something if a unique key constraint fails), nothing guarantees that relations between tables (foreign keys) will be preserved.
This is a broad topic and solution given is quick and not nice, some other people will probably suggest other methods, but if you are doing this to offload the db your app uses - it's a bad design. Try to google MySQL's partitioning support if you're aiming for better performance with larger data set.
For the data you've already dumped, load it into a table that doesn't use the ID column as a primary key. You don't have to define any primary key. You will have multiple rows with the same ID, but that won't impede your data analysis.
Going forward, you can set up a discipline where you dump and then DELETE the rows that are more than, say, one day old. That way the your ID will keep incrementing.
Or, you can copy this data to a table that uses the ARCHIVE storage engine. This is good for retaining data for analysis, because it compresses its contents.
精彩评论