diff 2 large database tables
开发者_运维问答given 2 large tables(imagine hundreds of millions of rows), each one has a string column, how do you get the diff?
Check out the open-source Percona Toolkit ---specifically, the pt-table-sync utility.
Its primary purpose is to sync a MySQL table with its replica, but since its output is the set of MySQL commands necessary to reconcile the differences between two tables, it's a natural fit for comparing the two.
What it actually does under the hood is a bit complex, and it actually uses different approaches depending on what it can tell about your tables (indexes, etc.), but one of the basic ideas is that it does fast CRC32 checksums on chunks of the indexes, and if the checksums don't match, it examines those records more closely. Note that this method is much faster than walking both indexes linearly and comparing them.
It only gets you part of the way, though. Because the generated commands are intended to sync a replica with its master, they simply replace the current contents of the replica for all differing records. In other words, the commands generated modify all
fields in the record (not just the ones that have changed). So once you use pt-table-sync
to find the diffs, you'd need to wrap the results in something to examine the differing records by comparing each field in the record.
But pt-table-sync
does what you already knew to be the hard part: detecting diffs, really fast. It's written in Perl; the source should provide good breadcrumbs.
I'd think about creating an index on that column in each DB, then using a program to process through each DB in parallel using an ordering on that column. It would advance in both as you have records that are equal, and in one or the other as you find they are out of sync (keeping track of the out of sequence records). The creation of the index could be very costly in terms of both time and space (at least initially). Keeping it updated, though, if you are going to continue adding records may not add to much overhead. Once you have the index in place you should be able to process the difference in linear time. Producing the index -- assuming you have enough space -- should be an O(nlogn) operation.
精彩评论