diff 2 large database tables

2023-03-01 12:47 问答作者：

开发者_运维问答given 2 large tables(imagine hundreds of millions of rows), each one has a string column, how do you get the diff?

Check out the open-source Percona Toolkit ---specifically, the pt-table-sync utility.

Its primary purpose is to sync a MySQL table with its replica, but since its output is the set of MySQL commands necessary to reconcile the differences between two tables, it's a natural fit for comparing the two.

What it actually does under the hood is a bit complex, and it actually uses different approaches depending on what it can tell about your tables (indexes, etc.), but one of the basic ideas is that it does fast CRC32 checksums on chunks of the indexes, and if the checksums don't match, it examines those records more closely. Note that this method is much faster than walking both indexes linearly and comparing them.

It only gets you part of the way, though. Because the generated commands are intended to sync a replica with its master, they simply replace the current contents of the replica for all differing records. In other words, the commands generated modify all fields in the record (not just the ones that have changed). So once you use pt-table-sync to find the diffs, you'd need to wrap the results in something to examine the differing records by comparing each field in the record.

But pt-table-sync does what you already knew to be the hard part: detecting diffs, really fast. It's written in Perl; the source should provide good breadcrumbs.

I'd think about creating an index on that column in each DB, then using a program to process through each DB in parallel using an ordering on that column. It would advance in both as you have records that are equal, and in one or the other as you find they are out of sync (keeping track of the out of sequence records). The creation of the index could be very costly in terms of both time and space (at least initially). Keeping it updated, though, if you are going to continue adding records may not add to much overhead. Once you have the index in place you should be able to process the difference in linear time. Producing the index -- assuming you have enough space -- should be an O(nlogn) operation.

继续阅读：algorithm database

diff 2 large database tables

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？