Diffing more quickly

2023-02-03 06:56 问答作者：

I'm working on diffing large binary files. I've implemented the celebrated Myers Diff algorithm, which produces a minimal diff. However, it is O(ND), so to diff two very different 1 MB files, I expect to take time 1 million squared = 1 trillion. That's not good!

What I'd like is an algorithm that produces a potentially non-minimal diff, but does it much faster. I know that one must exist, because Beyond Compare does it. But I don't know how!

To be sure: There are tools like xdelta or bdiff, but these produce a patch meant for computer consumption, which is different than a human-consumable diff. A patch is concerned with transforming one file into another, so it can do things like copying from previous parts of the file. A human-consumable diff is there to visually show the differences, and can only insert and delete. For example, this transfor开发者_StackOverflow社区m:

"puddi" -> "puddipuddipuddi"

would produce a small patch of "copy [0,4] to [5,9] and to [10, 14]", but a larger diff of "append 'puddipuddi'". I'm interested in algorithms that produce the larger diff.

Thanks!

Diffing is basically the same algorithm as is used in bioinformatics to align DNA sequences. These sequences are often large (millions or billions of nucleotides long), and one strategy that works well there on longer genomes is used by the program MUMmer:

Quickly find all Maximal Unique Matches (substrings that appear in both files and which cannot be extended in either direction with that condition still holding) using a suffix tree
Quickly find the longest subset of MUMs that appear in consecutive order in both files using a longest-increasing-subsequence dynamic programming algorithm
Fix this subset of MUMs in the alignment (i.e. mark those regions as matching)
If deemed necessary, perform slower (e.g. Myers) diffing on the inter-MUM regions. In your case, you would probably omit this step entirely if you found the length of the longest MUM was beneath some threshold (which you would take to be evidence that the 2 files are unrelated).

This tends to give a very good (though not guaranteed-optimal) set of aligned regions (or equivalently, a very small set of differences) whenever there are not too many differences. I'm not certain of the exact time bounds for each step, but I know that there are no n^2 or higher terms.

I believe the MUMmer program requires DNA or protein sequences, so it may not work out of the box for you, but the concepts certainly apply to general strings (e.g. files) so if you're prepared to reimplement it yourself I would recommend this approach.

From a performance standpoint as file size grows, GNU Diffutils is probably the most robust option. For your situation I'd probably use it's side-by-side comparison format, which is probably the most human friendly of the lot. Elsewise you're off taking its output in another format and doing some work to make it pretty .

A good contender, whose performance has been improving steadily, including numerous speedups, is diff-match-patch. It implements the Myers Diff algorithm in several different languages including Java and JavaScript. See the online demo for an example of the latter with pretty printed results. If you want to do line diffing study the wiki for tips there on how to use it for that purpose.

继续阅读：algorithm diff

Diffing more quickly

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？