开发者

Algorithm for Source Control System?

I need to write a simple source control system and wonder what algorithm I would use for file differences?

I don't want to look into existing source code due to license concerns. I need to have it licensed under MPL so I can't look at any of the existing systems like CVS or Mercurial as they are all GPL licensed.

Just to give some background, I just need some really simple functions - binary files in a folder. no subfolders and every file behaves 开发者_如何学Golike it's own repository. No Metadata except for some permissions.

Overall really simple stuff, my single concern really is how to store only the differences of a file from revision to revision without wasting too much space but also without being too inefficient (Maybe store a full version every X changes, a bit like Keyframes in Videos?)


Longest Common Subsequence algorithms are the primary mechanism used by diff-like tools, and can be leveraged by a source code control system.

"Reverse Deltas" are a common approach to storage, since you primarily need to move backwards in time from the most recent revision.


Patience Diff is a good algorithm for finding deltas between two files that are likely to make sense to people. This often gives better results than the naive "longest common subsequence" algorithm, but results are subjective.

Having said that, many modern revision control systems store complete files at each stage, and compute the actual differences later, only when needed. For binary files (which probably aren't terribly compressible), you may find that storing reverse deltas might be ultimately more efficient.


How about looking the source code of Subversion ? its licensed under Apache License 2.0


Gene Myers has written a good paper An O(ND) Difference Algorithm and its Variations. When it comes to comparing sequences, Myers is the man. You probably should also read Walter Tichy's paper on RCS; it explains how to store a set of files by storing the most recent version plus differences.


The idea of storing deltas (forwards or backwards) is classic with respect to version control. The issue has always been, "what delta do you store?"

Lots of source control systems store deltas as computed essentially by "diff", e.g, line-oriented complement of longest-common-subsequences. But you can compute deltas for specific types of documents in a way specific to those documents, to get smaller (and often more understandable) deltas.

For programming languages source code, one can compute Levenshtein distances over program structures. A set of tools for doing essentially this for a variety of popular programming langauges can be found at Smart Differencer

If you are storing non-text files, you might be able to take advantage of their structure to compute smaller deltas.

Of course, if what you want is a minimal implementation, then just storing the complete image of each file version is easy. Terabyte disks make that solution workable if not pretty. (The PDP10 file system used to do this implicitly).


Though fossil is GPL, the delta algorithm is based on rsync and described here


I was actually thinking about something similar to this the other day... (odd, huh?)

I don't have a great answer for you but I did come to the conclusion that if I were to write a file diff tool, that I would do so with an algorithm (for finding diffs) that functions somewhat like how REGEXes function with their greediness.

As for storing DIFFs... If I were you, instead of storing forward-facing DIFFs (i.e. you start with your original file and then computer 150 diffs against it when you're working with version 151), use stored DIFFs for your history but have your latest file stored as a full version. If you do it this way, then whenever you're working with the latest file (which is probably 99% of the time), you'll get the best performance.

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜