Character diff/annotation algorithm

2023-02-12 21:03 问答作者：

I have a set of string representing the history of a document. Each string is the whole document - there has not yet been any di开发者_JS百科ff analysis.

I need a relatively efficient algorithm to allow me to annotate substrings of the document with the version they came from.

For example, if the document history was like this:

Rev1: The quiet fox
Rev2: The quiet brown fox
Rev3: The quick brown fox

The algorithm would give:

The quick brown fox
1111111331222222111

i.e. "The qui" was added in revision 1, "ck" was added in revision 3, " " was added in revision 1, "brown " was added in revision 2 and finally "fox" was added in revision 1.

I have a class library that can do this easily, though I don't know how well it performs performance-wise with large or many such revisions.

The library is here: DiffLib on CodePlex (you can also install it through NuGet.)

The script for your example in the question is here (you can run this in LINQPad if you add a reference to the DiffLib assembly):

void Main()
{
    var revs = new string[]
    {
        "The quiet fox",
        "The quiet brown fox",
        "The quick brown fox",
        "The quick brown fox.",
        "The quick brown fox jumped over the lazy dog.",
        "The quick brown fox jumped over the lazy cat.",
        "The Quick Brown Fox jumped over the Lazy Cat.",
    };

    string current = revs[0];
    List<int> owner = new List<int>();
    foreach (char c in current)
        owner.Add(1); // owner 1 owns entire string

    Action<int> dumpRev = delegate(int rev)
    {
        Debug.WriteLine("rev " + rev);
        Debug.WriteLine(current);
        Debug.WriteLine(new string(owner.Select(i => (char)(48 + i)).ToArray()));
        Debug.WriteLine("");
    };
    dumpRev(0);

    for (int index = 1; index < revs.Length; index++)
    {
        int ownerId = index + 1;
        var diff = new DiffLib.Diff<char>(current, revs[index]).ToArray();
        int position = 0;
        foreach (var part in diff)
        {
            if (part.Equal)
                position += part.Length1;
            else
            {
                // get rid of old owner for the part that was
                // removed or replaced
                for (int index2 = 0; index2 < part.Length1; index2++)
                    owner.RemoveAt(position);

                // insert new owner for the part that was
                // added or did replace the old text
                for (int index2 = 0; index2 < part.Length2; index2++)
                    owner.Insert(position, ownerId);
                position += part.Length2;
            }
        }
        current = revs[index];
        dumpRev(index);
    }
}

The output:

rev 0
The quiet fox
1111111111111

rev 1
The quiet brown fox
1111111111222222111

rev 2
The quick brown fox
1111111331222222111

rev 3
The quick brown fox.
11111113312222221114

rev 4
The quick brown fox jumped over the lazy dog.
111111133122222211155555555555555555555555554

rev 5
The quick brown fox jumped over the lazy cat.
111111133122222211155555555555555555555556664

rev 6
The Quick Brown Fox jumped over the Lazy Cat.
111171133172222271155555555555555555755557664

You want to use the Myers diff algorithm as implemented by Google. It's fairly fast and has implementations in lots of languages, and you can provide timeout values to keep it from wasting too much time searching for complicated differences.

The output should be pretty trivially converted to the sort of scoring that you want (credit assignment patch by patch).

Doesn't your "history" format already provide that information? If so, then its just a matter of displaying it. The most efficient method would depend on the format your history is stored in, of course, so nobody here can really provide that for you without knowing that format.

It should be noted that if you are sending the output to some kind of display device (eg: the screen), then generally your algorithim would have to be really dumb to slow things down much more than the display device will already be slowing things down.

继续阅读：algorithm diff

Character diff/annotation algorithm

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

王昌瑞《潜梦追凶》剧组庆生新锐演员未来可期？

Is it allowed to ask users to enter credit card details for own payment method?

Escaping "<" in Perl-generated XML

imessage会显示已读吗？

微信重新建群怎么建？