Detect largest suffix of some file that is prefix of another file

2023-02-20 19:12 问答作者：

I have two files - let's call them file0 and file1.

What I would like t开发者_StackOverflowo get is a fast algorithm for the following problem (it is clear to me how to write a rather slow algorithm that solves it):

Detect the largest suffix of file0 that is a prefix of file1, that means a memory block B (or more precisely: the number of bytes of such a memory block) of maximum length so that

file0 consists of some memory block A, followed by B
file1 constist of memory block B, followed by some memory block C

Note that the blocks A, B and C can also have a length of zero bytes.

Edit (to answer drysdam's remark): the obvious rather slow algorithm that I think of (Pseudocode): let the length of the files be bounded by m, n with wlog m<=n.

for each length from m to 0
    compare the m last bytes of file0 with the first m bytes of file1
    if they are equal
        return length

This is obviously an O(m*min(m, n)) algorithm. If the files are about the same size this is O(n^2).

The files that I have to handle currently are of size of 10 up to a few hundread megabytes. But in extreme cases they can also be of size of a few gigabytes - just big enough not to fit into the 32 bit adress space of x86 anymore.

Consider treating your bytes as numbers 0..255 held as integers mod p, where p is a prime, optionally much larger than 255. Here are two ways of computing b0*x^2 + b1*x + b2:

(b0*x + b1)*x + b2

b0*x^2 + (b1*x + b2).

Therefore I can compute this quantity efficiently either by working from left to right - multiply by x and adding b2, or by working right to left - adding b0*x^2.

Pick a random x and compute this working from right to left in AB and from left to right in BC. If the values computed match, you note down the location. Later do a slow check of all the matches starting with the longest to see if the B really is identical in both cases.

What is the chance of a match at random? If you have a false match then (a0 - c0)*x^2 + (a1 - c1)*x + (a2 - c2) = 0. A polynomial of degree d has at most d roots, so if x is random the chance of a false match is at most d / p, and you can make this small by working mod p for suitably large p. (If I remember rightly there is a scheme for message authentication which has this idea at its heart).

Depending on how much memory you have available, you may want to consider building a suffix tree for the first file. Once you have this, you can query the prefix of the second file that maximally overlaps with a suffix of the second file by just walking the suffix tree down from the root along the edges matching the letters of the prefix of the second file. Since suffix trees can be built in linear time, the runtime for this algorithm is O(|A| + |B|) using your terminology, since it takes O(|A| + |B|) time to build the suffix tree and O(|B|) time to walk the suffix tree to find the block B.

If it is not an academical assignment then It might make sense to implement the simplest solution and see how it behaves on your data.

For example, a theoretically more efficient Knuth-Morris-Pratt algorithm -based solution can perform worse than IndexOf-based solution (see Overlap Detection).

For large files your program might spent all the time waiting for I/O.

继续阅读：algorithm prefix

Detect largest suffix of some file that is prefix of another file

更多精彩内容

精彩评论

最新问答

知某换热壁面的污垢热阻为0.0003(m2•K),若该换热壁面刚投...？

装修发票的税率是多少？

那个治不孕？

王者荣耀神钩对决怎么进?？

和平谁愿意当我师傅我有点技术但不多？

问答排行榜

Escaping "<" in Perl-generated XML

Is it allowed to ask users to enter credit card details for own payment method?

imessage会显示已读吗？

微信重新建群怎么建？

Heroku and DB GUI