开发者

Detect largest suffix of some file that is prefix of another file

I have two files - let's call them file0 and file1.

What I would like t开发者_StackOverflowo get is a fast algorithm for the following problem (it is clear to me how to write a rather slow algorithm that solves it):

Detect the largest suffix of file0 that is a prefix of file1, that means a memory block B (or more precisely: the number of bytes of such a memory block) of maximum length so that

  • file0 consists of some memory block A, followed by B
  • file1 constist of memory block B, followed by some memory block C

Note that the blocks A, B and C can also have a length of zero bytes.

Edit (to answer drysdam's remark): the obvious rather slow algorithm that I think of (Pseudocode): let the length of the files be bounded by m, n with wlog m<=n.

for each length from m to 0
    compare the m last bytes of file0 with the first m bytes of file1
    if they are equal
        return length

This is obviously an O(m*min(m, n)) algorithm. If the files are about the same size this is O(n^2).

The files that I have to handle currently are of size of 10 up to a few hundread megabytes. But in extreme cases they can also be of size of a few gigabytes - just big enough not to fit into the 32 bit adress space of x86 anymore.


Consider treating your bytes as numbers 0..255 held as integers mod p, where p is a prime, optionally much larger than 255. Here are two ways of computing b0*x^2 + b1*x + b2:

(b0*x + b1)*x + b2

b0*x^2 + (b1*x + b2).

Therefore I can compute this quantity efficiently either by working from left to right - multiply by x and adding b2, or by working right to left - adding b0*x^2.

Pick a random x and compute this working from right to left in AB and from left to right in BC. If the values computed match, you note down the location. Later do a slow check of all the matches starting with the longest to see if the B really is identical in both cases.

What is the chance of a match at random? If you have a false match then (a0 - c0)*x^2 + (a1 - c1)*x + (a2 - c2) = 0. A polynomial of degree d has at most d roots, so if x is random the chance of a false match is at most d / p, and you can make this small by working mod p for suitably large p. (If I remember rightly there is a scheme for message authentication which has this idea at its heart).


Depending on how much memory you have available, you may want to consider building a suffix tree for the first file. Once you have this, you can query the prefix of the second file that maximally overlaps with a suffix of the second file by just walking the suffix tree down from the root along the edges matching the letters of the prefix of the second file. Since suffix trees can be built in linear time, the runtime for this algorithm is O(|A| + |B|) using your terminology, since it takes O(|A| + |B|) time to build the suffix tree and O(|B|) time to walk the suffix tree to find the block B.


If it is not an academical assignment then It might make sense to implement the simplest solution and see how it behaves on your data.

For example, a theoretically more efficient Knuth-Morris-Pratt algorithm -based solution can perform worse than IndexOf-based solution (see Overlap Detection).

For large files your program might spent all the time waiting for I/O.

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜