Approximate string matching using backtracking

2023-04-09 00:15

I would like to use backtracking to search for all substrings in a long string allowing for variable length matches - that is, matches allowing for a maximum given number of mismatches, insertions, and deletions. I have not been able to locate any useful examples. The closest I have found is this paper here, but that is terribly complex. Anyone?

Cheers,

Martin

Algorithm

The function ff() below uses recursion (i.e. backtracking) to solve your problem. The basic idea is that at the start of any call to f(), we are trying to match a suffix t of the original "needle" string against the beginning of a suffix s of the "haystack" string, while allowing only a certain number of each type of edit operation.

// Needs <stdio.h>. ss is the start of the haystack, used only for reporting the match endpoints.
void f(const char* ss, const char* s, const char* t, int mm, int ins, int del) {
    while (*s && *s == *t) ++s, ++t;    // OK to always match longest segment
    if (!*t) printf("%d\n", (int)(s - ss));    // Matched; print endpoint of match
    if (mm && *s && *t) f(ss, s + 1, t + 1, mm - 1, ins, del);  // Mismatch: skip one char in each string
    if (ins && *s) f(ss, s + 1, t, mm, ins - 1, del);           // Insertion: skip an extra haystack char
    if (del && *t) f(ss, s, t + 1, mm, ins, del - 1);           // Deletion: skip a needle char
}

// Find all occurrences of t starting at any position in s, with at most
// mm mismatches, ins insertions and del deletions.
void ff(const char* s, const char* t, int mm, int ins, int del) {
    for (const char* ss = s; *s; ++s) {
//      printf("Starting from offset %d...\n", (int)(s - ss));
        f(ss, s, t, mm, ins, del);
    }
}

Example call:

ff("xxabcydef", "abcdefg", 1, 1, 1);

This outputs:

9
9

because the search finds two alignments of "abcdefg" within "xxabcydef" that use at most 1 of each kind of change, and both of them end at position 9:

Haystack: xxabcydef-
Needle:     abc-defg

which has 1 insertion (of y) and 1 deletion (of g), and

Haystack: xxabcydef-
Needle:     abcd-efg

which has 1 substitution (of d by y), 1 insertion (of d), and 1 deletion (of g).

Dominance Relation

It may not be obvious why it's actually safe to use the while loop on line 3 (the first statement of f()) to greedily match as many characters as possible at the start of the two strings. Doing so may reduce the number of times that a particular end position is reported as a match, but it will never cause an end position to be missed altogether. Since we're usually interested only in whether or not there is a match ending at a given position of the haystack, and since without this while loop the algorithm would always take time exponential in the needle size, this seems a win-win.
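If only the distinct end positions matter, the repeated reports can be collapsed by marking positions in an array instead of printing them. The following is a minimal sketch of that idea, not part of the original answer: mark_ends() and distinct_ends() are hypothetical names, and the caller is assumed to supply a seen[] buffer with at least strlen(s) + 1 entries.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Same recursion as f(), but records each end offset in seen[] instead of
   printing it. seen[] is an assumed caller-supplied buffer. */
static void mark_ends(const char *ss, const char *s, const char *t,
                      int mm, int ins, int del, unsigned char *seen) {
    while (*s && *s == *t) ++s, ++t;          /* greedy match, as in f() */
    if (!*t) seen[s - ss] = 1;                /* needle exhausted: record end offset */
    if (mm && *s && *t) mark_ends(ss, s + 1, t + 1, mm - 1, ins, del, seen);
    if (ins && *s) mark_ends(ss, s + 1, t, mm, ins - 1, del, seen);
    if (del && *t) mark_ends(ss, s, t + 1, mm, ins, del - 1, seen);
}

/* Tries every start offset, as ff() does, and returns the number of distinct
   end offsets. On return, seen[i] is 1 exactly for the offsets that were hit. */
static size_t distinct_ends(const char *s, const char *t,
                            int mm, int ins, int del, unsigned char *seen) {
    size_t n = strlen(s), count = 0;
    assert(seen != NULL);
    memset(seen, 0, n + 1);
    for (const char *p = s; *p; ++p)
        mark_ends(s, p, t, mm, ins, del, seen);
    for (size_t i = 0; i <= n; ++i)
        count += seen[i];
    return count;
}
```

Called on the example above, distinct_ends("xxabcydef", "abcdefg", 1, 1, 1, seen) reports a single distinct end position, 9, even though f() prints it twice.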

It is guaranteed to work because of a dominance relation. To see this, suppose the opposite -- that it is in fact unsafe (i.e. misses some matches). Then there would be some match in which the initial run of equal characters in the two strings is not aligned character for character, for example:

Haystack: abbbbc
Needle:   a-b-bc

However, any such match can be transformed into another match having the same number of operations of each type, and ending at the same position, by shunting the leftmost character following a gap to the left of the gap:

Haystack: abbbbc
Needle:   ab--bc

If you do this repeatedly until it's no longer possible to shunt characters without requiring a substitution, you will get a match in which the largest initial segment of equal characters from both strings is aligned character for character:

Haystack: abbbbc
Needle:   abb--c

My algorithm will find all such matches, so it follows that no match position will be overlooked by it.

Exponential Time

Like any backtracking program, this function will exhibit exponential slowdowns on certain inputs. Of course, it may be that on the inputs you happen to use, this doesn't occur, and it works out faster than particular implementations of DP algorithms.
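To get a feel for the growth, one can count recursive calls on an input where nothing ever matches, so every branch allowed by the remaining budgets is explored. This is a rough sketch for experimentation, with f_counted() and count_calls() as hypothetical names; it counts calls from a single start offset and does not report matches.

```c
#include <assert.h>

/* Same branching as f(), but only counts how many calls are made. */
static void f_counted(const char *s, const char *t,
                      int mm, int ins, int del, unsigned long *calls) {
    ++*calls;
    while (*s && *s == *t) ++s, ++t;
    if (mm && *s && *t) f_counted(s + 1, t + 1, mm - 1, ins, del, calls);
    if (ins && *s) f_counted(s + 1, t, mm, ins - 1, del, calls);
    if (del && *t) f_counted(s, t + 1, mm, ins, del - 1, calls);
}

/* Number of calls made when matching t against s from offset 0. */
static unsigned long count_calls(const char *s, const char *t,
                                 int mm, int ins, int del) {
    unsigned long calls = 0;
    assert(s && t);
    f_counted(s, t, mm, ins, del, &calls);
    return calls;
}
```

For example, with s = "aaaaaaaa" and t = "bbbbbbbb", budgets of (0, 0, 0) need 1 call and budgets of (1, 1, 1) need 16, one for each ordered sequence of distinct operations; the count keeps climbing quickly as the budgets grow.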

I would start with Levenshtein's distance algorithm, which is the standard approach when checking for string similarities via mismatch, insertion and deletion.

Since the algorithm uses bottom up dynamic programming, you'll probably be able to find all substrings without having to execute the algorithm for each potential substring.
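A minimal sketch of that idea follows. It is the usual substring variant of the Levenshtein recurrence (often attributed to Sellers), in which the cost of matching an empty pattern prefix is 0 at every text position, so a match may start anywhere. Note the assumptions: it uses a single total budget k rather than separate budgets per operation type, and sellers_ends() and the caller-supplied ends[] buffer (at least strlen(text) + 1 entries) are hypothetical names.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Sets ends[j] = 1 if some substring of text ending just before index j is
   within edit distance k of pat. Returns the number of such j, or -1 if
   memory allocation fails. Runs in O(mn) time and O(m) extra space. */
static int sellers_ends(const char *text, const char *pat, int k,
                        unsigned char *ends) {
    size_t n = strlen(text), m = strlen(pat);
    int count = 0;
    int *col = malloc((m + 1) * sizeof *col);
    assert(ends != NULL);
    if (!col) return -1;

    for (size_t i = 0; i <= m; ++i) col[i] = (int)i;  /* empty text: i deletions */
    ends[0] = (unsigned char)(col[m] <= k);
    count += ends[0];

    for (size_t j = 1; j <= n; ++j) {
        int diag = col[0];           /* D[0][j-1] */
        col[0] = 0;                  /* a match may start at any text position */
        for (size_t i = 1; i <= m; ++i) {
            int up_left = diag;      /* D[i-1][j-1] */
            diag = col[i];           /* D[i][j-1], saved before overwriting */
            int sub = up_left + (pat[i - 1] != text[j - 1]);
            int ins = diag + 1;          /* extra text character */
            int del = col[i - 1] + 1;    /* pattern character skipped */
            int best = sub < ins ? sub : ins;
            col[i] = best < del ? best : del;
        }
        ends[j] = (unsigned char)(col[m] <= k);
        count += ends[j];
    }
    free(col);
    return count;
}
```

With text "xxabcydef", pattern "abcdefg" and k = 2, ends[9] is set, which agrees with the alignment that uses one insertion and one deletion.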

The nicest algorithm I'm aware of for this is A Fast Bit-Vector Algorithm for Approximate String Matching Based on Dynamic Programming by Gene Myers. Given a text to search of length n, a pattern string to search for of length m and a maximum number of mismatches/insertions/deletions k, this algorithm takes time O(mn/w), where w is your computer's word size (32 or 64). If you know much about algorithms on strings, it's actually pretty incredible that an algorithm exists that takes time independent of k -- for a long time, this seemed an impossible goal.

I'm not aware of an existing implementation of the above algorithm. If you want a tool, agrep may be just what you need. It uses an earlier algorithm that takes time O(mnk/w), but it's fast enough for low k -- miles ahead of a backtracking search in the worst case.

agrep is based on the Shift-Or (or "Bitap") algorithm, which is a very clever dynamic programming algorithm that manages to represent its state as bits in an integer and get word-wide bitwise operations (shifts, ANDs and ORs) to do most of the work of updating the state, which is what speeds up the algorithm by a factor of 32 or 64 over a more typical implementation. :) Myers's algorithm also uses this idea to get its 1/w speed factor.
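To make the bit-packing idea concrete, here is a small sketch of the exact-matching core of Shift-Or. It is not agrep's code and it handles no errors at all; the approximate version keeps k + 1 such state words, which is not shown here. The name shift_or_find() is hypothetical, and the pattern is assumed to be 1 to 64 characters long so the state fits in one 64-bit word.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Returns the offset one past the end of the first exact occurrence of pat
   in text, or -1 if there is none. */
static long shift_or_find(const char *text, const char *pat) {
    size_t m = strlen(pat);
    uint64_t mask[256];
    uint64_t state = ~UINT64_C(0);   /* bit i is 0 iff pat[0..i] matches, ending here */
    uint64_t accept;

    assert(m >= 1 && m <= 64);
    accept = UINT64_C(1) << (m - 1);

    /* mask[c] has bit i cleared exactly when pat[i] == c. */
    for (int c = 0; c < 256; ++c) mask[c] = ~UINT64_C(0);
    for (size_t i = 0; i < m; ++i)
        mask[(unsigned char)pat[i]] &= ~(UINT64_C(1) << i);

    for (size_t j = 0; text[j]; ++j) {
        /* One shift and one OR update all m pattern prefixes at once. */
        state = (state << 1) | mask[(unsigned char)text[j]];
        if ((state & accept) == 0) return (long)(j + 1);
    }
    return -1;
}
```

For instance, shift_or_find("xxabcydef", "abc") returns 5, the same kind of end offset that f() reports.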
