What is a typical algorithm for finding a string within a string?

2022-12-24 23:40 问答作者：

I recently had an interview question that wen开发者_如何学JAVAt something like this:

Given a large string (haystack), find a substring (needle)?

I was a little stumped to come up with a decent solution.

What is the best way to approach this that doesn't have a poor time complexity?

I like the Boyer-Moore algorithm. It's especially fun to implement when you have a lot of needles to find in a haystack (for example, probable-spam patterns in an email corpus).

You can use the Knuth-Morris-Pratt algorithm, which is O(n+m) where n is the length of the "haystack" string and m is the length of the search string.

The general problem is string searching; there are many algorithms and approaches depending on the nature of the application.

Knuth-Morris-Pratt, searches for a string within another string
Boyer-Moore, another string within string search
Aho-Corasick; searches for words from a reference dictionary in a given arbitrary text

Some advanced index data structures are also used in other applications. Suffix trees are used a lot in bioinformatics; here you have one long reference text, and then you have many arbitrary strings that you want to find all occurrences for. Once the index (i.e. the tree) is built, finding patterns can be done quite efficiently.

For an interview answer, I think it's better to show breadth as well. Knowing about all these different algorithms and what specific purposes they serve best is probably better than knowing just one algorithm by heart.

I do not believe that there is anything better then looking at the haystack one character at a time and try to match them against the needle.

This will have linear time complexity (against the length of the haystack). It could degrade if the needle is near the same length as the haystack and shares a long common and repeated prefix.

A typical algorithm would be as follows, with string indexes ranging from 0 to M-1.

It returns either the position of the match or -1 if not found.

foreach hpos in range 0 thru len(haystack)-len(needle):
    found = true
    foreach npos in range 0 thru len(needle):
        if haystack[hpos+npos] != needle[npos]:
            found = false
            break
    if found:
        return hpos
return -1

It has a reasonable performance since it only checks as many characters in each haystack position as needed to discover there's no chance of a match.

It's not necessarily the most efficient algorithm but if your interviewer expects you to know all the high performance algorithms off the top of your head, they're being unrealistic (i.e., fools). A good developer knows the basics well, and how to find out the advanced stuff where necessary (e.g., when there's a performance problem, not before).

The performance ranges between O(a) and O(ab) depending on the actual data in the strings, where a and b are the haystack and needle lengths respectively.

One possible improvement is to store, within the npos loop, the first location greater than hpos, where the character matches the first character of the needle.

That way you can skip hpos forward in the next iteration since you know there can be no possible match before that point. But again, that may not be necessary, depending on your performance requirements. You should work out the balance between speed and maintainability yourself.

This problem is discussed in Hacking a Google Interview Practice Questions – Person A. Their sample solution:

bool hasSubstring(const char *str, const char *find) {
    if (str[0] == '\0' && find[0] == '\0')
        return true;

    for(int i = 0; str[i] != '\0'; i++) {
        bool foundNonMatch = false;
        for(int j = 0; find[j] != '\0'; j++) {
            if (str[i + j] != find[j]) {
                foundNonMatch = true;
                break;
            }
        }
        if (!foundNonMatch)
            return true;
    }
    return false;
}

Here is a presentation of a few algorithm (shamelessly linked from Humboldt University). contains a few good algorithms such as Boyer More and Z-box.

I did use the Z-Box algorithm, found it to work well and was more efficient than Boyer-More, however needs some time to wrap your head around it.

the easiest way I think is "haystack".Contains("needle") ;) just fun,dont take it seriously.I think already you have got your answer.

继续阅读：algorithm language-agnostic

What is a typical algorithm for finding a string within a string?

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？