开发者

What is a typical algorithm for finding a string within a string?

I recently had an interview question that wen开发者_如何学JAVAt something like this:

Given a large string (haystack), find a substring (needle)?

I was a little stumped to come up with a decent solution.

What is the best way to approach this that doesn't have a poor time complexity?


I like the Boyer-Moore algorithm. It's especially fun to implement when you have a lot of needles to find in a haystack (for example, probable-spam patterns in an email corpus).


You can use the Knuth-Morris-Pratt algorithm, which is O(n+m) where n is the length of the "haystack" string and m is the length of the search string.


The general problem is string searching; there are many algorithms and approaches depending on the nature of the application.

  • Knuth-Morris-Pratt, searches for a string within another string
  • Boyer-Moore, another string within string search
  • Aho-Corasick; searches for words from a reference dictionary in a given arbitrary text

Some advanced index data structures are also used in other applications. Suffix trees are used a lot in bioinformatics; here you have one long reference text, and then you have many arbitrary strings that you want to find all occurrences for. Once the index (i.e. the tree) is built, finding patterns can be done quite efficiently.

For an interview answer, I think it's better to show breadth as well. Knowing about all these different algorithms and what specific purposes they serve best is probably better than knowing just one algorithm by heart.


I do not believe that there is anything better then looking at the haystack one character at a time and try to match them against the needle.

This will have linear time complexity (against the length of the haystack). It could degrade if the needle is near the same length as the haystack and shares a long common and repeated prefix.


A typical algorithm would be as follows, with string indexes ranging from 0 to M-1.

It returns either the position of the match or -1 if not found.

foreach hpos in range 0 thru len(haystack)-len(needle):
    found = true
    foreach npos in range 0 thru len(needle):
        if haystack[hpos+npos] != needle[npos]:
            found = false
            break
    if found:
        return hpos
return -1

It has a reasonable performance since it only checks as many characters in each haystack position as needed to discover there's no chance of a match.

It's not necessarily the most efficient algorithm but if your interviewer expects you to know all the high performance algorithms off the top of your head, they're being unrealistic (i.e., fools). A good developer knows the basics well, and how to find out the advanced stuff where necessary (e.g., when there's a performance problem, not before).

The performance ranges between O(a) and O(ab) depending on the actual data in the strings, where a and b are the haystack and needle lengths respectively.

One possible improvement is to store, within the npos loop, the first location greater than hpos, where the character matches the first character of the needle.

That way you can skip hpos forward in the next iteration since you know there can be no possible match before that point. But again, that may not be necessary, depending on your performance requirements. You should work out the balance between speed and maintainability yourself.


This problem is discussed in Hacking a Google Interview Practice Questions – Person A. Their sample solution:

bool hasSubstring(const char *str, const char *find) {
    if (str[0] == '\0' && find[0] == '\0')
        return true;

    for(int i = 0; str[i] != '\0'; i++) {
        bool foundNonMatch = false;
        for(int j = 0; find[j] != '\0'; j++) {
            if (str[i + j] != find[j]) {
                foundNonMatch = true;
                break;
            }
        }
        if (!foundNonMatch)
            return true;
    }
    return false;
}


Here is a presentation of a few algorithm (shamelessly linked from Humboldt University). contains a few good algorithms such as Boyer More and Z-box.

I did use the Z-Box algorithm, found it to work well and was more efficient than Boyer-More, however needs some time to wrap your head around it.


the easiest way I think is "haystack".Contains("needle") ;) just fun,dont take it seriously.I think already you have got your answer.

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜