Fast algorithm for searching for substrings in a string
I'd like an efficient algorithm (or library) that I can use in Java to search for substrings in a string.
What I would like to do is:
Given an input string - INSTR:
"BCDEFGH"
And a set of candidate strings - CAND:
"AB", "CDE", "FG", "H", "IJ"
Find any CAND strings that match as substrings开发者_运维问答 within INSTR
In this example I would match "CDE", "FG", and "H" (but not "AB" and "IJ")
There could be many thousand candidate strings (in CAND), but more importantly I will be doing this search many millions of times so I need it to be FAST.
I'd like to work with char arrays. Also, I'm not intested in architectural solutions, like distributing the search - just the most efficient function/algorithm for doing it locally.
Additionally, all the strings in CAND and INSTR will all be relatively small (< 50 chars) - i.e. the target string INSTR is NOT long relative to the candidate strings.
Update I should have mentioned, the set of CAND strings is invariant across all values of INSTR.
Update I only need to know that there was a match - and i don't need to know what the match was.
Final Update I opted to try AhoCorsick and Rabin-Karp, due to simplicity of implementation. Because I have variable length patterns I used a modified Rabin-Karp that hashes the first n characters of each pattern, where n is the length of the smallest pattern, N was then the length of my rolling substring search window. For the Aho Corsick I used this
In my test i searched for 1000 patterns in two documents news paper articles, averaged across 1000 iterations etc... Normalised times to complete were:
AhoCorsick: 1
RabinKarp: 1.8
Naive Search (check each pattern & use string.contains): 50
*Some resources describing the algos mentioned in the answers below:
http://www.seas.gwu.edu/~simhaweb/cs151/lectures/module5/module5.html
http://www.cs.princeton.edu/courses/archive/spr09/cos226/lectures/18SubstringSearch-2x2.pdf
http://www-igm.univ-mlv.fr/~lecroq/string/index.html*
Read up on the Aho-Corasick algorithm and the Rabin-Karp algorithm.
If the input is not too large, you don't want to repeat the search many times and you do not have many patterns, it might be a good idea to use a single pattern algorithm several times. The Wikipedia article on search algorithms gives many algorithms with running and preprocessing times.
Implementations:
- https://hkn.eecs.berkeley.edu/~dyoo/java/index.html
- http://algs4.cs.princeton.edu/53substring/RabinKarp.java.html
- https://github.com/hankcs/AhoCorasickDoubleArrayTrie
- https://github.com/RokLenarcic/AhoCorasick
- https://github.com/robert-bor/aho-corasick
Presentations:
- http://www.slideshare.net/taka111/ahocorasick-string-matching-algorithm-15078438
Convert the set of candidate strings into a deterministic finite state automaton and then run through the input string in linear time. Converting a single string into a DFS is well-covered in the standard books. You can convert a set of strings by first constructing a non-deterministic automaton and then determinizing it. That can create exponential blow-up in the worst case in the size of the automaton but the search afterwards is fast; especially if the target string is long and the candidates short that's going to work well.
This is what regular expressions are for. As noted above, finite state automata are what you need, but that is exactly how a standard regexp-matcher is implemented.
In java you could write something like:
StringBuilder sb = new StringBuilder();
bool first = true;
for (String subStr : substrings) {
if (first)
first = false;
else
sb.append('|');
sb.append(escape(subStr));
}
Pattern p = Pattern.compile(sb.toString());
the method escape
should escape any characters which have special meanings in a regexp.
Rabin-Karp multiple pattern search appears to be the fastest.
You might want to look into Aho-Corasick algorithm and related algorithms. I don't know of any libraries that implement this, offhand, but this is the classic way of solving this problem.
Also check the Boyer-Moore algorithm for single-string pattern matching.
We can take advantage of the small size (< 50 char) of the strings to build a super fast algo for this case, at the cost of memory.
We can hash all possible substrings of INSTR in a hash one time that will cost O(n^2) time. Then regardless of the number of CAND strings, the lookup will be O(1). Worth it for a very large number of CAND strings.
If INSTR is large, then we can build a suffix array and not sort it, so that the top item is the longest (=N) and bottom item is the last char of INSTR. Now for each CAND string, only search from the top as long as length(CAND) <= length(suffix). Each of those comparisons will be O(n).
Another solution is to use a suffix array for the INSTR.
Since the INSTR is small you can sort it with bubble sort.
Afterwards you can search for a specific CAND string in O(logN) time,
where N = length(suffix_array) = length(INSTR).
Here are some implementation of fast String search algorithms in Java.
import java.util.Scanner;
public class StringMatch
{
static int temp,i=0,j=0; static boolean flag=true,matcher=false;
static String str=null,mstr=null;static char astr[],amstr[];
static void getter(){
Scanner sc = new Scanner(System.in);
str = sc.nextLine();
//String str="today is Monday";
astr=str.toCharArray();
mstr = sc.nextLine();
//String mstr="is";
amstr=mstr.toCharArray();
}
static void stringMatch(){
while(i<astr.length){
if(astr[i]==amstr[j]){
while((j!=amstr.length)&&flag){temp=i;
if(astr[i]!=amstr[j]) {flag=false;matcher=false;}
else{matcher=true;}
i++;j++;
//System.out.println(i+"\t"+j);
}if(matcher==true)break;i=temp;}i++;j=0;flag=true;
}
if(matcher==true) {System.out.println("true");}
else {System.out.println("false");}
}
public static void main(String[] args) {
StringMatch.getter();
StringMatch.stringMatch();
}
}
精彩评论