开发者

efficient algorithm for searching one of several strings in a text?

I need to search incoming not-very-long pieces of text for occurrences of given strings. The strings are constant for the whole session开发者_JS百科 and are not many (~10). Additional simplification is that none of the strings is contained in any other.

I am currently using boost regex matching with str1 | str2 | .... The performance of this task is important, so I wonder if I can improve it. Not that I can program better than the boost guys, but perhaps a dedicated implementation is more efficient than a general one.

As the strings stay constant over long time, I can afford building a data structure, like a state transition table, upfront.

e.g., if the strings are abcx, bcy and cz, and I've read so far abc, I should be in a combined state that means you're either 3 chars into string 1, 2 chars into string 2 or 1 char into string 1. Then reading x next will move me to the string 1 matched state etc., and any char other than xyz will move to the initial state, and I will not need to retract back to b.

Any ideas or references are appreciated.


Check out the Aho–Corasick string matching algorithm!


Take a look at Suffix Tree.


Look at this: http://www.boost.org/doc/libs/1_44_0/libs/regex/doc/html/boost_regex/configuration/algorithm.html

The existence of a recursive/non-recursive distinction is a pretty strong suggestion that BOOST is not necessarily a linear-time discrete finite-state machine. Therefore, there's a good chance you can do better for your particular problem.

The best answer depends quite a bit on how many haystacks you have and the minimum size of a needle. If the smallest needle is longer than a few characters, you may be able to do a little bit better than a generalized regex library.

Basically all string searches work by testing for a match at the current position (cursor), and if none is found, then trying again with the cursor slid farther to the right.

Rabin-Karp builds a DFSM out of the string (or strings) for which you are searching so that the test and the cursor motion are combined in a single operation. However, Rabin-Karp was originally designed for a single needle, so you would need to support backtracking if one match could ever be a proper prefix of another. (Remember that for when you want to reuse your code.)

Another tactic is to slide the cursor more than one character to the right if at all possible. Boyer-Moore does this. It's normally built for a single needle. Construct a table of all characters and the rightmost position that they appear in the needle (if at all). Now, position the cursor at len(needle)-1. The table entry will tell you (a) what leftward offset from the cursor that the needle might be found, or (b) that you can move the cursor len(needle) farther to the right.

When you have more than one needle, the construction and use of your table grows more complicated, but it still may possibly save you an order of magnitude on probes. You still might want to make a DFSM but instead of calling a general search method, you call does_this_DFSM_match_at_this_offset().

Another tactic is to test more than 8 bits at a time. There's a spam-killer tool that looks at 32-bit machine words at a time. It then does some simple hash code to fit the result into 12 bits, and then looks in a table to see if there's a hit. You have four entries for each pattern (offsets of 0, 1, 2, and 3 from the start of the pattern) and then this way despite thousands of patterns in the table they only test one or two per 32-bit word in the subject line.

So in general, yes, you can go faster than regexes WHEN THE NEEDLES ARE CONSTANT.


I've been looking at the answers but none seem quite explicit... and mostly boiled down to a couple of links.

What intrigues me here is the uniqueness of your problem, the solutions exposed so far do not capitalize at all on the fact that we are looking for several needles at once in the haystack.

I would take a look at KMP / Boyer Moore, for sure, but I would not apply them blindly (at least if you have some time on your hand), because they are tailored for a single needle, and I am pretty convinced we could capitalized on the fact that we have several strings and check all of them at once, with a custom state machine (or custom tables for BM).

Of course, it's unlikely to improve the big O (Boyer Moore runs in 3n for each string, so it'll be linear anyway), but you could probably gain on the constant factor.


Regex engine initialization is expected to have some overhead, so if there are no real regular expressions involved, C - memcmp() should do fine.

If you can tell the file sizes and give some specific use cases, we could build a benchmark (I consider this very interesting).

Interesting: memcmp explorations and timing differences

Regards

rbo


There is always Boyer Moore


Beside Rabin-Karp-Algorithm and Knuth-Morris-Pratt-Algorithm, my Algorithm-Book suggests a Finite State Machine for String Matching. For every Search String you need to build such a Finite State Machine.


You can do it with very popular Lex & Yacc tools, with the support of Flex and Bison tools. You can use Lex for getting tokens of the string. Compare your pre-defined strings with the tokens returned from Lexer. When match is found, perform the desired action. There are many sites which describe about Lex and Yacc. One such site is http://epaperpress.com/lexandyacc/

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜