
Better than O(n) glob matcher?

The problem: Given a list of globs, I need to find (and return) a glob from the list that a given string matches, or definitively determine that none match. Excluding setup time, performance must be better than a linear search over all the globs:

foreach glob in list:
  if glob.matches(string):
    return glob
return None

The question: Are there any available libraries (C++ preferred) for this?


Edit: After a bit more thought, I think I can argue that this can be done. Given that globs are more or less regexes with a different syntax, a runtime version of lex that uses glob syntax would fit the bill.

Given that the problem can be trivially reduced to a known problem, I'm now only interested in implemented solutions.


Convert your globs into regular expressions (a series of simple string manipulations achieves this: * becomes .*, and so on). Combine them into a single regular expression using |, assigning each glob to its own capture group so that you can determine which glob matched. Rely on your favourite regex library to compile the combined expression into a DFA that is hopefully cheaper to process than a linear walk of the constituent parts, where this is possible; in the most general case, however, it will not be faster.
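A minimal sketch of this approach in C++ using std::regex. It is an assumption-laden illustration, not a library: only * and ? are translated, character classes like [abc] are not handled, and the combined pattern is recompiled on every call for brevity.

```cpp
#include <cassert>
#include <cctype>
#include <regex>
#include <string>
#include <vector>

// Translate one glob into ECMAScript regex syntax: '*' -> ".*",
// '?' -> '.', everything else escaped as a literal.
std::string glob_to_regex(const std::string& glob) {
    std::string out;
    for (char c : glob) {
        if (c == '*') {
            out += ".*";
        } else if (c == '?') {
            out += '.';
        } else if (std::isalnum(static_cast<unsigned char>(c))) {
            out += c;
        } else {
            out += '\\';
            out += c;
        }
    }
    return out;
}

// Combine all globs into one alternation, one capture group per glob,
// and return the index of the glob that matched (or -1 for no match).
int match_glob(const std::vector<std::string>& globs, const std::string& s) {
    std::string combined;
    for (size_t i = 0; i < globs.size(); ++i) {
        if (i) combined += '|';
        combined += '(' + glob_to_regex(globs[i]) + ')';
    }
    std::regex re(combined);
    std::smatch m;
    if (!std::regex_match(s, m, re)) return -1;
    // glob_to_regex emits no groups of its own, so group i+1 <=> glob i.
    for (size_t i = 1; i < m.size(); ++i)
        if (m[i].matched) return static_cast<int>(i) - 1;
    return -1;
}
```

In real use you would build the std::regex once at setup time and reuse it. Note that ECMAScript alternation tries branches left to right, so the earliest listed glob wins when several would match.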


Globs are a subset of regular expressions (with respect to expressive power, not exact syntax). Globs can therefore be converted into deterministic finite automata (DFAs), and those can be combined into a single DFA that recognizes the union of the individual DFAs. A DFA runs in O(n), with n being the length of the string. The number of globs the automaton is built from only influences the setup time (i.e. constructing the automaton), not the run time.


Probably only in some particular cases, namely if you can somehow predict that certain globs will not match your string.


I don't think it's possible to have better than linear time in the number of globs. In order to prove that a string doesn't match any of the globs, you have to check for a match against every one, or you'll never know whether one you skipped matched or not.

EDIT: In the general case this isn't possible using globs. As noted in a comment, some combinations of globs can be merged (at first guess a trie might be useful, where each node indicates the next letter to match and which globs still need to be checked), reducing the amount of work per search.
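A rough sketch of that pruning idea, using a prefix bucket rather than a full trie: index each glob under the literal prefix that precedes its first wildcard, and at lookup time only run a real match (here POSIX fnmatch()) on globs whose prefix actually matches the string. The GlobIndex type and its layout are hypothetical, not taken from any library.

```cpp
#include <fnmatch.h>

#include <cassert>
#include <map>
#include <string>
#include <vector>

struct GlobIndex {
    std::vector<std::string> globs;
    // literal prefix (text before the first wildcard) -> glob indices
    std::map<std::string, std::vector<size_t>> buckets;

    explicit GlobIndex(std::vector<std::string> gs) : globs(std::move(gs)) {
        for (size_t i = 0; i < globs.size(); ++i) {
            // substr(0, npos) yields the whole glob when there is no wildcard.
            size_t wc = globs[i].find_first_of("*?[");
            buckets[globs[i].substr(0, wc)].push_back(i);
        }
    }

    // Return the index of a matching glob, or -1 if none match.
    int match(const std::string& s) const {
        for (const auto& [prefix, idxs] : buckets) {
            // Skip every glob in this bucket if the prefix can't match.
            if (s.compare(0, prefix.size(), prefix) != 0) continue;
            for (size_t i : idxs)
                if (fnmatch(globs[i].c_str(), s.c_str(), 0) == 0)
                    return static_cast<int>(i);
        }
        return -1;
    }
};
```

This still visits every bucket, so it only saves the per-glob fnmatch() calls; a genuine trie would descend the string character by character and skip whole subtrees of globs at once.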

It also might be possible in the general case to convert a set of globs into a corresponding regular expression.

Is performance of this matching really such an issue that you need to improve it? Have you considered if a more fundamental algorithmic rewrite might be better?


I would look to see if your application is a good fit for a shift-reduce parser like bison. They use lookup tables, which are a pain to set up or change and use more memory, but are very fast. Technically, it's not physically possible to do better than O(n) worst case, but depending on your globs, your strings, and your tokenizer, using a technique like this can make your average case much better than that, because it eliminates patterns that don't match without having to check each one each time.

