Algorithm to pick the correct RegEx among many as fast as possible
The Django web framework uses regular expressions for dispatching incoming requests.
Let's say the application is huge and there are a lot of regular expressions, say hundreds.For any incoming request, how can I determine which regular expressions are matched as quickly as possible? It is insane to iterate over the listed regul开发者_Go百科ar expressions one by one.
One option would be to construct a single deterministic automaton that can match all of the regular expressions in parallel. One way to do this would be as follows:
- Convert each regular expression to a nondeterministic automaton using one of many standard conversion algorithms.
- Introduce a new start state with ε-transitions to all of the start states of these automata. This effectively creates one automaton that runs all of the regular expression automata in parallel. Make sure to label each accepting state in the NFAs differently so that it's clear which regular expression each accepting state corresponds to.
- Using the subset construction, reduce this NFA down to a DFA. During this process, when labeling a state as accepting, remember which of the automata considered that state an accepting state.
- Generate a table-driven implementation of the DFA.
Now, whenever you receive a new message, you can run the table-driven DFA on that message, which effectively runs every single regular expression in parallel and returns which regular expressions, if any, match. This may have a large cost in memory, since the generated DFA might be pretty big, but the time to match any incoming regular expression would be linear in the size of the string to match.
If you have control over the application, why not include some metadata that indicates the type of Regex to use? You can use that metadata to then select the correct RegEx.
Combine the regular expressions into one by concatenating them together with | and named groups: (?<1>regex1)|(?<2>regex2)|(?<3>regex3...
Test the input once and determine which regular expression matched by checking the named groups.
精彩评论