开发者

Match a string from tons of patterns

I want to make a system for URL matching. It will work in this way:

The database will contains many patterns. and some metadata of the pattern like this:

pattern1, keyword 
pattern2, keyword
...
...

I have a input URL. like htttp://example.com/blabla/111/2222/detail.htm

The system will get the input and output the keyword of the most matched pattern for the input URL. There will be more than 20,000 requests per second.

The thing we need to design is the pattern and the database model. I've spend over 2 weeks in this system.

I'm thinking about match the URL in a tree.

All the nodes in the tree are able to do 2 kinds of output: Which node should continue matching the URL, or the node know which keyword should be applied to the URL.

Each node will be connected with a callback(a script stored in db). So different node will have different behav开发者_C百科ior.

But the thing we have is tons of patterns. I think I need to have a facility to convert patterns into thos "nodes". Or at least can build a tree with existing nodes with the patterns in the db.

I'm still thinking about the tree generating. But there should be some better way.

Any ideas will be very helpful. Thank you!!!


You need one of the industrial-strength string matching algorithms: http://en.wikipedia.org/wiki/String_searching_algorithm. I don't think a database-backed approach will work so well because it sounds like you need pattern matching, not exact prefix matching.

But if you are using prefix matching (the longest match from the beginning), then you can use a prefix trie, a trie. If I were you, I would use the database as a persistent store, but keep my matching trie in memory.


First, read this paper:

Regular Expression Matching Can Be Simple And Fast

In regexp notation, what you have is a simple "alternation":

pattern1|pattern2|pattern3|...

...with the additional constraint that you want to know which pattern matched. I believe augmenting the "Thompson NFA" to provide this detail would be straightforward. (Idea: Internally, put a unique magic token at the end of each pattern to uniquely identify the pattern. The magic token would match the empty string... So when your matching engine hits one, it immediately knows which pattern matched.)

This will give you the full power of regular expressions for your engine. Even if you do not want to adapt the NFA implementation from that paper, there is a huge amount of theoretical and practical work out there on regular expressions. So I would definitely start with the big alternation regexp and work from there.

To get better speed, you can try using a regular expression optimizer (something like Perl's Regexp::Optimizer) before converting your big alternation regexp into an NFA.

Or you might want to start with a generic regexp engine (like PCRE) and see if it is fast enough.

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜