Match a string from tons of patterns

2023-03-22 07:29 问答作者：

I want to make a system for URL matching. It will work in this way:

The database will contains many patterns. and some metadata of the pattern like this:

pattern1, keyword 
pattern2, keyword
...
...

I have a input URL. like htttp://example.com/blabla/111/2222/detail.htm

The system will get the input and output the keyword of the most matched pattern for the input URL. There will be more than 20,000 requests per second.

The thing we need to design is the pattern and the database model. I've spend over 2 weeks in this system.

I'm thinking about match the URL in a tree.

All the nodes in the tree are able to do 2 kinds of output: Which node should continue matching the URL, or the node know which keyword should be applied to the URL.

Each node will be connected with a callback(a script stored in db). So different node will have different behav开发者_C百科ior.

But the thing we have is tons of patterns. I think I need to have a facility to convert patterns into thos "nodes". Or at least can build a tree with existing nodes with the patterns in the db.

I'm still thinking about the tree generating. But there should be some better way.

Any ideas will be very helpful. Thank you!!!

You need one of the industrial-strength string matching algorithms: http://en.wikipedia.org/wiki/String_searching_algorithm. I don't think a database-backed approach will work so well because it sounds like you need pattern matching, not exact prefix matching.

But if you are using prefix matching (the longest match from the beginning), then you can use a prefix trie, a trie. If I were you, I would use the database as a persistent store, but keep my matching trie in memory.

First, read this paper:

Regular Expression Matching Can Be Simple And Fast

In regexp notation, what you have is a simple "alternation":

pattern1|pattern2|pattern3|...

...with the additional constraint that you want to know which pattern matched. I believe augmenting the "Thompson NFA" to provide this detail would be straightforward. (Idea: Internally, put a unique magic token at the end of each pattern to uniquely identify the pattern. The magic token would match the empty string... So when your matching engine hits one, it immediately knows which pattern matched.)

This will give you the full power of regular expressions for your engine. Even if you do not want to adapt the NFA implementation from that paper, there is a huge amount of theoretical and practical work out there on regular expressions. So I would definitely start with the big alternation regexp and work from there.

To get better speed, you can try using a regular expression optimizer (something like Perl's Regexp::Optimizer) before converting your big alternation regexp into an NFA.

Or you might want to start with a generic regexp engine (like PCRE) and see if it is fast enough.

继续阅读：pattern-matching

Match a string from tons of patterns

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？