Longest Prefix Matches for URLs

2023-02-19 14:25 Q&A author:

I need information about any standard Python package that can be used for "longest prefix match" on URLs. I have gone through two packages, http://packages.python.org/PyTrie/#pytrie.StringTrie and http://pypi.python.org/pypi/trie/0.1.1, but they don't seem to be useful for the longest prefix match task on URLs.

Example: if my set has these URLs: 1->http://www.google.com/mail, 2->http://www.google.com/document, 3->http://www.facebook.com, etc.

Now if I search for 'http://www.google.com/doc' it should return 2, and a search for 'http://www.face' should return 3.

I wanted to confirm whether there is any standard Python package that can help me do this, or whether I should implement a Trie for prefix matching.

I am not looking for a regular-expression kind of solution, since it does not scale as the number of URLs increases.

Thanks a lot.

Performance comparison

`suffixtree` vs. `pytrie` vs. `trie` vs. `datrie` vs. `startswith` -functions

Setup

The recorded time is the minimum among 3 repetitions of 1000 searches. Trie construction time is included and spread among all searches. The search is performed on collections of hostnames ranging from 1 to 1000000 items.

Three types of search string:

non_existent_key - there is no match for the string
rare_key - around 20 in a million
frequent_key - number of occurrences is comparable to the collection size
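The timing scheme described in the setup (best of 3 repetitions of 1000 searches) can be sketched with the standard `timeit` module. This is only an illustration, not the benchmark code itself: the `hostnames` list and the search string are made up, and the `startswith`-based function has no construction step to spread over the searches.

```python
import timeit


def longest_prefix_match(search, urllist):
    # Linear scan: keep every entry that starts with the search string.
    matches = [url for url in urllist if url.startswith(search)]
    return max(matches, key=len) if matches else None


# Made-up stand-in for the hostname collection (assumption, not real data).
hostnames = ["host%d.example.com" % i for i in range(100)]

REPEAT = 3     # number of repetitions
NUMBER = 1000  # searches per repetition

timings = timeit.repeat(
    lambda: longest_prefix_match("host9", hostnames),
    repeat=REPEAT,
    number=NUMBER,
)

# Keep the minimum of the repetitions, then convert to time per search.
best = min(timings)
per_search_us = best / NUMBER * 1e6
print("best of %d: %.1f microseconds per search" % (REPEAT, per_search_us))
```

Taking the minimum rather than the mean reduces the influence of unrelated system activity on the measurement.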

Results

Maximum memory consumption for a million urls:

| function    | memory, GiB | ratio (vs. suffix_tree) |
|-------------+-------------+-------------------------|
| suffix_tree |       0.853 |                     1.0 |
| pytrie      |       3.383 |                     4.0 |
| trie        |       3.803 |                     4.5 |
| datrie      |       0.194 |                     0.2 |
| startswith  |       0.069 |                     0.1 |

To reproduce the results, run the trie benchmark code.

rare_key/nonexistent_key case

If the number of urls is less than 10000 then datrie is the fastest; for N>10000 suffixtree is faster. startswith is significantly slower on average.

Axes of the plots:
- the vertical (time) scale is ~1 second (2**20 microseconds)
- the horizontal axis shows the total number of urls in each case: N = 1, 10, 100, 1000, 10000, 100000, and 1000000 (a million).

frequent_key

Up to N=100000 datrie is the fastest (for a million urls the time is dominated by the trie construction time).

Most of the time is spent finding the longest match among the matches found, so, as expected, all functions behave similarly.


startswith - time performance is independent of the type of key.

trie and pytrie behave similarly to each other.

Performance without trie construction time

datrie - the fastest, decent memory consumption
startswith is at an even greater disadvantage here because the other approaches are no longer penalized by the time it takes to build a trie.
datrie, pytrie, trie - almost O(1) (constant time) for rare/non_existent key


Fitting (approximating) polynomials of known functions for comparison (same log/log scale as in the figures):

| Fitting polynomial           | Function          |
|------------------------------+-------------------|
| 0.15  log2(N)   +      1.583 | log2(N)           |
| 0.30  log2(N)   +      3.167 | log2(N)*log2(N)   |
| 0.50  log2(N)   +  1.111e-15 | sqrt(N)           |
| 0.80  log2(N)   +  7.943e-16 | N**0.8            |
| 1.00  log2(N)   +  2.223e-15 | N                 |
| 2.00  log2(N)   +  4.446e-15 | N*N               |

This example is good for small url lists but does not scale well.

def longest_prefix_match(search, urllist):
    matches = [url for url in urllist if url.startswith(search)]
    if matches:
        return max(matches, key=len)
    else:
        raise Exception("Not found")
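If an extra dependency is not wanted, one possible middle ground between the linear scan and a trie is a sorted list searched with the standard `bisect` module. This is a different technique from the ones benchmarked in this thread and has not been measured here; the function name `longest_prefix_match_sorted` is made up for this sketch. It relies on the fact that in a sorted list all strings sharing a prefix sit next to each other, starting at the position where the prefix itself would be inserted.

```python
import bisect


def longest_prefix_match_sorted(search, sorted_urls):
    """Return the longest URL in sorted_urls that starts with search,
    or None if there is no such URL. sorted_urls must be sorted."""
    # First position whose entry is >= search; every URL that has
    # `search` as a prefix forms a contiguous run starting here.
    i = bisect.bisect_left(sorted_urls, search)
    best = None
    while i < len(sorted_urls) and sorted_urls[i].startswith(search):
        if best is None or len(sorted_urls[i]) > len(best):
            best = sorted_urls[i]
        i += 1
    return best


urls = sorted([
    'http://www.google.com/mail',
    'http://www.google.com/document',
    'http://www.facebook.com',
])

print(longest_prefix_match_sorted('http://www.google.com/doc', urls))
print(longest_prefix_match_sorted('http://www.face', urls))
print(longest_prefix_match_sorted('http://fail', urls))
```

Locating the start of the run takes O(log N) comparisons; the loop then visits only the matching entries, so a very frequent prefix still costs time proportional to the number of matches.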

An implementation using the trie module.

import trie


def longest_prefix_match(prefix_trie, search):
    # There may well be a more elegant way to do this without using
    # "hidden" method _getnode.
    try:
        return list(node.value for node in prefix_trie._getnode(search).walk())
    except KeyError:
        return list()

url_list = [ 
    'http://www.google.com/mail',
    'http://www.google.com/document',
    'http://www.facebook.com',
]

url_trie = trie.Trie()

for url in url_list:
    url_trie[url] = url 

searches = ("http", "http://www.go", "http://www.fa", "http://fail")

for search in searches:
    print("'%s' -> %s" % (search, longest_prefix_match(url_trie, search)))

Result:

'http' -> ['http://www.facebook.com', 'http://www.google.com/document', 'http://www.google.com/mail']
'http://www.go' -> ['http://www.google.com/document', 'http://www.google.com/mail']
'http://www.fa' -> ['http://www.facebook.com']
'http://fail' -> []

Or, using PyTrie, which gives the same result but with the lists ordered differently:

from pytrie import StringTrie


url_list = [ 
    'http://www.google.com/mail',
    'http://www.google.com/document',
    'http://www.facebook.com',
]

url_trie = StringTrie()

for url in url_list:
    url_trie[url] = url 

searches = ("http", "http://www.go", "http://www.fa", "http://fail")

for search in searches:
    print("'%s' -> %s" % (search, url_trie.values(prefix=search)))

I'm beginning to think a radix tree / Patricia tree would be better from a memory usage point of view. This is what a radix tree would look like:

[figure: radix tree]

Whereas the trie looks more like:

[figure: trie]
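The one-node-per-character layout of a plain trie can be sketched with nested dictionaries. This is a minimal illustration of the data structure, not the implementation used by the trie or PyTrie packages; the names `build_trie`, `urls_with_prefix` and the `_END` sentinel are invented for the example.

```python
_END = object()  # sentinel key marking "a stored URL ends at this node"


def build_trie(urls):
    """Build a nested-dict trie with one level per character."""
    root = {}
    for url in urls:
        node = root
        for ch in url:
            node = node.setdefault(ch, {})
        node[_END] = url
    return root


def urls_with_prefix(root, prefix):
    """Return every stored URL that starts with prefix (order not defined)."""
    node = root
    for ch in prefix:
        if ch not in node:
            return []
        node = node[ch]
    # Walk the whole subtree below the prefix node, collecting stored URLs.
    found, stack = [], [node]
    while stack:
        current = stack.pop()
        for key, child in current.items():
            if key is _END:
                found.append(child)
            else:
                stack.append(child)
    return found


url_trie = build_trie([
    'http://www.google.com/mail',
    'http://www.google.com/document',
    'http://www.facebook.com',
])

print(sorted(urls_with_prefix(url_trie, 'http://www.go')))
print(max(urls_with_prefix(url_trie, 'http://www.face'), key=len))
print(urls_with_prefix(url_trie, 'http://fail'))
```

Every character of every URL gets its own dictionary here, which is the memory cost a radix tree avoids by storing a run of characters that has no branches on a single edge.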

The function below returns the zero-based index of the longest match. Other useful information can easily be extracted as well.

from os.path import commonprefix as oscp

def longest_prefix(s, slist):
    pfx_idx = ((oscp([s, url]), i) for i, url in enumerate(slist))
    len_pfx_idx = map(lambda t: (len(t[0]), t[0], t[1]), pfx_idx)
    length, pfx, idx = max(len_pfx_idx)
    return idx

slist = [
    'http://www.google.com/mail',
    'http://www.google.com/document',
    'http://www.facebook.com',
]

print(longest_prefix('http://www.google.com/doc', slist))
print(longest_prefix('http://www.face', slist))

If you are willing to trade RAM for time performance, then SuffixTree might be useful. It has nice algorithmic properties; for example, it allows solving the longest common substring problem in linear time.

If you always search for a prefix rather than an arbitrary substring then you could add a unique prefix while populating SubstringDict():

from SuffixTree import SubstringDict

substr_dict = SubstringDict()
for url in URLS: # urls must be ascii (valid urls are)
    assert '\n' not in url
    substr_dict['\n'+url] = url #NOTE: assume that '\n' can't be in a url

def longest_match(url_prefix, _substr_dict=substr_dict):
    matches = _substr_dict['\n'+url_prefix]
    return max(matches, key=len) if matches else ''

Such usage of SuffixTree seems suboptimal, but on the data I've tried it is 20-150 times faster (not counting SubstringDict()'s construction time) than @StephenPaulger's solution [which is based on .startswith()], and it could be good enough.

To install SuffixTree, run:

pip install SuffixTree -f https://hkn.eecs.berkeley.edu/~dyoo/python/suffix_trees
