A Way to Group URLs
I have a list of URLs, each associated with a set of numbers. For example:
http://example.com/ - 0
http://example.com/login/ - 1
http://example.com/login/verify/ - 2
http://example.com/user123/home/ - 3
http://example.com/user254/home/ - 3
http://example.com/user123/edit/ - 4
I want some method to 'compress' this, maybe using regexp -- the catch is that for all URLs not in the list I can assume they map to whatever number I want.
So I'd like an output like this, where any URL is checked against each expression in this order and given a number according to the first match:
http://example.com/login/verify* - 2
http://example.com/login/* - 1
http://example.com/*/home/ - 3
http://example.com/*/edit/ - 4
http://example.com/* - 0
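A minimal sketch of how such an ordered list could be evaluated in Python, assuming the patterns use only `*` as a wildcard. It relies on the standard `fnmatch` module, whose `*` also matches `/`, so it is looser than a per-segment wildcard. The names `RULES` and `classify` are made up for illustration. The patterns follow the list above; the `edit` rule needs its trailing slash to match the sample URL `http://example.com/user123/edit/`.

```python
from fnmatch import fnmatchcase

# Ordered (pattern, number) rules; the first matching pattern wins.
RULES = [
    ("http://example.com/login/verify*", 2),
    ("http://example.com/login/*", 1),
    ("http://example.com/*/home/", 3),
    ("http://example.com/*/edit/", 4),
    ("http://example.com/*", 0),
]

def classify(url):
    """Return the number of the first rule whose pattern matches, else None."""
    for pattern, number in RULES:
        if fnmatchcase(url, pattern):
            return number
    return None
```

Because the catch-all rule comes last, any other URL on the same host falls through to 0, which is allowed since unlisted URLs may map to anything.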
Note: There are multiple possible outputs like this that are acceptable. Also, I considered something like a tree, where each node contains an expression like one of the above, and the leaves at the end are the actual URLs to check against.
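The tree idea might look roughly like the sketch below: a nested dict keyed by path segment, with `*` as a wildcard segment. This is only one possible reading of it. The names `build_tree`, `lookup`, `TREE` and `VALUE` are hypothetical, the host is ignored, and there is no backtracking (an exact segment is preferred over the wildcard and never revisited).

```python
from urllib.parse import urlsplit

VALUE = "#value"  # marker key for the number stored at a node (hypothetical name)

def build_tree(rules):
    """Build a nested-dict tree keyed by path segment; "*" is a wildcard segment."""
    root = {}
    for path, number in rules:
        node = root
        for segment in filter(None, path.split("/")):
            node = node.setdefault(segment, {})
        node[VALUE] = number
    return root

def lookup(tree, url):
    """Walk the URL's path segments and return the deepest number found."""
    node = tree
    best = node.get(VALUE)
    for segment in filter(None, urlsplit(url).path.split("/")):
        child = node.get(segment)
        if child is None:
            child = node.get("*")  # fall back to the wildcard branch
        if child is None:
            break
        node = child
        best = node.get(VALUE, best)
    return best

TREE = build_tree([
    ("/", 0),
    ("/login/", 1),
    ("/login/verify/", 2),
    ("/*/home/", 3),
    ("/*/edit/", 4),
])
```

With this layout, `lookup(TREE, "http://example.com/user254/home/")` follows the `*` branch and then `home`, while a path that leaves the tree early keeps the number of the deepest node it reached.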
Another note: I said the URLs are mapped to numbers for simplicity's sake. Actually, each is mapped to a set of numbers, and the whole set has to match. Just in case that helps someone come up with a solution (though I doubt it).
It looks like it will be easiest to use a separate regex for each URL pattern. They would probably look something like this:
http://example\.com/login/verify
http://example\.com/login
http://example\.com/[^/]+/home
http://example\.com/[^/]+/edit
http://example\.com
Try to match the URL to each of these in order, then when it matches look up the number (or set) that corresponds to that match.
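As a rough sketch, the in-order check could be written like this in Python. `PATTERNS` and `first_match` are names made up here. `re.match` only anchors at the start of the string, so each pattern acts as a prefix test.

```python
import re

# Regexes from the list above, paired with the numbers from the question.
# Order matters: more specific patterns must come before more general ones.
PATTERNS = [
    (re.compile(r"http://example\.com/login/verify"), 2),
    (re.compile(r"http://example\.com/login"), 1),
    (re.compile(r"http://example\.com/[^/]+/home"), 3),
    (re.compile(r"http://example\.com/[^/]+/edit"), 4),
    (re.compile(r"http://example\.com"), 0),
]

def first_match(url):
    """Return the number paired with the first regex that matches, else None."""
    for regex, number in PATTERNS:
        if regex.match(url):
            return number
    return None
```

The second element of each pair could just as well be a set of numbers, since it is only returned and never inspected.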
Alternatively you could use a single regex with capturing groups to determine which URL was actually matched, for example:
http://example\.com(?:(/login/verify)|(/login)|(/[^/]+/home)|(/[^/]+/edit))?
Here is a Rubular that shows how you could use the previous regex: http://www.rubular.com/r/tklqMs8U1Z
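For reference, a small Python sketch of reading the result out of that combined regex. It assumes `Match.lastindex` is enough to tell which group matched, which holds here because at most one of the alternatives can participate. `COMBINED`, `NUMBERS` and `classify_combined` are illustrative names.

```python
import re

COMBINED = re.compile(
    r"http://example\.com(?:(/login/verify)|(/login)|(/[^/]+/home)|(/[^/]+/edit))?"
)

# NUMBERS[i] is the number for capturing group i; index 0 is the fallback
# used when the optional group did not take part in the match.
NUMBERS = [0, 2, 1, 3, 4]

def classify_combined(url):
    """Return the number for whichever group matched, or None if nothing matched."""
    match = COMBINED.match(url)
    if match is None:
        return None
    # lastindex is None when no capturing group matched.
    return NUMBERS[match.lastindex or 0]
```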
edit: Here is a Python function that does what I think you're looking for.
import re

def url_match(url):
    base = "http://example.com"
    # Ordered (ending, number) pairs; "*" stands for one path segment.
    endings = [("/login/verify", 2), ("/login", 1), ("/*/home", 3), ("/*/edit", 4), ("", 0)]
    # Escape each ending, turn the escaped "*" into "[^/]+", and wrap it in a capturing group.
    re_endings = ["(%s)" % re.escape(x[0]).replace(r"\*", "[^/]+") for x in endings]
    pattern = re.compile("%s(?:%s)" % (re.escape(base), "|".join(re_endings)))
    match = pattern.match(url)
    if match is None:
        return None
    # The first non-None group identifies which ending matched.
    index = [i for i, x in enumerate(match.groups()) if x is not None]
    return endings[index[0]][1]

url_match("http://example.com") # 0
url_match("http://example.com/login") # 1
url_match("http://example.com/login/verify") # 2
url_match("http://example.com/user123/home") # 3
url_match("http://example.com/user123/edit") # 4
url_match("http://sample.com") # None
What you are asking for is clustering of the URLs based on their web path. You can check out K-means clustering of text documents, which explains this in detail.