Finding whether a string starts with one of a list's variable-length prefixes
I need to find out whether a name starts with any of a list's prefixes and then remove it, like:
if name[:2] in ["i_", "c_", "m_", "l_", "d_", "t_", "e_", "b_"]:
name = name[2:]
The above only works for list prefixes with a length of two. I need the same functionality for variable-length prefixes.
How is it done efficiently (little code and good performance)?
A for loop iterating over each prefix and then checking name.startswith(prefix)
to finally slice the name 开发者_运维技巧according to the length of the prefix works, but it's a lot of code, probably inefficient, and "non-Pythonic".
Does anybody have a nice solution?
str.startswith(prefix[, start[, end]])¶
Return True if string starts with the prefix, otherwise return False. prefix can also be a tuple of prefixes to look for. With optional start, test string beginning at that position. With optional end, stop comparing string at that position.
$ ipython
Python 3.5.2 (default, Nov 23 2017, 16:37:01)
Type 'copyright', 'credits' or 'license' for more information
IPython 6.4.0 -- An enhanced Interactive Python. Type '?' for help.
In [1]: prefixes = ("i_", "c_", "m_", "l_", "d_", "t_", "e_", "b_")
In [2]: 'test'.startswith(prefixes)
Out[2]: False
In [3]: 'i_'.startswith(prefixes)
Out[3]: True
In [4]: 'd_a'.startswith(prefixes)
Out[4]: True
A bit hard to read, but this works:
name=name[len(filter(name.startswith,prefixes+[''])[0]):]
for prefix in prefixes:
if name.startswith(prefix):
name=name[len(prefix):]
break
Regexes will likely give you the best speed:
prefixes = ["i_", "c_", "m_", "l_", "d_", "t_", "e_", "b_", "also_longer_"]
re_prefixes = "|".join(re.escape(p) for p in prefixes)
m = re.match(re_prefixes, my_string)
if m:
my_string = my_string[m.end()-m.start():]
If you define prefix to be the characters before an underscore, then you can check for
if name.partition("_")[0] in ["i", "c", "m", "l", "d", "t", "e", "b", "foo"] and name.partition("_")[1] == "_":
name = name.partition("_")[2]
What about using filter
?
prefs = ["i_", "c_", "m_", "l_", "d_", "t_", "e_", "b_"]
name = list(filter(lambda item: not any(item.startswith(prefix) for prefix in prefs), name))
Note that the comparison of each list item against the prefixes efficiently halts on the first match. This behaviour is guaranteed by the any
function that returns as soon as it finds a True
value, eg:
def gen():
print("yielding False")
yield False
print("yielding True")
yield True
print("yielding False again")
yield False
>>> any(gen()) # last two lines of gen() are not performed
yielding False
yielding True
True
Or, using re.match
instead of startswith
:
import re
patt = '|'.join(["i_", "c_", "m_", "l_", "d_", "t_", "e_", "b_"])
name = list(filter(lambda item: not re.match(patt, item), name))
Regex, tested:
import re
def make_multi_prefix_matcher(prefixes):
regex_text = "|".join(re.escape(p) for p in prefixes)
print repr(regex_text)
return re.compile(regex_text).match
pfxs = "x ya foobar foo a|b z.".split()
names = "xenon yadda yeti food foob foobarre foo a|b a b z.yx zebra".split()
matcher = make_multi_prefix_matcher(pfxs)
for name in names:
m = matcher(name)
if not m:
print repr(name), "no match"
continue
n = m.end()
print repr(name), n, repr(name[n:])
Output:
'x|ya|foobar|foo|a\\|b|z\\.'
'xenon' 1 'enon'
'yadda' 2 'dda'
'yeti' no match
'food' 3 'd'
'foob' 3 'b'
'foobarre' 6 're'
'foo' 3 ''
'a|b' 3 ''
'a' no match
'b' no match
'z.yx' 2 'yx'
'zebra' no match
When it comes to search and efficiency always thinks of indexing techniques to improve your algorithms. If you have a long list of prefixes you can use an in-memory index by simple indexing the prefixes by the first character into a dict
.
This solution is only worth if you had a long list of prefixes and performance becomes an issue.
pref = ["i_", "c_", "m_", "l_", "d_", "t_", "e_", "b_"]
#indexing prefixes in a dict. Do this only once.
d = dict()
for x in pref:
if not x[0] in d:
d[x[0]] = list()
d[x[0]].append(x)
name = "c_abcdf"
#lookup in d to only check elements with the same first character.
result = filter(lambda x: name.startswith(x),\
[] if name[0] not in d else d[name[0]])
print result
This edits the list on the fly, removing prefixes. The break
skips the rest of the prefixes once one is found for a particular item.
items = ['this', 'that', 'i_blah', 'joe_cool', 'what_this']
prefixes = ['i_', 'c_', 'a_', 'joe_', 'mark_']
for i,item in enumerate(items):
for p in prefixes:
if item.startswith(p):
items[i] = item[len(p):]
break
print items
Output
['this', 'that', 'blah', 'cool', 'what_this']
Could use a simple regex.
import re
prefixes = ("i_", "c_", "longer_")
re.sub(r'^(%s)' % '|'.join(prefixes), '', name)
Or if anything preceding an underscore is a valid prefix:
name.split('_', 1)[-1]
This removes any number of characters before the first underscore.
import re
def make_multi_prefix_replacer(prefixes):
if isinstance(prefixes,str):
prefixes = prefixes.split()
prefixes.sort(key = len, reverse=True)
pat = r'\b(%s)' % "|".join(map(re.escape, prefixes))
print 'regex patern :',repr(pat),'\n'
def suber(x, reg = re.compile(pat)):
return reg.sub('',x)
return suber
pfxs = "x ya foobar yaku foo a|b z."
replacer = make_multi_prefix_replacer(pfxs)
names = "xenon yadda yeti yakute food foob foobarre foo a|b a b z.yx zebra".split()
for name in names:
print repr(name),'\n',repr(replacer(name)),'\n'
ss = 'the yakute xenon is a|bcdf in the barfoobaratu foobarii'
print '\n',repr(ss),'\n',repr(replacer(ss)),'\n'
精彩评论