Delete all characters in a multiline string up to a given pattern
Using Python I need to delete all characters in a multiline string up to the first occurrence of a given pattern. In Perl this can be done using regular expressions with something like:
#remove all chars up to first occurrence of cat or dog or rat
$pattern = 'cat|dog|rat';
$pagetext =~ s/(.*?)($pattern)/$2/xms;
What's the best way to do it in Python?
>>> import re
>>> s = 'hello cat!'
>>> m = re.search('cat|dog|rat', s)
>>> s[m.start():]
'cat!'
Of course you'll need to account for the case where there's no match in a real solution.
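Here is a minimal sketch of handling the no-match case. It assumes you want the string back unchanged when none of the alternatives occur. The name drop_before is just for illustration.

```python
import re

def drop_before(pattern, s):
    """Return s with everything before the first match of pattern removed.

    If pattern does not match, s is returned unchanged (an assumption;
    raising an error or returning '' may suit other callers better).
    """
    m = re.search(pattern, s)
    return s[m.start():] if m else s

print(drop_before('cat|dog|rat', 'hello cat!'))   # cat!
print(drop_before('cat|dog|rat', 'hello bird!'))  # hello bird!
```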
Or, more cleanly:
>>> import re
>>> s = 'hello cat!'
>>> p = 'cat|dog|rat'
>>> re.sub('.*?(?=%s)' % p, '', s, 1)
'cat!'
For multiline input, pass the re.DOTALL flag so that . also matches newlines.
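To see what the flag changes, here is a small sketch on a made-up three-line string (the sample text is an assumption, not from the question). Without re.DOTALL only the text on the matching line is removed; with it, everything before the match goes.

```python
import re

s = 'line one\nline two\nhello cat!\n'
p = 'cat|dog|rat'
regex = '.*?(?=%s)' % p

# Without DOTALL, '.' cannot cross a newline, so the match starts on the
# line that contains the pattern and earlier lines are left alone.
without_dotall = re.sub(regex, '', s, count=1)

# With DOTALL, '.' matches newlines too, so the match starts at position 0.
with_dotall = re.sub(regex, '', s, count=1, flags=re.DOTALL)

print(repr(without_dotall))  # 'line one\nline two\ncat!\n'
print(repr(with_dotall))     # 'cat!\n'
```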
You want to delete all characters preceding the first occurrence of a pattern; as an example, you give "cat|dog|rat".
Code that achieves this using re:
re.sub("(?s).*?(cat|dog|rat)", "\\1", input_text, 1)
or, if you'll be reusing this regular expression:
rex= re.compile("(?s).*?(cat|dog|rat)")
result= rex.sub("\\1", input_text, 1)
Note the non-greedy .*?. The initial (?s) lets . match newline characters too, so any lines before the matching word are removed as well.
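A quick sketch of why the non-greedy form matters, using a sample sentence chosen for illustration. The greedy .* runs to the last occurrence, while .*? stops at the first.

```python
import re

sample = "I have a dog and a cat"

# Non-greedy: stops at the first alternative that matches ('dog').
lazy = re.sub(r"(?s).*?(cat|dog|rat)", r"\1", sample, count=1)

# Greedy: consumes as much as possible, so only the last match ('cat') survives.
greedy = re.sub(r"(?s).*(cat|dog|rat)", r"\1", sample, count=1)

print(lazy)    # dog and a cat
print(greedy)  # cat
```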
Examples:
>>> input_text= "I have a dog and a cat"
>>> re.sub(".*?(cat|dog|rat)", "\\1", input_text, 1)
'dog and a cat'
>>> input_text= "I have no animals!"
>>> re.sub("(?s).*?(cat|dog|rat)", "\\1", input_text, 1)
'I have no animals!'
>>> input_text= "This is irrational"
>>> re.sub("(?s).*?(cat|dog|rat)", "\\1", input_text, 1)
'rational'
In case you want to do the conversion only for the words cat, dog and rat, you'll have to change the regex into:
>>> re.sub(r"(?s).*?\b(cat|dog|rat)\b", "\\1", input_text, 1)
'This is irrational'
Non-regex way:
>>> s='hello cat!'
>>> pat=['cat','dog','rat']
>>> for i in pat:
...     m=s.find(i)
...     if m != -1: print(s[m:])
...
cat!
Something like this should do what you want:
import re
text = ' sdfda faf foo zing baz bar'
match = re.search('foo|bar', text)
if match:
    print(text[match.start():])  # ==> 'foo zing baz bar'
Another option is to use a lookahead, as in s/.*?(?=$pattern)//xs:
re.sub(r'(?s).*?(?=cat|dog|rat)', '', text, 1)
Non-regex way:
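# Cut at the earliest occurrence of any option, not at the first option found.
indexes = [text.find(option) for option in 'cat dog rat'.split()]
indexes = [i for i in indexes if i != -1]  # keep only options that occur
if indexes:
    text = text[min(indexes):]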
The non-regex way is about 5 times faster than re.sub (for some input), and roughly on par with re.search:
$ python -mtimeit -s'from drop_until_word import drop_re, text, options;' \
> 'drop_re(text, options)'
1000 loops, best of 3: 1.06 msec per loop
$ python -mtimeit -s'from drop_until_word import drop_search, text, options;'\
> 'drop_search(text, options)'
10000 loops, best of 3: 184 usec per loop
$ python -mtimeit -s'from drop_until_word import drop_find, text, options;' \
> 'drop_find(text, options)'
1000 loops, best of 3: 207 usec per loop
Where drop_until_word.py is:
import re

def drop_re(text, options):
    return re.sub(r'(?s).*?(?='+'|'.join(map(re.escape, options))+')', '',
                  text, 1)

def drop_re2(text, options):
    return re.sub(r'(?s).*?('+'|'.join(map(re.escape, options))+')', '\\1',
                  text, 1)

def drop_search(text, options):
    m = re.search('|'.join(map(re.escape, options)), text)
    return text[m.start():] if m else text

def drop_find(text, options):
    indexes = [i for i in (text.find(option) for option in options) if i != -1]
    return text[min(indexes):] if indexes else text

text = open('/usr/share/dict/words').read()
options = 'cat dog rat'.split()

def test():
    assert drop_find(text, options) == drop_re(text, options) \
        == drop_re2(text, options) == drop_search(text, options)

    txt = 'dog before cat'
    r = txt
    for f in [drop_find, drop_re, drop_re2, drop_search]:
        assert r == f(txt, options), f.__name__

if __name__=="__main__":
    test()
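If /usr/share/dict/words isn't available on your system, a rough sketch like the following times two of the functions from within Python using the timeit module. The synthetic text is an assumption made for illustration, so the numbers will not be comparable to the timings above.

```python
import re
import timeit

def drop_search(text, options):
    m = re.search('|'.join(map(re.escape, options)), text)
    return text[m.start():] if m else text

def drop_find(text, options):
    indexes = [i for i in (text.find(option) for option in options) if i != -1]
    return text[min(indexes):] if indexes else text

# Synthetic input (an assumption): long filler with one match near the end.
text = 'x' * 100000 + ' dog ' + 'y' * 10
options = 'cat dog rat'.split()

# Both approaches should agree before their timings are compared.
assert drop_search(text, options) == drop_find(text, options)

for f in (drop_search, drop_find):
    seconds = timeit.timeit(lambda: f(text, options), number=100)
    print('%s: %.1f usec per call' % (f.__name__, seconds / 100 * 1e6))
```

Results depend heavily on where the first match sits and how many options there are, so measure with input that resembles your own.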