Python regex matching in conditionals
I am parsing a file and want to check each line against a few complicated regexes. Something like this:
if re.match(regex1, line): do stuff
elif re.match(regex2, line): do other stuff
elif re.match(regex3, line): do still more stuff
...
Of course, to do the stuff, I need the match objects. I can only think of three possibilities, each of which leaves something to be desired.
if re.match(regex1, line):
    m = re.match(regex1, line)
    do stuff
elif re.match(regex2, line):
    m = re.match(regex2, line)
    do other stuff
...
which requires doing the complicated matching twice (these are long files and long regex :/)
m = re.match(regex1, line)
if m: do stuff
else:
    m = re.match(regex2, line)
    if m: do other stuff
    else:
        ...
which gets terrible as I indent further and further.
while True:
    m = re.match(regex1, line)
    if m:
        do stuff
        break
    m = re.match(regex2, line)
    if m:
        do other stuff
        break
    ...
which just looks weird.
What's the right way to do this?
You could define a function for the action required by each regex and do something like
def dostuff(m):
    stuff
def dootherstuff(m):
    otherstuff
def doevenmorestuff(m):
    evenmorestuff
actions = ((regex1, dostuff), (regex2, dootherstuff), (regex3, doevenmorestuff))
for regex, action in actions:
    m = re.match(regex, line)
    if m:
        action(m)  # pass the match object so the handler can use it
        break
for patt in (regex1, regex2, regex3):
    match = patt.match(line)
    if match:
        if patt == regex1:
            # some handling
        elif patt == regex2:
            # more
        elif patt == regex3:
            # more
        break
I like Tim's answer because it separates out the per-regex matching code to keep things simple. For my answer, I wouldn't put more than a line or two of code for each match, and if you need more, call a separate method.
In this particular case there appears to be no convenient way to do this in Python. If Python would accept the syntax:
if (m = re.match(pattern,string)):
    text = m.group(1)
then all would be fine, but apparently you cannot do that.
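For what it's worth, Python 3.8 and later do accept an assignment inside a condition through the `:=` operator (PEP 572). A minimal sketch, assuming hypothetical patterns `regex1` and `regex2` and a helper named `describe`, none of which come from the original question:

```python
import re

# Hypothetical patterns, chosen only for illustration.
regex1 = re.compile(r"(\d+) ravens")
regex2 = re.compile(r"Good (\w+)")

def describe(line):
    # Each branch binds the match object to m and tests it in one expression.
    if (m := regex1.match(line)):
        return "count: " + m.group(1)
    elif (m := regex2.match(line)):
        return "greeting: " + m.group(1)
    return "no match"
```

On older interpreters this is a syntax error, so the other workarounds in this thread still apply there.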
First off, do you really need to use regexps for your matching? Where I would use regexps in, e.g., Perl, I'll often use string methods in Python (find, startswith, etc).
If you really need to use regexps, you can make a simple search function that does the search, and if the match is returned, sets a store object to keep your match around before returning True.
e.g.,
def search(pattern, s, store):
    match = re.search(pattern, s)
    store.match = match
    return match is not None
class MatchStore(object):
    pass   # irrelevant, any object with a 'match' attr would do
where = MatchStore()
if search(pattern1, s, where):
    # pattern1 matched, matchobj in where.match
elif search(pattern2, s, where):
    # pattern2 matched, matchobj in where.match
...
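The first suggestion in this answer (string methods instead of regexps) might look like the sketch below. The line formats it checks for, a `#` comment prefix and `key=value` pairs, are hypothetical and only stand in for whatever the real file contains:

```python
def classify(line):
    """Classify a line using str methods only (no regex)."""
    if line.startswith("#"):
        return ("comment", line[1:].strip())
    # partition returns (before, sep, after); sep is "" when "=" is absent
    key, sep, value = line.partition("=")
    if sep:
        return ("setting", (key.strip(), value.strip()))
    return ("other", line)
```

This only helps when the patterns are simple enough to express as prefix or substring tests; the question mentions complicated regexes, so it may not apply.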
Your last suggestion is slightly more Pythonic when wrapped up in a function:
def parse_line():
    m = re.match(regex1, line)
    if m:
        do stuff
        return
    m = re.match(regex2, line)
    if m:
        do other stuff
        return
    ...
That said, you can get closer to what you want using a simple container class with some operator overloading:
class ValueCache():
    """A simple container with a returning assignment operator."""
    def __init__(self, value=None):
        self.value = value
    def __repr__(self):
        return "ValueCache({})".format(self.value)
    def set(self, value):
        self.value = value
        return value
    def __call__(self):
        return self.value
    def __lshift__(self, value):
        return self.set(value)
    def __rrshift__(self, value):
        return self.set(value)
match = ValueCache()
if (match << re.match(regex1, line)):
    do stuff with match()
elif (match << re.match(regex2, line)):
    do other stuff with match()
You can define a local function that accepts a regex, tests it against your input, and stores the result to a closure-scoped variable:
match = None
def matches(pattern):
    nonlocal match, line
    match = re.match(pattern, line)
    return match
if matches(regex1):
    # do stuff with `match`
elif matches(regex2):
    # do other stuff with `match`
I'm not sure how Pythonic that approach is, but it's the cleanest way I've found to do regex matching in an if-elif-else chain and preserve the match objects.
Note that this approach will only work in Python 3.0+ as it requires the PEP 3104 nonlocal statement. In earlier Python versions there's no clean way for a function to assign to a variable in a non-global parent scope.
It's also worth noting that if you have a big enough file that you're worried about running a regex twice for each line, you should also be pre-compiling them with re.compile and passing the resulting regex object to your check function instead of the raw string.
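A rough sketch of the pre-compiling advice, assuming a hypothetical `PATTERNS` table and a helper named `first_match` (neither is from the answer above):

```python
import re

# Hypothetical patterns; compile once, outside the per-line loop.
PATTERNS = [
    ("number", re.compile(r"\s*(\d+)")),
    ("word", re.compile(r"\s*([A-Za-z]+)")),
]

def first_match(line):
    """Return (label, match) for the first pattern that matches, else (None, None)."""
    for label, pattern in PATTERNS:
        m = pattern.match(line)  # compiled pattern objects have their own match()
        if m:
            return label, m
    return None, None
```

The `re` module also caches recently compiled patterns internally, so the saving from compiling up front is mostly the per-call cache lookup. It is probably worth measuring before relying on it.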
I would break your regex up into smaller components and test the simple parts first, running the longer matches only when those succeed.
Something like:
if re.match(simplepart,line):
    if re.match(complexregex, line):
        do stuff
elif re.match(othersimple, line):
    if re.match(complexother, line):
        do other stuff
Why not use a dictionary/switch statement?
def action1(match_object, line):
    do the stuff 1
def action2(match_object, line):
    do the stuff 2
regex_action_dict = {regex1 : action1, regex2 : action2}
for regex, action in regex_action_dict.items():
    match_object = re.match(regex, line)
    if match_object:
        action(match_object, line)
        break  # stop at the first match, like the if/elif chain
FWIW, I've stressed over the same thing, and I usually settle for the 2nd form (nested elses) or some variation. I don't think you'll find anything much better in general, if you're looking to optimize readability (many of these answers seem significantly less readable than your candidates to me).
Sometimes if you're in an outer loop or a short function, you can use a variation of your 3rd form (the one with break statements) where you either continue or return, and that's readable enough, but I definitely wouldn't create a while True block just to avoid the "ugliness" of the other candidates.
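The `continue` variation mentioned above could look something like this when the matching already sits inside a loop over lines. The patterns and the `parse` function are hypothetical, used only to make the sketch runnable:

```python
import re

# Hypothetical patterns for illustration.
regex1 = re.compile(r"(\d+)")
regex2 = re.compile(r"([a-z]+)")

def parse(lines):
    results = []
    for line in lines:
        m = regex1.match(line)
        if m:
            results.append(("number", m.group(1)))
            continue  # done with this line; skip the remaining patterns
        m = regex2.match(line)
        if m:
            results.append(("word", m.group(1)))
            continue
        results.append(("unmatched", line))
    return results
```

The existing `for` loop plays the role of the `while True` block, so no extra loop is needed just for control flow.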
My solution, with an example; only one re.search() is performed per line:
text = '''\
koala + image @ wolf - snow
Good evening, ladies and gentlemen
An uninteresting line
There were 152 ravens on a branch
sea mountain sun ocean ice hot desert river'''
import re
regx3 = re.compile(r'hot[ \t]+([^ ]+)')
regx2 = re.compile(r'(\d+|ev.+?ng)')
regx1 = re.compile(r'([%~#`\@+=\d]+)')
# Join the three patterns into one alternation: one capturing group per pattern
regx = re.compile('|'.join((regx3.pattern,regx2.pattern,regx1.pattern)))
def one_func(line):
    print('I am one_func on : '+line)
def other_func(line):
    print('I am other_func on : '+line)
def another_func(line):
    print('I am another_func on : '+line)
tupl_funcs = (one_func, other_func, another_func)
for line in text.splitlines():
    print(line)
    m = regx.search(line)
    if m:
        print('m.groups() :', m.groups())
        # index of the first non-empty group tells which pattern matched
        group_number = next(i for i, g in enumerate(m.groups()) if g)
        print("group_number :", group_number)
        tupl_funcs[group_number](line)
    else:
        print('No match')
        print('No treatment')
    print()
Result:
koala + image @ wolf - snow
m.groups() : (None, None, '+')
group_number : 2
I am another_func on : koala + image @ wolf - snow
Good evening, ladies and gentlemen
m.groups() : (None, 'evening', None)
group_number : 1
I am other_func on : Good evening, ladies and gentlemen
An uninteresting line
No match
No treatment
There were 152 ravens on a branch
m.groups() : (None, '152', None)
group_number : 1
I am other_func on : There were 152 ravens on a branch
sea mountain sun ocean ice hot desert river
m.groups() : ('desert', None, None)
group_number : 0
I am one_func on : sea mountain sun ocean ice hot desert river
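A variation on the same single-search idea, swapping the group-index lookup for named groups and the match object's `lastgroup` attribute. The group names `hot`, `num` and `sym` and the `dispatch` helper are hypothetical, and the inner capture of the first pattern is dropped to keep one group per alternative:

```python
import re

# One named group per alternative; the names are hypothetical labels.
combined = re.compile(
    r"(?P<hot>hot[ \t]+[^ ]+)"
    r"|(?P<num>\d+|ev.+?ng)"
    r"|(?P<sym>[%~#`@+=\d]+)"
)

def dispatch(line):
    m = combined.search(line)
    if m is None:
        return None
    # lastgroup is the name of the alternative that matched
    return m.lastgroup, m.group(m.lastgroup)
```

The returned name could then index a dict of handler functions instead of a tuple ordered by group number.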
Make a class with the match as state. Instantiate it before the conditional; it should also store the string that you are matching against.
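One possible reading of that suggestion, as a sketch only; the class name `LineMatcher` and its `matches` method are assumptions, not something the answer specifies:

```python
import re

class LineMatcher:
    """Holds the string being tested and the most recent match object."""

    def __init__(self, string):
        self.string = string
        self.match = None

    def matches(self, pattern):
        # Store the result so the branch body can read it afterwards.
        self.match = re.match(pattern, self.string)
        return self.match is not None


lm = LineMatcher("There were 152 ravens")
if lm.matches(r"\d+"):
    kind = "starts with a number"
elif lm.matches(r"There were (\d+)"):
    kind = "count " + lm.match.group(1)
else:
    kind = "unknown"
```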
You can define a class wrapping the match object with a call method to perform the match:
class ReMatcher(object):
    match = None
    def __call__(self, pattern, string):
        self.match = re.match(pattern, string)
        return self.match
    def __getattr__(self, name):
        return getattr(self.match, name)
Then call it in your conditions and use it as if it was a match object in the resulting blocks:
match = ReMatcher()
if match(regex1, line):
    print(match.group(1))
elif match(regex2, line):
    print(match.group(1))
This should work in nearly any Python version, with slight adjustments in versions before new-style classes. As in my other answer, you should use re.compile if you're concerned about regex performance.