Python regex matching in conditionals

I am parsing a file and I want to check each line against a few complicated regexes. Something like this:

if re.match(regex1, line): do stuff
elif re.match(regex2, line): do other stuff
elif re.match(regex3, line): do still more stuff
...

Of course, to do the stuff, I need the match objects. I can only think of three possibilities, each of which leaves something to be desired.

if re.match(regex1, line): 
    m = re.match(regex1, line)
    do stuff
elif re.match(regex2, line):
    m = re.match(regex2, line)
    do other stuff
...

which requires doing the complicated matching twice (these are long files and long regexes :/)

m = re.match(regex1, line)
if m: do stuff
else:
    m = re.match(regex2, line)
    if m: do other stuff
    else:
       ...

which gets terrible as I indent further and further.

while True:
    m = re.match(regex1, line)
    if m:
        do stuff
        break
    m = re.match(regex2, line)
    if m:
        do other stuff
        break
    ...

which just looks weird.

What's the right way to do this?


You could define a function for the action required by each regex and do something like

def dostuff(m):
    stuff

def dootherstuff(m):
    otherstuff

def doevenmorestuff(m):
    evenmorestuff

actions = ((regex1, dostuff), (regex2, dootherstuff), (regex3, doevenmorestuff))

for regex, action in actions:
    m = re.match(regex, line)
    if m:
        action(m)  # pass the match object so the handler can use it
        break
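A minimal runnable sketch of the same dispatch-table idea, assuming each handler receives the match object and returns a result. The patterns, the handler names, and the `dispatch` helper are hypothetical and only illustrate the shape:

```python
import re

# Hypothetical patterns, compiled once up front.
KEY_VALUE = re.compile(r"(\w+)\s*=\s*(.*)")
SECTION = re.compile(r"\[(\w+)\]")

def handle_key_value(m):
    # m is the match object produced by KEY_VALUE
    return ("key_value", m.group(1), m.group(2))

def handle_section(m):
    # m is the match object produced by SECTION
    return ("section", m.group(1))

# Order matters: the first pattern that matches wins.
ACTIONS = (
    (KEY_VALUE, handle_key_value),
    (SECTION, handle_section),
)

def dispatch(line):
    """Call the handler of the first matching pattern; return None if none match."""
    for pattern, action in ACTIONS:
        m = pattern.match(line)
        if m:
            return action(m)
    return None
```

Each line is matched at most once per pattern, and adding a new case means adding one tuple rather than another branch.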


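# regex1, regex2 and regex3 must be compiled patterns (re.compile) for patt.match to work
for patt in (regex1, regex2, regex3):
    match = patt.match(line)
    if match:
        if patt == regex1:
            pass  # some handling
        elif patt == regex2:
            pass  # more
        elif patt == regex3:
            pass  # more
        break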

I like Tim's answer because it separates out the per-regex matching code to keep things simple. For my answer, I wouldn't put more than a line or two of code for each match, and if you need more, call a separate method.


In this particular case there appears to be no convenient way to do this in Python. If Python accepted the syntax:

if (m = re.match(pattern,string)):
    text = m.group(1)

then all would be fine, but plain assignment is a statement in Python, so you cannot do that.
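For readers on newer interpreters: Python 3.8 added assignment expressions (PEP 572), written with `:=`, which come very close to the syntax wished for above. A minimal sketch, with hypothetical patterns and a hypothetical `describe` function:

```python
import re

def describe(line):
    # `:=` binds m and yields the match object, so the chain stays flat.
    # Requires Python 3.8 or later; the patterns are made up for illustration.
    if (m := re.match(r"(\d+)\s+items", line)):
        return ("count", int(m.group(1)))
    elif (m := re.match(r"user:(\w+)", line)):
        return ("user", m.group(1))
    return None
```

Each regex runs once, and the match object is available inside the branch that it selected.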


First off, do you really need to use regexps for your matching? Where I would use regexps in, e.g., Perl, I'll often use string methods in Python (find, startswith, etc.).
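As a rough illustration of the string-method route, here is a sketch that classifies lines without any regex. The line formats and the `classify` function are assumptions made up for the example:

```python
def classify(line):
    """Classify a line using only str methods (hypothetical line formats)."""
    if line.startswith("#"):
        # Everything after the leading '#' is the comment text.
        return ("comment", line[1:].strip())
    # partition returns (before, separator, after); separator is '' if not found.
    key, sep, value = line.partition("=")
    if sep:
        return ("setting", key.strip(), value.strip())
    return ("other", line)
```

This only works when the line formats are simple enough; for genuinely complicated patterns a regex is still the better tool.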

If you really need to use regexps, you can write a small search function that performs the search, stores the result on a holder object so the match stays available, and returns True if there was a match.

e.g.,

def search(pattern, s, store):
    match = re.search(pattern, s)
    store.match = match
    return match is not None

class MatchStore(object):
    pass   # irrelevant, any object with a 'match' attr would do

where = MatchStore()
if search(pattern1, s, where):
    pass  # pattern1 matched; the match object is in where.match
elif search(pattern2, s, where):
    pass  # pattern2 matched; the match object is in where.match
...


Your last suggestion is slightly more Pythonic when wrapped up in a function:

def parse_line(line):
    m = re.match(regex1, line)
    if m:
        do stuff
        return
    m = re.match(regex2, line)
    if m:
        do other stuff
        return
    ...

That said, you can get closer to what you want using a simple container class with some operator overloading:

class ValueCache():
    """A simple container with a returning assignment operator."""
    def __init__(self, value=None):
        self.value = value
    def __repr__(self):
        return "ValueCache({})".format(self.value)
    def set(self, value):
        self.value = value
        return value
    def __call__(self):
        return self.value
    def __lshift__(self, value):
        return self.set(value)
    def __rrshift__(self, value):
        return self.set(value)

match = ValueCache()
if (match << re.match(regex1, line)):
    do stuff with match()
elif (match << re.match(regex2, line)):
    do other stuff with match()


You can define a local function that accepts a regex, tests it against your input, and stores the result to a closure-scoped variable:

def parse_line(line):
    match = None

    def matches(pattern):
        nonlocal match  # `line` is only read, so it needs no declaration
        match = re.match(pattern, line)
        return match

    if matches(regex1):
        pass  # do stuff with `match`

    elif matches(regex2):
        pass  # do other stuff with `match`

I'm not sure how Pythonic that approach is, but it's the cleanest way I've found to do regex matching in an if-elif-else chain and preserve the match objects.

Note that this approach will only work in Python 3.0+ as it requires the PEP 3104 nonlocal statement. In earlier Python versions there's no clean way for a function to assign to a variable in a non-global parent scope.

It's also worth noting that if your file is big enough that you're worried about running a regex twice for each line, you should also pre-compile your patterns with re.compile and pass the resulting regex objects to your check function instead of the raw strings.
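A small sketch of the pre-compilation advice, assuming hypothetical patterns and a hypothetical `first_match` helper. The patterns are compiled once at import time and reused for every line:

```python
import re

# Hypothetical patterns, compiled once rather than on every call.
TIMESTAMP = re.compile(r"(\d{2}):(\d{2}):(\d{2})")
LEVEL = re.compile(r"(DEBUG|INFO|ERROR)\b")

def first_match(line, patterns=(TIMESTAMP, LEVEL)):
    """Return (pattern, match) for the first pattern matching at the start of line."""
    for pattern in patterns:
        m = pattern.match(line)  # compiled patterns expose match/search directly
        if m:
            return pattern, m
    return None, None
```

The `re` module also keeps an internal cache of recently compiled patterns, so the saving from compiling by hand is usually modest; the main benefit is that the patterns get names and are defined in one place.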


I would break your regex up into smaller components and search for simple first with longer matches later.

something like:

if re.match(simplepart, line):
    if re.match(complexregex, line):
        do stuff
elif re.match(othersimple, line):
    if re.match(complexother, line):
        do other stuff


Why not use a dictionary as a switch statement?

def action1(match, line):
    do the stuff 1
def action2(match, line):
    do the stuff 2

regex_action_dict = {regex1: action1, regex2: action2}
for regex, action in regex_action_dict.items():  # use iteritems() on Python 2
    match_object = re.match(regex, line)
    if match_object:
        action(match_object, line)


FWIW, I've stressed over the same thing, and I usually settle for the 2nd form (nested elses) or some variation. I don't think you'll find anything much better in general, if you're looking to optimize readability (many of these answers seem significantly less readable than your candidates to me).

Sometimes if you're in an outer loop or a short function, you can use a variation of your 3rd form (the one with break statements) where you either continue or return, and that's readable enough, but I definitely wouldn't create a while True block just to avoid the "ugliness" of the other candidates.
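For example, when the matching already happens inside a loop over lines, the `continue` variation might look like the sketch below. The patterns and the `parse` function are hypothetical:

```python
import re

# Hypothetical patterns for illustration.
HEADER = re.compile(r"== (.+) ==")
ITEM = re.compile(r"- (.+)")

def parse(lines):
    result = []
    for line in lines:
        m = HEADER.match(line)
        if m:
            result.append(("header", m.group(1)))
            continue  # done with this line, move to the next one
        m = ITEM.match(line)
        if m:
            result.append(("item", m.group(1)))
            continue
        # Fallback when no pattern matched.
        result.append(("text", line))
    return result
```

The existing loop supplies the control flow, so no artificial `while True` block is needed.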


My solution, with an example; only one re.search() is performed per line:

text = '''\
koala + image @ wolf - snow
Good evening, ladies and gentlemen
An uninteresting line
There were 152 ravens on a branch
sea mountain sun ocean ice hot desert river'''

import re
# raw strings keep backslashes such as \d intact for the regex engine
regx3 = re.compile(r'hot[ \t]+([^ ]+)')
regx2 = re.compile(r'(\d+|ev.+?ng)')
regx1 = re.compile(r'([%~#`@+=\d]+)')
regx  = re.compile('|'.join((regx3.pattern,regx2.pattern,regx1.pattern)))

def one_func(line):
    print('I am one_func on : '+line)

def other_func(line):
    print('I am other_func on : '+line)

def another_func(line):
    print('I am another_func on : '+line)

tupl_funcs = (one_func, other_func, another_func)


for line in text.splitlines():
    print(line)
    m = regx.search(line)
    if m:
        print('m.groups() : ',m.groups())
        # index of the first group that captured text tells which sub-pattern matched
        group_number = next(i for i,g in enumerate(m.groups()) if g)
        print("group_number : ",group_number)
        tupl_funcs[group_number](line)
    else:
        print('No match')
        print('No treatment')
    print()

Result:

koala + image @ wolf - snow
m.groups() :  (None, None, '+')
group_number :  2
I am another_func on : koala + image @ wolf - snow

Good evening, ladies and gentlemen
m.groups() :  (None, 'evening', None)
group_number :  1
I am other_func on : Good evening, ladies and gentlemen

An uninteresting line
No match
No treatment

There were 152 ravens on a branch
m.groups() :  (None, '152', None)
group_number :  1
I am other_func on : There were 152 ravens on a branch

sea mountain sun ocean ice hot desert river
m.groups() :  ('desert', None, None)
group_number :  0
I am one_func on : sea mountain sun ocean ice hot desert river
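A variation on the same single-search idea, sketched with named groups: the match object's `lastgroup` attribute gives the name of the alternative that matched, so no scan over `m.groups()` is needed. The group names, patterns, and handlers here are hypothetical, not the ones from the example above:

```python
import re

# One named outer group per alternative (hypothetical patterns).
COMBINED = re.compile(
    r"(?P<hot>hot[ \t]+\S+)"
    r"|(?P<number>\d+)"
    r"|(?P<symbol>[%~#`@+=]+)"
)

def on_hot(m):
    return "hot: " + m.group("hot")

def on_number(m):
    return "number: " + m.group("number")

def on_symbol(m):
    return "symbol: " + m.group("symbol")

# Handlers keyed by group name.
HANDLERS = {"hot": on_hot, "number": on_number, "symbol": on_symbol}

def handle(line):
    m = COMBINED.search(line)
    if m is None:
        return None
    # lastgroup is the name of the last matched capturing group,
    # which here is the outer group of the alternative that matched.
    return HANDLERS[m.lastgroup](m)
```

This relies on every alternative being wrapped in exactly one named outer group; if several alternatives can match at the same position, the leftmost one in the pattern wins.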


Make a class with the match as state. Instantiate it before the conditional; it should store the string that you are matching against as well.
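One way this suggestion could be read, as a minimal sketch. The `LineMatcher` class, its `matches` method, the `parse` function, and the patterns are all hypothetical names chosen for the example:

```python
import re

class LineMatcher:
    """Holds the line being tested and the most recent match object."""

    def __init__(self, line):
        self.line = line
        self.match = None

    def matches(self, pattern):
        # Store the match (or None) as state, and report success as a bool.
        self.match = re.match(pattern, self.line)
        return self.match is not None

def parse(line):
    lm = LineMatcher(line)  # instantiated before the conditional
    if lm.matches(r"(\w+)=(\w+)"):
        return (lm.match.group(1), lm.match.group(2))
    elif lm.matches(r"#\s*(.*)"):
        return ("comment", lm.match.group(1))
    return None
```

Because the object stores the line, each condition only has to name the pattern.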


You can define a class wrapping the match object with a call method to perform the match:

class ReMatcher(object):
    match = None

    def __call__(self, pattern, string):
        self.match = re.match(pattern, string)
        return self.match

    def __getattr__(self, name):
        return getattr(self.match, name)

Then call it in your conditions and use it as if it was a match object in the resulting blocks:

match = ReMatcher()

if match(regex1, line):
    print(match.group(1))

elif match(regex2, line):
    print(match.group(1))

This should work in nearly any Python version, with slight adjustments in versions before new-style classes. As in my other answer, you should use re.compile if you're concerned about regex performance.
