Regex redefinition error
I am using Python and ran into a group redefinition error. I know the group names are redefined, but logically the two definitions can never both be reached in the same match, since they sit in different branches of an "or". Is there a way to get around this? Thanks in advance for any help.
/python-2.5/lib/python2.5/re.py", line 233, in _compile
raise error, v # invalid expression
sre_constants.error: redefinition of group name 'id' as group 9; was group 6
import re
DOB_RE = "(^|;)DOB +(?P<dob>\d{2}-\d{2}-\d{4})"
ID_RE = "(^|;)ID +(?P<id>[A-Z0-9]{12})"
INFO_RE = "- (?P<info>.*)"
PERSON_RE = "((" + DOB_RE + ".*" + ID_RE + ")|(" + \
ID_RE + ".*" + DOB_RE + ")|(" + \
DOB_RE + "|" + ID_RE + ")).*(" + INFO_RE + ")*"
PARSER = re.compile(PERSON_RE)
samplestr1 = "garbage;DOB 10-10-2010;more garbage\nID PARI12345678;more garbage"
samplestr2 = "garbage;ID PARI12345678;more garbage\nDOB 10-10-2010;more garbage"
samplestr3 = "garbage;DOB 10-10-2010"
samplestr4 = "garbage;ID PARI12345678;more garbage- I am cool"
Regular expression syntax simply does not allow multiple occurrences of identically-named groups -- groups that aren't "reached" are defined to be "empty" (None) on a match.
So you have to change those names, e.g. to dob0, dob1, dob2 and id0, id1, id2 (then you can easily "collapse" these sets of keys to make the dict you actually want after you have a groups dictionary from a match).
E.g., make DOB_RE a function instead of a constant, say:
def DOB_RE(i): return "(^|;)DOB +(?P<dob%s>\d{2}-\d{2}-\d{4})" % i
and similarly for the others, and change the three occurrences of DOB_RE in the statement where you compute PERSON_RE to DOB_RE(0), DOB_RE(1), etc. (and similarly for the others).
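To make the renaming and the "collapse" step concrete, here is a minimal sketch in Python 3. The helper names dob_re, id_re and collapse are made up for illustration, and the info part is left out to keep it short. It also assumes re.MULTILINE and re.DOTALL, which the question did not use, so that ^ matches after the \n in the sample strings and .* can cross it.

```python
import re

# Hypothetical helpers: each returns a sub-pattern whose group name
# carries a numeric suffix, so no name is defined twice.
def dob_re(i):
    return r"(?:^|;)DOB +(?P<dob%s>\d{2}-\d{2}-\d{4})" % i

def id_re(i):
    return r"(?:^|;)ID +(?P<id%s>[A-Z0-9]{12})" % i

PERSON_RE = "|".join([
    dob_re(0) + ".*" + id_re(0),   # DOB first, then ID
    id_re(1) + ".*" + dob_re(1),   # ID first, then DOB
    dob_re(2),                     # DOB only
    id_re(2),                      # ID only
])

# Assumption: MULTILINE and DOTALL are added here; the question's
# pattern was compiled without flags.
PARSER = re.compile(PERSON_RE, re.MULTILINE | re.DOTALL)

def collapse(match):
    """Fold dob0/dob1/dob2 and id0/id1/id2 into plain 'dob' and 'id' keys."""
    person = {}
    for key, value in match.groupdict().items():
        if value is not None:          # groups in untaken branches are None
            person[key.rstrip("0123456789")] = value
    return person
```

Calling collapse(PARSER.search(samplestr1)) should then give a dict with just the dob and id keys, whichever branch happened to match.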
I was originally going to post a pyparsing example using the Each class (which picks out expressions that can be in any order), but then I saw that there was intermixed garbage, so searching through your string using searchString seemed a better fit. This intrigued me because searchString returns a sequence of ParseResults, one for each match (including any corresponding named results). So I thought, "What if I combine the returned ParseResults using sum - what a hack!", er, "How novel!" So here's a never-before-seen pyparsing hack:
from pyparsing import *
# define the separate expressions to be matched, with results names
dob_ref = "DOB" + Regex(r"\d{2}-\d{2}-\d{4}")("dob")
id_ref = "ID" + Word(alphanums,exact=12)("id")
info_ref = "-" + restOfLine("info")
# create an overall expression
person_data = dob_ref | id_ref | info_ref
for test in (samplestr1,samplestr2,samplestr3,samplestr4,):
    # retrieve a list of separate matches
    separate_results = person_data.searchString(test)
    # combine the results using sum
    # (NO ONE HAS EVER DONE THIS BEFORE!)
    person = sum(separate_results, ParseResults([]))
    # now we have a uber-ParseResults object!
    print person.id
    print person.dump()
    print
Giving this output:
PARI12345678
['DOB', '10-10-2010', 'ID', 'PARI12345678']
- dob: 10-10-2010
- id: PARI12345678
PARI12345678
['ID', 'PARI12345678', 'DOB', '10-10-2010']
- dob: 10-10-2010
- id: PARI12345678
['DOB', '10-10-2010']
- dob: 10-10-2010
PARI12345678
['ID', 'PARI12345678', '-', ' I am cool']
- id: PARI12345678
- info: I am cool
But I do also speak regex. Here is a similar approach using re's.
import re
# define each individual re, with group names
dobRE = r"DOB +(?P<dob>\d{2}-\d{2}-\d{4})"
idRE = r"ID +(?P<id>[A-Z0-9]{12})"
infoRE = r"- (?P<info>.*)"
# one re to rule them all
person_dataRE = re.compile('|'.join([dobRE, idRE, infoRE]))
# using findall with person_dataRE will return a 3-tuple, so let's create
# a tuple-merger
merge = lambda a,b : tuple(aa or bb for aa,bb in zip(a,b))
# let's create a Person class to collect the different data bits
# (or if you are running Py2.6, use a namedtuple)
class Person:
    def __init__(self,*args):
        self.dob, self.id, self.info = args
    def __str__(self):
        return "- id: %s\n- dob: %s\n- info: %s" % (self.id, self.dob, self.info)
for test in (samplestr1,samplestr2,samplestr3,samplestr4,):
    # could have used reduce here, but let's err on the side of explicitness
    persontuple = ('','','')
    for data in person_dataRE.findall(test):
        persontuple = merge(persontuple,data)
    # make a person
    person = Person(*persontuple)
    # print out the collected results
    print person.id
    print person
    print
With this output:
PARI12345678
- id: PARI12345678
- dob: 10-10-2010
- info:
PARI12345678
- id: PARI12345678
- dob: 10-10-2010
- info:
- id:
- dob: 10-10-2010
- info:
PARI12345678
- id: PARI12345678
- dob:
- info: I am cool
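The tuple merge above relies on findall returning the groups in a fixed order. As a variation (a Python 3 sketch, not the code from the answer), finditer plus groupdict merges the same data by group name instead. PERSON_DATA_RE and parse_person are names invented for this example.

```python
import re

# Same three alternatives as above, each with its own named group.
PERSON_DATA_RE = re.compile(
    r"DOB +(?P<dob>\d{2}-\d{2}-\d{4})"
    r"|ID +(?P<id>[A-Z0-9]{12})"
    r"|- (?P<info>.*)"
)

def parse_person(text):
    """Merge the named groups of every match in text into one dict."""
    person = {"dob": "", "id": "", "info": ""}
    for match in PERSON_DATA_RE.finditer(text):
        for name, value in match.groupdict().items():
            # keep the first non-empty value seen for each field,
            # like the `aa or bb` merge does
            if value and not person[name]:
                person[name] = value
    return person
```

Since every group name appears only once in the combined pattern, the redefinition error never comes up, and the keys of the result no longer depend on the order of the alternatives.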
Perhaps in this case it is better to loop through a list of regular expressions.
>>> strs=[
... "garbage;DOB 10-10-2010;more garbage\nID PARI12345678;more garbage",
... "garbage;ID PARI12345678;more garbage\nDOB 10-10-2010;more garbage",
... "garbage;DOB 10-10-2010",
... "garbage;ID PARI12345678;more garbage- I am cool"]
>>> import re
>>>
>>> DOB_RE = "(^|;|\n)DOB +(?P<dob>\d{2}-\d{2}-\d{4})"
>>> ID_RE = "(^|;|\n)ID +(?P<id>[A-Z0-9]{12})"
>>> INFO_RE = "(- (?P<info>.*))?"
>>>
>>> REGEX = map(re.compile,[DOB_RE + ".*" + ID_RE + "[^-]*" + INFO_RE,
... ID_RE + ".*" + DOB_RE + "[^-]*" + INFO_RE,
... DOB_RE + "[^-]*" + INFO_RE,
... ID_RE + "[^-]*" + INFO_RE])
>>>
>>> def get_person(s):
...     for regex in REGEX:
...         res = re.search(regex,s)
...         if res:
...             return res.groupdict()
...
>>> for s in strs:
...     print get_person(s)
...
{'dob': '10-10-2010', 'info': None, 'id': 'PARI12345678'}
{'dob': '10-10-2010', 'info': None, 'id': 'PARI12345678'}
{'dob': '10-10-2010', 'info': None}
{'info': 'I am cool', 'id': 'PARI12345678'}
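If you try the session above on Python 3, note that map() returns a one-shot iterator there, so REGEX would be used up after the first pass through it. A rough Python 3 sketch of the same idea, with a list instead (REGEXES is a name chosen for this example, and the separator groups are made non-capturing, which is not in the original):

```python
import re

DOB_RE = r"(?:^|;|\n)DOB +(?P<dob>\d{2}-\d{2}-\d{4})"
ID_RE = r"(?:^|;|\n)ID +(?P<id>[A-Z0-9]{12})"
INFO_RE = r"(?:- (?P<info>.*))?"

# A list rather than map(): it can be iterated on every call.
REGEXES = [re.compile(p) for p in (
    DOB_RE + ".*" + ID_RE + "[^-]*" + INFO_RE,   # DOB, then ID
    ID_RE + ".*" + DOB_RE + "[^-]*" + INFO_RE,   # ID, then DOB
    DOB_RE + "[^-]*" + INFO_RE,                  # DOB only
    ID_RE + "[^-]*" + INFO_RE,                   # ID only
)]

def get_person(s):
    """Return the groupdict of the first pattern that matches, else None."""
    for regex in REGEXES:
        res = regex.search(s)
        if res:
            return res.groupdict()
    return None
```

Each pattern is compiled on its own, so dob and id are only defined once per pattern and the most specific pattern that matches wins.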