Pyparsing: Attempt to be non-greedy causes infinite loop
I'm trying to create a parser for the RCS file format, but it goes into an infinite loop when parsing RCSid in the context of RCSadmin. Removing the offending line
Group(ZeroOrMore(RCSid)).setResultsName('access') + \
makes the hang go away. RCSid on its own succeeds in parsing every test string. Any suggestions?
Here's what I have:
from pyparsing import *
import string
# Special characters in the RCS file format
special = '$,.:;@'
RCSdigit = Word(nums, min=1, max=1).setName('RCSdigit')
RCSnum = Word(nums + '.').setName('RCSnum')
RCSidchar = CharsNotIn(special + string.whitespace).setName('RCSidchar')
RCSid = Combine(Optional(RCSnum) + ZeroOrMore(RCSidchar +
                ZeroOrMore(RCSidchar | RCSnum))).setName('RCSid')
RCSadmin = \
    Keyword('head').suppress() + \
    Optional(RCSnum).setResultsName('head') + \
    Suppress(';') + \
    Optional(Keyword('branch').suppress() +
             Optional(RCSnum).setResultsName('branch') +
             Suppress(';')
             ) + \
    Keyword('access').suppress() + \
    Group(ZeroOrMore(RCSid)).setResultsName('access') + \
    Suppress(';')
ids = ['.111abc111', '1111abc111', '1.11', '1', '1abc', 'abc',
       'abc1', 'abc1.11', 'abc.1111', '']
for i in ids:
    try:
        print i, RCSid.parseString(i)
    except ParseException, pe:
        print pe.markInputline()
for i in ids:
    line = 'head 3; branch 1; access ' + i + ';'
    try:
        print line, RCSadmin.parseString(line)
    except ParseException, pe:
        print pe.markInputline()
with output (^C at hang):
.111abc111 ['.111abc111']
1111abc111 ['1111abc111']
1.11 ['1.11']
1 ['1']
1abc ['1abc']
abc ['abc']
abc1 ['abc1']
abc1.11 ['abc1.11']
abc.1111 ['abc.1111']
['']
^Chead 3; branch 1; access .111abc111;
Traceback (most recent call last):
File "sample.py", line 35, in <module>
print line, RCSadmin.parseString(line)
File "/usr/lib/pymodules/python2.6/pyparsing.py", line 1070, in parseString
loc, tokens = self._parse( instring, 0 )
File "/usr/lib/pymodules/python2.6/pyparsing.py", line 945, in _parseNoCache
loc,tokens = self.parseImpl( instring, preloc, doActions )
File "/usr/lib/pymodules/python2.6/pyparsing.py", line 2352, in parseImpl
loc, exprtokens = e._parse( instring, loc, doActions )
File "/usr/lib/pymodules/python2.6/pyparsing.py", line 945, in _parseNoCache
loc,tokens = self.parseImpl( instring, preloc, doActions )
File "/usr/lib/pymodules/python2.6/pyparsing.py", line 2604, in parseImpl
return self.expr._parse( instring, loc, doActions, callPreParse=False )
File "/usr/lib/pymodules/python2.6/pyparsing.py", line 945, in _parseNoCache
loc,tokens = self.parseImpl( instring, preloc, doActions )
File "/usr/lib/pymodules/python2.6/pyparsing.py", line 2724, in parseImpl
loc, tmptokens = self.expr._parse( instring, preloc, doActions )
File "/usr/lib/pymodules/python2.6/pyparsing.py", line 945, in _parseNoCache
loc,tokens = self.parseImpl( instring, preloc, doActions )
File "/usr/lib/pymodules/python2.6/pyparsing.py", line 2604, in parseImpl
return self.expr._parse( instring, loc, doActions, callPreParse=False )
File "/usr/lib/pymodules/python2.6/pyparsing.py", line 945, in _parseNoCache
loc,tokens = self.parseImpl( instring, preloc, doActions )
File "/usr/lib/pymodules/python2.6/pyparsing.py", line 2336, in parseImpl
loc, resultlist = self.exprs[0]._parse( instring, loc, doActions, callPreParse=False )
File "/usr/lib/pymodules/python2.6/pyparsing.py", line 943, in _parseNoCache
if self.mayIndexError or loc >= len(instring):
KeyboardInterrupt
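Is an empty string really a valid RCSid? I suspect not. Because your RCSid is built entirely from Optional and ZeroOrMore, it can succeed without consuming any input, so ZeroOrMore(RCSid) keeps matching the empty string at the same position and never terminates. It may be possible for the RCSid to be omitted in the access clause of your admin statement, but you are already handling that with the ZeroOrMore. Define your primitives as they are specified, and then factor in Optional, ZeroOrMore, etc. in the higher-level constructs.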
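To see why an element that can match nothing stalls a repetition, here is a rough sketch using only the standard library. It is not pyparsing's actual implementation; zero_or_more, make_matcher, and the max_steps safety cap are hypothetical names made up for this illustration, and the regular expressions are only loose stand-ins for RCSid.

```python
import re

def zero_or_more(match_one, text, pos=0, max_steps=1000):
    """Naive repetition: keep calling match_one until it fails.

    match_one(text, pos) returns the new position, or None on failure.
    max_steps exists only in this demo, so a stuck loop raises an
    error instead of hanging.
    """
    for _ in range(max_steps):
        new_pos = match_one(text, pos)
        if new_pos is None:
            return pos          # element failed: repetition is done
        pos = new_pos           # element matched: try again from here
    raise RuntimeError("element keeps matching without consuming input")

def make_matcher(pattern):
    """Wrap a regex as a match_one(text, pos) function."""
    regex = re.compile(pattern)
    def match_one(text, pos):
        m = regex.match(text, pos)
        return m.end() if m else None
    return match_one

# Loose stand-ins for the two RCSid definitions (assumption, not the real grammar).
may_be_empty = make_matcher(r"[a-z0-9.]*")   # can match the empty string
never_empty = make_matcher(r"[a-z0-9.]+")    # must consume at least one character

text = "abc1.11;"
end_ok = zero_or_more(never_empty, text)     # stops in front of ';'

try:
    zero_or_more(may_be_empty, text)
    stuck = False
except RuntimeError:
    stuck = True                             # empty match repeated at ';' forever
```

With never_empty the loop ends as soon as the element fails at the ';'. With may_be_empty the element "succeeds" at the ';' on every pass without moving forward, which is the same situation ZeroOrMore(RCSid) runs into in the question.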
Changing RCSid to:
RCSid = Combine(RCSnum + ZeroOrMore(RCSidchar + ZeroOrMore(RCSidchar | RCSnum))
                |
                OneOrMore(RCSidchar + ZeroOrMore(RCSidchar | RCSnum))).setName('RCSid')
gives me a result that still matches all your test cases (except that it no longer matches ''), and properly parses the full RCSadmin strings.
EDIT: Here is my complete parser, which works with pyparsing 1.5.6:
from pyparsing import *
import string
# Special characters in the RCS file format
special = '$,.:;@'
RCSdigit = Word(nums, min=1, max=1).setName('RCSdigit')
RCSnum = Word(nums + '.').setName('RCSnum')
RCSidchar = CharsNotIn(special + string.whitespace).setName('RCSidchar')
# Original definition, which could match an empty string:
#~ RCSid = Combine(Optional(RCSnum) + ZeroOrMore(RCSidchar +
#~                 ZeroOrMore(RCSidchar | RCSnum))).setName('RCSid')
# Revised definition: every alternative must consume at least one character
RCSid = Combine(RCSnum + ZeroOrMore(RCSidchar + ZeroOrMore(RCSidchar | RCSnum))
                |
                OneOrMore(RCSidchar + ZeroOrMore(RCSidchar | RCSnum))).setName('RCSid')
RCSadmin = \
    Keyword('head').suppress() + \
    Optional(RCSnum).setResultsName('head') + \
    Suppress(';') + \
    Optional(Keyword('branch').suppress() +
             Optional(RCSnum).setResultsName('branch') +
             Suppress(';')
             ) + \
    Keyword('access').suppress() + \
    Group(ZeroOrMore(RCSid)).setResultsName('access') + \
    Suppress(';')