Pyparsing: Attempt to be non-greedy causes infinite loop

I'm trying to create a parser for the RCS file format, but it goes into an infinite loop when it tries to parse RCSid in the context of RCSadmin. Removing the offending line

        Group(ZeroOrMore(RCSid)).setResultsName('access') + \

makes the hang go away. RCSid on its own parses the test strings successfully. Any suggestions?

Here's what I have:

from pyparsing import *
import string

# Special characters in the RCS file format
special = '$,.:;@'

RCSdigit = Word(nums, min=1, max=1).setName('RCSdigit')
RCSnum = Word(nums + '.').setName('RCSnum')
RCSidchar = CharsNotIn(special + string.whitespace).setName('RCSidchar')
RCSid = Combine(Optional(RCSnum) + ZeroOrMore(RCSidchar +
        ZeroOrMore(RCSidchar | RCSnum))).setName('RCSid')
RCSadmin = \
    Keyword('head').suppress() + \
        Optional(RCSnum).setResultsName('head') + \
        Suppress(';') + \
    Optional(Keyword('branch').suppress() +
        Optional(RCSnum).setResultsName('branch') +
        Suppress(';')
    ) + \
    Keyword('access').suppress() + \
        Group(ZeroOrMore(RCSid)).setResultsName('access') + \
        Suppress(';')

ids = ['.111abc111', '1111abc111', '1.11', '1', '1abc', 'abc',
        'abc1', 'abc1.11', 'abc.1111', '']
for i in ids:
    try:
        print i, RCSid.parseString(i)
    except ParseException, pe:
        print pe.markInputline()
for i in ids:
    line = 'head 3; branch 1; access ' + i + ';'
    try:
        print line, RCSadmin.parseString(line)
    except ParseException, pe:
        print pe.markInputline()

with output (^C at hang):

.111abc111 ['.111abc111']
1111abc111 ['1111abc111']
1.11 ['1.11']
1 ['1']
1abc ['1abc']
abc ['abc']
abc1 ['abc1']
abc1.11 ['abc1.11']
abc.1111 ['abc.1111']
 ['']
^Chead 3; branch 1; access .111abc111;
Traceback (most recent call last):
  File "sample.py", line 35, in <module>
    print line, RCSadmin.parseString(line)
  File "/usr/lib/pymodules/python2.6/pyparsing.py", line 1070, in parseString
    loc, tokens = self._parse( instring, 0 )
  File "/usr/lib/pymodules/python2.6/pyparsing.py", line 945, in _parseNoCache
    loc,tokens = self.parseImpl( instring, preloc, doActions )
  File "/usr/lib/pymodules/python2.6/pyparsing.py", line 2352, in parseImpl
    loc, exprtokens = e._parse( instring, loc, doActions )
  File "/usr/lib/pymodules/python2.6/pyparsing.py", line 945, in _parseNoCache
    loc,tokens = self.parseImpl( instring, preloc, doActions )
  File "/usr/lib/pymodules/python2.6/pyparsing.py", line 2604, in parseImpl
    return self.expr._parse( instring, loc, doActions, callPreParse=False )
  File "/usr/lib/pymodules/python2.6/pyparsing.py", line 945, in _parseNoCache
    loc,tokens = self.parseImpl( instring, preloc, doActions )
  File "/usr/lib/pymodules/python2.6/pyparsing.py", line 2724, in parseImpl
    loc, tmptokens = self.expr._parse( instring, preloc, doActions )
  File "/usr/lib/pymodules/python2.6/pyparsing.py", line 945, in _parseNoCache
    loc,tokens = self.parseImpl( instring, preloc, doActions )
  File "/usr/lib/pymodules/python2.6/pyparsing.py", line 2604, in parseImpl
    return self.expr._parse( instring, loc, doActions, callPreParse=False )
  File "/usr/lib/pymodules/python2.6/pyparsing.py", line 945, in _parseNoCache
    loc,tokens = self.parseImpl( instring, preloc, doActions )
  File "/usr/lib/pymodules/python2.6/pyparsing.py", line 2336, in parseImpl
    loc, resultlist = self.exprs[0]._parse( instring, loc, doActions, callPreParse=False )
  File "/usr/lib/pymodules/python2.6/pyparsing.py", line 943, in _parseNoCache
    if self.mayIndexError or loc >= len(instring):
KeyboardInterrupt


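Is an empty string really a valid RCSid? I suspect not. Because your RCSid can succeed without consuming any input, ZeroOrMore(RCSid) can keep matching the empty string at the same position and never advance, which is the hang you are seeing. It may be possible for the RCSid to be omitted in the access clause of your admin statement, but you are already handling that with the ZeroOrMore. Define your primitives as they are specified, and then factor in Optional, ZeroOrMore, etc. in the higher-level constructs.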
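
To make the failure mode concrete, here is a toy model in plain Python 3 of a repetition wrapped around a matcher that can succeed on zero characters. It is only a sketch and not pyparsing's actual implementation: `zero_or_more`, `maybe_empty_id`, `nonempty_id`, the simplified `ID_CHARS` set and the `max_steps` guard are all hypothetical names invented for the illustration.

```python
# Simplified character set for the demo; the real RCS rules are richer.
ID_CHARS = set("abcdefghijklmnopqrstuvwxyz0123456789.")


def maybe_empty_id(text, pos):
    """Toy matcher that, like the original RCSid, succeeds on zero characters."""
    end = pos
    while end < len(text) and text[end] in ID_CHARS:
        end += 1
    return end  # never fails; may return pos unchanged


def nonempty_id(text, pos):
    """Toy matcher that requires at least one character, like the revised RCSid."""
    end = maybe_empty_id(text, pos)
    return end if end > pos else None  # None signals "no match"


def zero_or_more(matcher, text, pos, max_steps=1000):
    """Naive repetition: keep applying matcher until it reports failure.

    max_steps is a guard added for this demo so the runaway loop can be
    observed safely instead of hanging the interpreter.
    """
    count = 0
    while True:
        new_pos = matcher(text, pos)
        if new_pos is None:
            return pos, count
        count += 1
        if count >= max_steps:
            raise RuntimeError("matcher keeps succeeding without consuming input")
        pos = new_pos


# The non-empty matcher consumes 'abc1.11', fails at ';', and the loop ends.
print(zero_or_more(nonempty_id, "abc1.11;", 0))  # prints (7, 1)

# The empty-capable matcher succeeds at ';' forever; only the guard stops it.
try:
    zero_or_more(maybe_empty_id, "abc1.11;", 0)
except RuntimeError as err:
    print("stuck:", err)
```

The repetition can only terminate when the inner matcher eventually fails, so the inner matcher has to consume at least one character every time it succeeds.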

Changing RCSid to:

RCSid = Combine(RCSnum + ZeroOrMore(RCSidchar + ZeroOrMore(RCSidchar | RCSnum))
                |
                OneOrMore(RCSidchar + ZeroOrMore(RCSidchar | RCSnum))).setName('RCSid')

gives me a result that still matches all your test cases (except that it no longer matches ''), and properly parses the full RCSadmin strings.

EDIT: Here is my complete parser, which works with pyparsing 1.5.6:

from pyparsing import *
import string

# Special characters in the RCS file format
special = '$,.:;@'

RCSdigit = Word(nums, min=1, max=1).setName('RCSdigit')
RCSnum = Word(nums + '.').setName('RCSnum')
RCSidchar = CharsNotIn(special + string.whitespace).setName('RCSidchar')
#~ RCSid = Combine(Optional(RCSnum) + ZeroOrMore(RCSidchar +
        #~ ZeroOrMore(RCSidchar | RCSnum))).setName('RCSid')
RCSid = Combine(RCSnum + ZeroOrMore(RCSidchar + ZeroOrMore(RCSidchar | RCSnum))
                |
                OneOrMore(RCSidchar + ZeroOrMore(RCSidchar | RCSnum))).setName('RCSid')
RCSadmin = \
    Keyword('head').suppress() + \
        Optional(RCSnum).setResultsName('head') + \
        Suppress(';') + \
    Optional(Keyword('branch').suppress() +
        Optional(RCSnum).setResultsName('branch') + 
        Suppress(';')
    ) + \
    Keyword('access').suppress() + \
        Group(ZeroOrMore(RCSid)).setResultsName('access') + \
        Suppress(';') 