How can I build a regular expression that has an optional part?

How can I build a regular expression in Python that matches all of the following? Each line is a string (a-zA-Z), followed by a space, followed by one or more groups of 4 integers, each group followed by a comma:

Example:

someotherstring 42 1 48 17,

somestring 363 1 46 17,363 1 34 17,401 3 8 14,

otherstring 42 1 48 17,363 1 34 17,

I have tried the following, since I need to capture each integer:

myRE = re.compile(r"(\s+) ((\d+) (\d+) (\d+) (\d+),)+")

But how can I find out how many groups of 4 integers I have, and how can I process each of them?

Thank you.


>>> test = "somestring 363 1 46 17,363 1 34 17,401 3 8 14,"

Here is a pyparsing processor for your input string:

>>> from pyparsing import *
>>> integer = Word(nums)
>>> patt = Word(alphas) + OneOrMore(Group(integer*4 + Suppress(',')))

Using patt.parseString returns a pyparsing ParseResults object, which has some nice list/dict/object properties. First, just printing out the results as a list:

>>> patt.parseString(test).asList()
['somestring', ['363', '1', '46', '17'], ['363', '1', '34', '17'], ['401', '3', '8', '14']]

See how each of your groups is grouped as a sublist?

Now let's have the parser do a bit more work for us. At parse time, we already know we are parsing valid integers - anything matching Word(nums) has to be an integer. So we can add a parse action to do this conversion at parse time:

>>> integer = Word(nums).setParseAction(lambda tokens:int(tokens[0]))

Now, we recreate our pattern, and parsing now gives us groups of numbers:

>>> patt = Word(alphas) + OneOrMore(Group(integer*4 + Suppress(',')))
>>> patt.parseString(test).asList()
['somestring', [363, 1, 46, 17], [363, 1, 34, 17], [401, 3, 8, 14]]

Lastly, we can also assign names to the bits parsed out of this input:

>>> patt = Word(alphas)("desc") + OneOrMore(Group(integer*4 + Suppress(',')))("numgroups")

The list of returned items is the same:

>>> patt.parseString(test).asList()
['somestring', [363, 1, 46, 17], [363, 1, 34, 17], [401, 3, 8, 14]]

But if we dump() the results, we see what we can access by name:

>>> print(patt.parseString(test).dump())
['somestring', [363, 1, 46, 17], [363, 1, 34, 17], [401, 3, 8, 14]]
- desc: somestring
- numgroups: [[363, 1, 46, 17], [363, 1, 34, 17], [401, 3, 8, 14]]

We can use those names for dict-like or attribute-like access. I'm partial to the attribute style myself:

>>> res = patt.parseString(test)
>>> print(res.desc)
somestring
>>> print(res.numgroups)
[[363, 1, 46, 17], [363, 1, 34, 17], [401, 3, 8, 14]]
>>> for ng in res.numgroups: print(sum(ng))
...
427
415
426

Here is the entire parser and output processor:

test = "somestring 363 1 46 17,363 1 34 17,401 3 8 14,"
from pyparsing import *
integer = Word(nums).setParseAction(lambda tokens:int(tokens[0]))
patt = Word(alphas)("desc") + \
    OneOrMore(Group(integer*4 + Suppress(',')))("numgroups")

print(patt.parseString(test).asList())
print(patt.parseString(test).dump())
res = patt.parseString(test)
print(res.desc)
print(res.numgroups)
for ng in res.numgroups:
    print(sum(ng))
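For comparison, roughly the same result can be had with the standard re module alone, which is closer to what the question asked for. Python's re keeps only the last match of a repeated capture group, so this sketch first validates the whole line and then pulls the groups out with a second pattern. The names LINE_RE, GROUP_RE and parse_line are made up for illustration, and the pattern assumes exactly one space between fields, as in the sample lines.

```python
import re

# One word, a space, then one or more "d d d d," groups.
LINE_RE = re.compile(r"([A-Za-z]+) ((?:\d+ \d+ \d+ \d+,)+)")
# A single group of four integers followed by a comma.
GROUP_RE = re.compile(r"(\d+) (\d+) (\d+) (\d+),")


def parse_line(line):
    """Return (name, list of 4-int lists), or None if the line does not match."""
    m = LINE_RE.fullmatch(line)
    if m is None:
        return None
    # findall returns one tuple of four strings per group.
    groups = [[int(n) for n in g] for g in GROUP_RE.findall(m.group(2))]
    return m.group(1), groups


name, groups = parse_line("somestring 363 1 46 17,363 1 34 17,401 3 8 14,")
print(name)         # somestring
print(len(groups))  # 3
for g in groups:
    print(g, sum(g))
```

len(groups) answers the "how many groups of 4 integers" part of the question, and each inner list can be processed on its own.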


As your data units (as I called them above) are separated by a comma and a space, you could still use split :)

data = "someotherstring 42 1 48 17, somestring 363 1 46 17,363 1 34 17,401 3 8 14, otherstring 42 1 48 17,363 1 34 17"

data_items = data.split(', ')
for item in data_items:
    section_title, intdata = item.split(' ', 1)
    print('Processing %s' % section_title)
    for ints in intdata.split(','):
        a, b, c, d = [int(x) for x in ints.split()]
        # do your stuff ...
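
Note that the sample lines in the question end with a trailing comma, which leaves an empty piece after split(',') and would break the four-way unpacking above. A minimal variation, assuming one record per string as in the question, simply skips empty pieces. parse_record is a hypothetical helper name, not part of the original answer.

```python
def parse_record(record):
    """Split 'name a b c d,a b c d,' into (name, [(a, b, c, d), ...])."""
    name, intdata = record.strip().split(' ', 1)
    quads = []
    for chunk in intdata.split(','):
        if not chunk.strip():
            continue  # empty piece left behind by a trailing comma
        # Unpacking raises ValueError if a chunk does not hold exactly 4 integers.
        a, b, c, d = (int(x) for x in chunk.split())
        quads.append((a, b, c, d))
    return name, quads


name, quads = parse_record("otherstring 42 1 48 17,363 1 34 17,")
print(name, len(quads), quads)
```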


import re
str_in = "someotherstring 42 1 48 17, somestring 363 1 46 17,363 1 34 17,401 3 8 14, otherstring 42 1 48 17,363 1 34 17,"
list_out = re.split("[\\s,]", str_in)

list_out then contains a list where the name of each section is followed by all the integers (still as strings), then a blank entry (useful for delimiting sections), and so on:

['someotherstring', '42', '1', '48', '17', '', 'somestring', '363', '1', '46', '17', '363', '1', '34', '17', '401', '3', '8', '14', '', 'otherstring', '42', '1', '48', '17', '363', '1', '34', '17', '']
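
One way to make use of those blank entries is sketched below, assuming every section starts with its name and the integers that follow come in fours. split_sections is a hypothetical helper, not part of the original answer.

```python
import re


def split_sections(str_in):
    """Group the flat re.split output into (name, [[a, b, c, d], ...]) pairs."""
    sections = []
    current = []
    # The extra "" makes sure the last section is closed even when the
    # input does not end with a comma.
    for token in re.split(r"[\s,]", str_in) + [""]:
        if token:
            current.append(token)
        elif current:
            # A blank entry ends the section collected so far.
            numbers = [int(t) for t in current[1:]]
            quads = [numbers[i:i + 4] for i in range(0, len(numbers), 4)]
            sections.append((current[0], quads))
            current = []
    return sections


str_in = "someotherstring 42 1 48 17, somestring 363 1 46 17,363 1 34 17,401 3 8 14, otherstring 42 1 48 17,363 1 34 17,"
for name, quads in split_sections(str_in):
    print(name, len(quads), quads)
```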