Extract parts of a string in Python
I have to parse an input string in Python and extract certain parts from it.
The format of the string is
(xx,yyy,(aa,bb,...)) // Inner parentheses can hold one or more items
I want a function to return xx, yyy and a list containing aa, bb, ... etc.
I can of course do it by splitting on the parentheses and so on, but I want to know if there is a proper Pythonic way of extracting such info from a string.
I have this code, which works, but is there a better way to do it (without regex)?
def processInput(inputStr):
    value = inputStr.strip()[1:-1]
    parts = value.split(',', 2)
    return parts[0], parts[1], (parts[2].strip()[1:-1]).split(',')
If you're allergic to REs, you could use pyparsing:
>>> import pyparsing as p
>>> ope, clo, com = map(p.Suppress, '(),')
>>> w = p.Word(p.alphas)
>>> s = ope + w + com + w + com + ope + p.delimitedList(w) + clo + clo
>>> x = '(xx,yyy,(aa,bb,cc))'
>>> list(s.parseString(x))
['xx', 'yyy', 'aa', 'bb', 'cc']
pyparsing also makes it easy to control the exact form of results (e.g. by grouping the last 3 items into their own sublist), if you want. But I think the nicest aspect is how natural (depending on how much space you want to devote to it) you can make the "grammar specification" read: an open paren, a word, a comma, a word, a comma, an open paren, a delimited list of words, two closed parentheses (if you find the assignment to s above not so easy to read, I guess it's my fault for not choosing longer identifiers ;-).
If your parenthesis nesting can be arbitrarily deep, then regexes won't do; you'll need a state machine or a parser. Pyparsing supports recursive grammars using the forward-declaration class Forward:
from pyparsing import *
LPAR,RPAR,COMMA = map(Suppress,"(),")
# Forward() declares the grammar before it is defined, so it can refer to itself
nestedParens = Forward()
listword = Word(alphas) | '...'
nestedParens << Group(LPAR + delimitedList(listword | nestedParens) + RPAR)
text = "(xx,yyy,(aa,bb,...))"
results = nestedParens.parseString(text).asList()
print(results)
text = "(xx,yyy,(aa,bb,(dd,ee),ff,...))"
results = nestedParens.parseString(text).asList()
print(results)
Prints:
[['xx', 'yyy', ['aa', 'bb', '...']]]
[['xx', 'yyy', ['aa', 'bb', ['dd', 'ee'], 'ff', '...']]]
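If installing pyparsing is not an option, the "state machine or parser" route can also be sketched by hand with only the standard library. The following is a minimal recursive-descent sketch, not a replacement for the grammar above: the name parse_nested is made up for illustration, it assumes there is no whitespace around the parentheses, and it returns the nested list directly rather than wrapped in an outer list as pyparsing does.

```python
def parse_nested(text):
    """Parse '(a,b,(c,d))' into nested lists with a small recursive parser."""
    text = text.strip()
    pos = 0  # current read position, shared with the nested function

    def parse_group():
        nonlocal pos
        if pos >= len(text) or text[pos] != "(":
            raise ValueError("expected '(' at position %d" % pos)
        pos += 1  # consume '('
        items = []
        while True:
            if pos < len(text) and text[pos] == "(":
                # A nested group: recurse, which also consumes its ')'
                items.append(parse_group())
            else:
                # A plain item: read up to the next delimiter
                start = pos
                while pos < len(text) and text[pos] not in ",()":
                    pos += 1
                items.append(text[start:pos].strip())
            if pos >= len(text):
                raise ValueError("unbalanced parentheses")
            if text[pos] == ",":
                pos += 1  # consume ',' and read the next item
            elif text[pos] == ")":
                pos += 1  # consume ')' and close this group
                return items
            else:
                raise ValueError("unexpected '(' at position %d" % pos)

    result = parse_group()
    if pos != len(text):
        raise ValueError("trailing characters after position %d" % pos)
    return result


print(parse_nested("(xx,yyy,(aa,bb,(dd,ee),ff,...))"))
# ['xx', 'yyy', ['aa', 'bb', ['dd', 'ee'], 'ff', '...']]
```

Each call to parse_group handles one level of parentheses, so the nesting depth is limited only by Python's recursion limit.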
Let's use regular expressions!
/\(([^,]+),([^,]+),\(([^)]+)\)\)/
Match against that: the first capturing group contains xx, the second contains yyy; split the third on "," and you have your list.
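In Python, that pattern could be applied roughly as follows. This is a sketch using the standard re module; the function name extract_parts is hypothetical, and re.fullmatch is used here on the assumption that the whole string should match the format.

```python
import re

# The pattern from above, written as a Python raw string (no surrounding slashes).
PATTERN = re.compile(r"\(([^,]+),([^,]+),\(([^)]+)\)\)")


def extract_parts(text):
    """Return (first, second, items), or None if the text does not match."""
    match = PATTERN.fullmatch(text.strip())
    if match is None:
        return None
    first, second, inner = match.groups()
    # The third group holds the inner list as one string; split it on commas.
    return first, second, inner.split(",")


print(extract_parts("(xx,yyy,(aa,bb,cc))"))
# ('xx', 'yyy', ['aa', 'bb', 'cc'])
```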
How about like this?
>>> import ast
>>> import re
>>>
>>> s="(xx,yyy,(aa,bb,ccc))"
>>> x=re.sub(r"(\w+)",r'"\1"',s)
# '("xx","yyy",("aa","bb","ccc"))'
>>> ast.literal_eval(x)
('xx', 'yyy', ('aa', 'bb', 'ccc'))
>>>
I don't know that this is better, but it's a different way to do it, using a regex as previously suggested:
import re
def processInput(inputStr):
    # Split on commas, then strip any parentheses from each piece
    value = [re.sub(r'\(*\)*','',i) for i in inputStr.split(',')]
    return value[0], value[1], value[2:]
Alternatively, you could use two chained replace functions in lieu of the regex.
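A minimal sketch of that regex-free variant, assuming the same flat format as in the question (the function name process_input_no_regex is made up for illustration):

```python
def process_input_no_regex(input_str):
    # Remove every parenthesis with two chained str.replace calls,
    # then split what is left on commas.
    cleaned = input_str.strip().replace("(", "").replace(")", "")
    values = cleaned.split(",")
    return values[0], values[1], values[2:]


print(process_input_no_regex("(xx,yyy,(aa,bb,cc))"))
# ('xx', 'yyy', ['aa', 'bb', 'cc'])
```

Like the regex version, this flattens the input, so it only suits strings with a single level of inner parentheses.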
Your solution is decent (simple, efficient). You could use regular expressions to restrict the syntax if you don't trust your data source.
import re
parser_re = re.compile(r'\(([^,)]+),([^,)]+),\(([^)]+)\)')
def parse(text):
    m = parser_re.match(text)
    if m:
        first = m.group(1)
        second = m.group(2)
        rest = m.group(3).split(",")
        return (first, second, rest)
    else:
        return None
print(parse( '(xx,yy,(aa,bb,cc,dd))' ))
print(parse( 'xx,yy,(aa,bb,cc,dd)' )) # doesn't parse, returns None
# can use this to unpack the various parts.
# first,second,rest = parse(...)
Prints:
('xx', 'yy', ['aa', 'bb', 'cc', 'dd'])
None