Elegant parsing of this? "a,b,c",d,"e,f"
I'm looking to parse these kinds of strings into lists in Python:
"a,b,c",d,"e,f" => ['a','b','c'] , ['d'] , ['e','f']
"a,b,c",d,e => ['a','b','c'] , ['d'] , ['e']
a,b,"c,d,e,f" => ['a'],['b'],['c','d','e','f']
a,"b,c,d",{x(a,b,c-d)} => ['a'],['b','c','d'],[('x',['a'],['b'],['c-d'])]
It nests, so I suspect regular expressions are out. All I can think of is to start counting quotes and brackets to parse it, but that seems horribly inelegant. Another idea is to first match the quoted parts and replace the commas inside them with some placeholder character, then split on commas, repeating until all the nesting is handled, and finally re-split on the placeholder.
Any thoughts?
So, here you are, your "honest python parser". Coding for you rather than answering the question, but I will be fine if you put it to use :-)
QUOTE = '"'
SEP = ',(){}"'
S_BRACKET = '{'
E_BRACKET = '}'
S_PAREN = '('

def parse_plain(string):
    # Read a bare token up to the next separator, which is consumed as well.
    # Returns (number of characters consumed, token).
    counter = 0
    token = ""
    while counter < len(string):
        if string[counter] in SEP:
            counter += 1
            break
        token += string[counter]
        counter += 1
    return counter, token

def parse_bracket(string):
    # Parse '{name(args)}' into a one-element list holding a (name, *args) tuple.
    counter = 1  # skip the opening bracket
    fwd, token = parse_plain(string[counter:])
    output = [token]
    counter += fwd
    fwd, token = parse_(string[counter:])
    output += token
    counter += fwd
    output = [tuple(output)]
    return counter, output

def parse_quote(string):
    # Parse a quoted, comma-separated run into a flat list of tokens.
    counter = 1  # skip the opening quote
    output = []
    while counter < len(string):
        # The previous token ended at the closing quote:
        # skip the character that follows it and stop.
        if counter > 1 and string[counter - 1] == QUOTE:
            counter += 1
            break
        fwd, token = parse_plain(string[counter:])
        output.append(token)
        counter += fwd
    return counter, output

def parse_(string):
    # Parse a sequence of fields until the end of the string or a closing bracket.
    output = []
    counter = 0
    while counter < len(string):
        if string[counter].isalpha():
            fwd, token = parse_plain(string[counter:])
            token = [token]
        elif string[counter] == QUOTE:
            fwd, token = parse_quote(string[counter:])
        elif string[counter] == S_BRACKET:
            fwd, token = parse_bracket(string[counter:])
        elif string[counter] == E_BRACKET:
            counter += 1
            break
        else:
            counter += 1
            continue
        output.append(token)
        counter += fwd
    return counter, output

def parse(string):
    return parse_(string)[1]
And testing the output:
>>> print(parse('''"a,b,c",d,"e,f"'''))
[['a', 'b', 'c'], ['d'], ['e', 'f']]
>>> print(parse('''"a,b,c",d,e '''))
[['a', 'b', 'c'], ['d'], ['e ']]
>>> print(parse('''a,b,"c,d,e,f"'''))
[['a'], ['b'], ['c', 'd', 'e', 'f']]
>>> print(parse('''a,"b,c,d",{x(a,b,c-d)}'''))
[['a'], ['b', 'c', 'd'], [('x', ['a'], ['b'], ['c-d'])]]
>>> print(parse('''{x(a,{y("b,c,d",e)})},z'''))
[[('x', ['a'], [('y', ['b', 'c', 'd'], ['e'])])], ['z']]
>>>
One method I use in PHP for things like that is to replace the deepest point of a nested expression (in this case, "{x(a,b,c-d)}") with a symbol, like '¶1', then save its parsed value (being [('x',['a'],['b'],['c-d'])]) to the variable $nest1.
You now have the original string 'a,"b,c,d",{x(a,b,c-d)}' looking like 'a,"b,c,d",¶1' which is parsed just like the first three. Then simply search the resultant array for anything that begins with '¶' and replace it with its associated variable.
This method supports as many levels as you want: just keep looping or recursing until all the symbols are gone. For example,
'a,"b,c,d",{x(a,b,{y(j,k,l-m)},c-d)}'
'a,"b,c,d",{x(a,b,¶1,c-d)}' and $nest1=[('y',['j'],['k'],['l-m'])]
'a,"b,c,d",¶2' and $nest2=[('x',['a'],['b'],['¶1'],['c-d'])]
['a'],['b','c','d'],['¶2']
['a'],['b','c','d'],[('x',['a'],['b'],['¶1'],['c-d'])]
['a'],['b','c','d'],[('x',['a'],['b'],[('y',['j'],['k'],['l-m'])],['c-d'])]
For safety, you can even escape any instance of ¶ that might already occur in the string before making the change, then unescape them as the last step, if you think it's necessary.
I don't know Python, so this might not work the same way as PHP. You may need to use an array instead of dynamic variables.
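For what it's worth, here is a rough Python sketch of the placeholder idea described above. It is only an illustration under some assumptions: the names parse_flat, resolve, parse_nested, MARK and INNERMOST are made up, a dict stands in for the dynamic $nest variables, ¶ is assumed not to occur in the input, and nested expressions are assumed to always look like {name(args)} with no brackets inside quoted parts.

```python
import csv
import re

MARK = "¶"  # placeholder prefix; assumed not to occur in the input
# {name(args)} whose args contain no further brackets, i.e. an innermost expression
INNERMOST = re.compile(r"\{(\w+)\(([^{}()]*)\)\}")

def parse_flat(text):
    """Parse a bracket-free line; every field becomes a list of its comma-separated parts."""
    return [field.split(",") for field in next(csv.reader([text]))]

def resolve(fields, nests):
    """Swap every ['¶N'] field for its stored value, recursing into the stored arguments."""
    resolved = []
    for field in fields:
        if len(field) == 1 and field[0] in nests:
            name, *args = nests[field[0]][0]
            resolved.append([(name, *resolve(args, nests))])
        else:
            resolved.append(field)
    return resolved

def parse_nested(text):
    nests = {}  # plays the role of $nest1, $nest2, ...

    def stash(match):
        # Parse one innermost expression, store it, and return its placeholder.
        key = MARK + str(len(nests) + 1)
        nests[key] = [(match.group(1), *parse_flat(match.group(2)))]
        return key

    # Keep replacing innermost expressions until no brackets are left.
    while True:
        text, count = INNERMOST.subn(stash, text)
        if count == 0:
            break
    return resolve(parse_flat(text), nests)

print(parse_nested('a,"b,c,d",{x(a,b,{y(j,k,l-m)},c-d)}'))
```

Run on the example string above, this should walk through the same intermediate steps: {y(j,k,l-m)} becomes ¶1, the outer expression becomes ¶2, and the placeholders are filled back in at the end.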
Do you have quotes inside the strings?
If not, just replace the control characters to make it compatible with JSON and use a JSON parser.
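One possible reading of this suggestion, sketched for the flat cases only: put quotes around every bare field, wrap the line in square brackets, and hand it to json.loads. The names FIELD and parse_via_json are hypothetical, the sketch assumes there are no quotes or backslashes inside the quoted parts, and it does not attempt the {x(...)} form, since how that should map to JSON is not spelled out here.

```python
import json
import re

# A field is either a double-quoted run or a run of non-comma characters.
FIELD = re.compile(r'"[^"]*"|[^,]+')

def parse_via_json(text):
    """Quote bare fields, wrap the line in [ ], and let json.loads split it."""
    def quote(match):
        field = match.group(0)
        # Already-quoted fields are kept as they are; bare ones become JSON strings.
        return field if field.startswith('"') else json.dumps(field)

    as_json = "[" + FIELD.sub(quote, text) + "]"
    # Each top-level field is now one JSON string; split it on its inner commas.
    return [field.split(",") for field in json.loads(as_json)]

print(parse_via_json('"a,b,c",d,"e,f"'))
```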
For the first three cases, you can just recursively apply the CSV reader:
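import csv

def expand(st):
    # A field without a comma is a leaf: return it unchanged
    if "," not in st:
        return st
    # Parse st as a single CSV row, then expand each column recursively
    return [expand(col) for col in next(csv.reader([st]))]

print(expand('"a,b,c",d,"e,f"'))
print(expand('"a,b,c",d,e'))
print(expand('a,b,"c,d,e,f"'))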
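Note that expand returns a single field such as d as a plain string rather than as ['d']. If every field should come back as a list, as in the question, a non-recursive variant might look like the sketch below. The name split_fields is made up, and it still only covers the first three cases.

```python
import csv

def split_fields(line):
    """Split a line into top-level CSV fields, then split each field on commas."""
    top_level = next(csv.reader([line]))
    return [field.split(",") for field in top_level]

print(split_fields('"a,b,c",d,"e,f"'))
print(split_fields('"a,b,c",d,e'))
print(split_fields('a,b,"c,d,e,f"'))
```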