Elegant parsing of this? "a,b,c",d,"e,f"

2023-01-06 10:23 Q&A author:

I'm looking to parse these kinds of strings into lists in Python:

"a,b,c",d,"e,f"        =>  ['a','b','c'] , ['d'] , ['e','f']
"a,b,c",d,e            =>  ['a','b','c'] , ['d'] , ['e']
a,b,"c,d,e,f"          =>  ['a'],['b'],['c','d','e','f']
a,"b,c,d",{x(a,b,c-d)} =>  ['a'],['b','c','d'],[('x',['a'],['b'],['c-d'])]

It nests, so I suspect regular expressions are out. All I can think of is to start counting quotes and brackets to parse it, but that seems horribly inelegant. Or perhaps to first match quotes and replace commas between them with somechar, then split on commas, until all the nesting is done, and finally re-split on somechar.

Any thoughts?

So, here you are, your "honest python parser". Coding for you rather than answering the question, but I will be fine if you put it to use :-)

QUOTE = '"'
SEP = ',(){}"'
S_BRACKET = '{'
E_BRACKET = '}'
S_PAREN = '('

def parse_plain(string):
    counter = 0
    token = ""
    while counter<len(string):
        if string[counter] in SEP:
            counter += 1
            break
        token += string[counter]
        counter += 1
    return counter, token

def parse_bracket(string):
    counter = 1
    fwd, token = parse_plain(string[counter:])
    output = [token]
    counter += fwd
    fwd, token = parse_(string[counter:])
    output += token
    counter += fwd
    output = [tuple(output)]
    return counter, output

def parse_quote(string):
    counter = 1
    output = []
    while counter<len(string):
        if counter > 1 and string[counter - 1] == QUOTE:
            counter += 1
            break
        fwd, token = parse_plain(string[counter:])
        output.append(token)
        counter += fwd
    return counter, output

def parse_(string):
    output = []
    counter = 0
    while counter < len(string):
        if string[counter].isalpha():
            fwd, token = parse_plain(string[counter:])
            token = [token]
        elif string[counter] == QUOTE:
            fwd, token = parse_quote(string[counter:])
        elif string[counter] == S_BRACKET:
            fwd, token = parse_bracket(string[counter:])
        elif string[counter] == E_BRACKET:
            counter += 1
            break
        else:
            counter += 1
            continue
        output.append(token)
        counter += fwd
    return counter, output

def parse(string):
    return parse_(string)[1]

And testing the output:

>>> print(parse('''"a,b,c",d,"e,f"'''))
[['a', 'b', 'c'], ['d'], ['e', 'f']]
>>> print(parse('''"a,b,c",d,e '''))
[['a', 'b', 'c'], ['d'], ['e ']]
>>> print(parse('''a,b,"c,d,e,f"'''))
[['a'], ['b'], ['c', 'd', 'e', 'f']]
>>> print(parse('''a,"b,c,d",{x(a,b,c-d)}'''))
[['a'], ['b', 'c', 'd'], [('x', ['a'], ['b'], ['c-d'])]]
>>> print(parse('''{x(a,{y("b,c,d",e)})},z'''))
[[('x', ['a'], [('y', ['b', 'c', 'd'], ['e'])])], ['z']]
>>>

One method I use in PHP for things like that is to replace the deepest point of a nested expression (in this case, "{x(a,b,c-d)}") with a symbol, like '¶1', then save its parsed value (being [('x',['a'],['b'],['c-d'])]) to the variable $nest1.

You now have the original string 'a,"b,c,d",{x(a,b,c-d)}' looking like 'a,"b,c,d",¶1' which is parsed just like the first three. Then simply search the resultant array for anything that begins with '¶' and replace it with its associated variable.

This method supports as many levels as you want, just keep looping/recursing until all the symbols are gone. For example,

'a,"b,c,d",{x(a,b,{y(j,k,l-m)},c-d)}'
'a,"b,c,d",{x(a,b,¶1,c-d)}' and $nest1=[('y',['j'],['k'],['l-m'])]
'a,"b,c,d",¶2' and $nest2=[('x',['a'],['b'],['¶1'],['c-d'])]
['a'],['b','c','d'],['¶2']
['a'],['b','c','d'],[('x',['a'],['b'],['¶1'],['c-d'])]
['a'],['b','c','d'],[('x',['a'],['b'],[('y',['j'],['k'],['l-m'])],['c-d'])]

For safety, you can even escape any instance of ¶ that already occurs in the string before making the change, then unescape them as the last step, if you think it's necessary.

I don't know Python, so this might not work the same way as PHP. You may need to use an array instead of dynamic variables.
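For what it's worth, here is a rough Python sketch of the placeholder idea above, using a dict in place of the $nest1, $nest2 variables. The names parse_flat and parse_nested are made up for illustration, and it assumes the input never contains ¶ to begin with (no escaping step) and that every nested item has the exact shape {name(args)}. Since the innermost nests are always parsed first, it resolves each placeholder as soon as it meets one rather than in a final search-and-replace pass.

```python
import csv
import re

# Hypothetical helper names; this is a sketch, not a full parser.
# Matches a {name(args)} group that has no braces inside it, i.e. an innermost nest.
INNERMOST = re.compile(r"\{(\w+)\(([^{}]*)\)\}")
MARK = "\u00b6"  # the pilcrow used as the placeholder prefix (assumed absent from input)

def parse_flat(text, nests):
    """Parse a brace-free, comma-separated string, resolving placeholders via nests."""
    items = []
    for field in next(csv.reader([text])):
        if field in nests:
            items.append(nests[field])      # a placeholder: use its saved value
        else:
            items.append(field.split(","))  # bare or quoted field: split into a list
    return items

def parse_nested(text):
    nests = {}
    while True:
        match = INNERMOST.search(text)
        if match is None:
            break  # no braces left, the string is flat now
        name, args = match.groups()
        key = MARK + str(len(nests) + 1)
        # Save the parsed value, then swap the nest for its placeholder.
        nests[key] = [tuple([name] + parse_flat(args, nests))]
        text = text[:match.start()] + key + text[match.end():]
    return parse_flat(text, nests)

print(parse_nested('a,"b,c,d",{x(a,b,{y(j,k,l-m)},c-d)}'))
```

The csv module is only used here to split on commas that sit outside quotes; any other quote-aware splitter would do.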

Do you have quotes inside the strings?

If not, just replace the control characters to make it JSON-compatible and use a JSON parser.

For the first three cases, you can just recursively apply the CSV reader:

import csv

def expand( st ):
    if "," not in st:
        return st
    # next(...) works on Python 2.6+ and 3; the .next() method exists only on Python 2
    return [ expand( col ) for col in next( csv.reader( [ st ] ) ) ]

print( expand( '"a,b,c",d,"e,f"' ) )
print( expand( '"a,b,c",d,e' ) )
print( expand( 'a,b,"c,d,e,f"' ) )
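
Note that the recursive version returns a bare field such as d as the plain string 'd' rather than ['d']. If every field should come back as a list, as in the question's expected output, a two-step split seems to be enough for the flat cases. This is only a sketch: expand_fields is a made-up name, and it does not handle the {x(...)} case.

```python
import csv

def expand_fields(line):
    """Split on commas outside quotes, then split each field on its own commas."""
    fields = next(csv.reader([line]))  # e.g. ['a,b,c', 'd', 'e,f']
    return [field.split(",") for field in fields]

print(expand_fields('"a,b,c",d,"e,f"'))  # [['a', 'b', 'c'], ['d'], ['e', 'f']]
print(expand_fields('a,b,"c,d,e,f"'))    # [['a'], ['b'], ['c', 'd', 'e', 'f']]
```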
