converting a treebank of vertical trees to s-expressions
I have a collection of parse trees, and they are in this ascii representation where indentation determines the structure (and closing brackets are implicit). I need to convert them to s-expressions so that parentheses determine the structure. It's a little bit like python's significant whitespace vs. braces. The input format is a vertical representation of trees, like so:
STA:fcl
=S:np
==DN:pron-dem("tia" <*> <Dem> <Du> <dem> DET P NOM) Tiaj
==H:n("akuzo" <act> <sd> P NOM) akuzoj
=fA:adv("certe") certe
=P:v-fin("dauxri" <va+TEMP> <mv> FUT VFIN) 开发者_运维问答 dauxros
.
Should become:
(STA:fcl (S:np (DN:pron-dem Tiaj) (H:n akuzoj)) (fA:adv certe) (P:v-fin dauxros) .)
I have code that almost does it, but not quite. There's always a missing paren somewhere; it's getting very frustrating. Should I use a proper parser, maybe a CFG? The current (messy) code is at http://github.com/andreasvc/eodop/blob/master/arbobanko.py
Focusing only on the example you're giving in this Q, and the Q's title about converting vertical trees to S-expressions, something like...:
import re
import sys
samp='''S
=NP
==(DT +def) the
== (N +ani) man
=VP
==V walks'''.splitlines()
relinelev = re.compile(r'(=*)(.*)')
reclean = re.compile(r'\s*\((\S+)[^)]*\)')
def clean(line):
return reclean.sub(r'\1', line)
def reparse(tree=samp):
stack = [-1]
for line in tree:
equals, rest = relinelev.match(line).groups()
linelev = len(equals)
while linelev < stack[-1]:
sys.stdout.softspace = False
print ')',
curlev = stack.pop()
if linelev == stack[-1]:
sys.stdout.softspace = False
print ')',
else:
stack.append(linelev)
print '(%s' % clean(rest),
while stack[-1] >= 0:
sys.stdout.softspace = False
print ')',
stack.pop()
print
reparse()
seems to work, and outputs
(S (NP (DT the) (N man)) (VP (V walks)))
I realize you're trying to do much more "cleaning" than I'm doing here, but that can be concentrated in the clean
function, leaving reparse
to deal with the Q's title. If you don't want to print as you go, but rather return the result as a string, the changes are of course quite minor:
def reparse(tree=samp):
stack = [-1]
result = []
for line in tree:
equals, rest = relinelev.match(line).groups()
linelev = len(equals)
while linelev < stack[-1]:
result[-1] += ')'
curlev = stack.pop()
if linelev == stack[-1]:
result[-1] += ')'
else:
stack.append(linelev)
result.append('(%s' % clean(rest))
while stack[-1] >= 0:
result[-1] += ')'
stack.pop()
return ' '.join(result)
精彩评论