开发者

converting a treebank of vertical trees to s-expressions

I have a collection of parse trees, and they are in this ascii representation where indentation determines the structure (and closing brackets are implicit). I need to convert them to s-expressions so that parentheses determine the structure. It's a little bit like python's significant whitespace vs. braces. The input format is a vertical representation of trees, like so:

STA:fcl
=S:np
==DN:pron-dem("tia" <*> <Dem> <Du> <dem> DET P NOM)     Tiaj
==H:n("akuzo" <act> <sd> P NOM) akuzoj
=fA:adv("certe")        certe
=P:v-fin("dauxri" <va+TEMP> <mv> FUT VFIN)   开发者_运维问答   dauxros
.

Should become:

(STA:fcl (S:np (DN:pron-dem Tiaj) (H:n akuzoj)) (fA:adv certe) (P:v-fin dauxros) .)

I have code that almost does it, but not quite. There's always a missing paren somewhere; it's getting very frustrating. Should I use a proper parser, maybe a CFG? The current (messy) code is at http://github.com/andreasvc/eodop/blob/master/arbobanko.py


Focusing only on the example you're giving in this Q, and the Q's title about converting vertical trees to S-expressions, something like...:

import re
import sys

samp='''S
=NP
==(DT +def) the
== (N +ani) man
=VP
==V walks'''.splitlines()

relinelev = re.compile(r'(=*)(.*)')
reclean = re.compile(r'\s*\((\S+)[^)]*\)')

def clean(line):
  return reclean.sub(r'\1', line)

def reparse(tree=samp):
  stack = [-1]
  for line in tree:
    equals, rest = relinelev.match(line).groups()
    linelev = len(equals)
    while linelev < stack[-1]:
      sys.stdout.softspace = False
      print ')',
      curlev = stack.pop()
    if linelev == stack[-1]:
      sys.stdout.softspace = False
      print ')',
    else:
      stack.append(linelev)
    print '(%s' % clean(rest),
  while stack[-1] >= 0:
    sys.stdout.softspace = False
    print ')',
    stack.pop()
  print

reparse()

seems to work, and outputs

(S (NP (DT the) (N man)) (VP (V walks)))

I realize you're trying to do much more "cleaning" than I'm doing here, but that can be concentrated in the clean function, leaving reparse to deal with the Q's title. If you don't want to print as you go, but rather return the result as a string, the changes are of course quite minor:

def reparse(tree=samp):
  stack = [-1]
  result = []
  for line in tree:
    equals, rest = relinelev.match(line).groups()
    linelev = len(equals)
    while linelev < stack[-1]:
      result[-1] += ')'
      curlev = stack.pop()
    if linelev == stack[-1]:
      result[-1] += ')'
    else:
      stack.append(linelev)
    result.append('(%s' % clean(rest))
  while stack[-1] >= 0:
    result[-1] += ')'
    stack.pop()
  return ' '.join(result)
0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜