How to build a IMS open source corpus workbench and NLTK readable corpus?
Currently i've a bunch of .txtfiles. within each .txt files, each sentence is separated by newline. how do i change it to the IMS CWB format so that it's readable by CWB? and also to nltk format.
Can someone lead me to a howto page to do that? or is there a guide page to do that, i've tried reading through the manual but i dont really know. www.cwb.sourceforge.net/files/CWB_Encoding_Tutorial.pdf
Does it mean i create a data and registry directory and then i run the cwb-encode command and it will be all con开发者_如何学编程verted to vrt file? does it convert one file at a time? how do i script it to run through multiple file in a directory?
It's easy to produce cwb's "verticalized" format from an NLTK-readable corpus:
from nltk.corpus import brown
out = open('corpus.vrt','w')
for sentence in nltk.brown.sents():
print >>out,'<s>'
for word in sentence:
print >>out,word
print >>out,'</s>'
out.close()
From there, you can follow the instructions on the CWB website.
精彩评论