How do I get a set of grammar rules from Penn Treebank using python & NLTK?

2023-03-28 13:16 问答作者：

I'm fairly new to NLTK and Python. I've been creating sentence parses using the toy grammars given in the examples but I 开发者_如何学Cwould like to know if it's possible to use a grammar learned from a portion of the Penn Treebank, say, as opposed to just writing my own or using the toy grammars? (I'm using Python 2.7 on Mac) Many thanks

If you want a grammar that precisely captures the Penn Treebank sample that comes with NLTK, you can do this, assuming you've downloaded the Treebank data for NLTK (see comment below):

import nltk
from nltk.corpus import treebank
from nltk.grammar import ContextFreeGrammar, Nonterminal

tbank_productions = set(production for sent in treebank.parsed_sents()
                        for production in sent.productions())
tbank_grammar = ContextFreeGrammar(Nonterminal('S'), list(tbank_productions))

This will probably not, however, give you something useful. Since NLTK only supports parsing with grammars with all the terminals specified, you will only be able to parse sentences containing words in the Treebank sample.

Also, because of the flat structure of many phrases in the Treebank, this grammar will generalize very poorly to sentences that weren't included in training. This is why NLP applications that have tried to parse the treebank have not used an approach of learning CFG rules from the Treebank. The closest technique to that would be the Ren Bods Data Oriented Parsing approach, but it is much more sophisticated.

Finally, this will be so unbelievably slow it's useless. So if you want to see this approach in action on the grammar from a single sentence just to prove that it works, try the following code (after the imports above):

mini_grammar = ContextFreeGrammar(Nonterminal('S'),
                                  treebank.parsed_sents()[0].productions())
parser = nltk.parse.EarleyChartParser(mini_grammar)
print parser.parse(treebank.sents()[0])

It is possible to train a Chunker on the treebank_chunk or conll2000 corpora. You don't get a grammar out of it, but you do get a pickle-able object that can parse phrase chunks. See How to Train a NLTK Chunker, Chunk Extraction with NLTK, and NLTK Classified Based Chunker Accuracy.

继续阅读：grammar nltk parsing python tagged-corpus

How do I get a set of grammar rules from Penn Treebank using python & NLTK?

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？