custom tagging with nltk
I'm trying to create a small english-like language for specifying tasks. The basic idea开发者_如何学JAVA is to split a statement into verbs and noun-phrases that those verbs should apply to. I'm working with nltk but not getting the results i'd hoped for, eg:
>>> nltk.pos_tag(nltk.word_tokenize("select the files and copy to harddrive'"))
[('select', 'NN'), ('the', 'DT'), ('files', 'NNS'), ('and', 'CC'), ('copy', 'VB'), ('to', 'TO'), ("harddrive'", 'NNP')]
>>> nltk.pos_tag(nltk.word_tokenize("move the files to harddrive'"))
[('move', 'NN'), ('the', 'DT'), ('files', 'NNS'), ('to', 'TO'), ("harddrive'", 'NNP')]
>>> nltk.pos_tag(nltk.word_tokenize("copy the files to harddrive'"))
[('copy', 'NN'), ('the', 'DT'), ('files', 'NNS'), ('to', 'TO'), ("harddrive'", 'NNP')]
In each case it has failed to realise the first word (select, move and copy) were intended as verbs. I know I can create custom taggers and grammars to work around this but at the same time I'm hesitant to go reinventing the wheel when a lot of this stuff is out of my league. I particularly would prefer a solution where non-English languages could be handled as well.
So anyway, my question is one of: Is there a better tagger for this type of grammar? Is there a way to weight an existing tagger towards using the verb form more frequently than the noun form? Is there a way to train a tagger? Is there a better way altogether?
One solution is to create a manual UnigramTagger that backs off to the NLTK tagger. Something like this:
>>> import nltk.tag, nltk.data
>>> default_tagger = nltk.data.load(nltk.tag._POS_TAGGER)
>>> model = {'select': 'VB'}
>>> tagger = nltk.tag.UnigramTagger(model=model, backoff=default_tagger)
Then you get
>>> tagger.tag(['select', 'the', 'files'])
[('select', 'VB'), ('the', 'DT'), ('files', 'NNS')]
This same method can work for non-english languages, as long as you have an appropriate default tagger. You can train your own taggers using train_tagger.py
from nltk-trainer and an appropriate corpus.
Jacob's answer is spot on. However, to expand upon it, you may find you need more than just unigrams.
For example, consider the three sentences:
select the files
use the select function on the sockets
the select was good
Here, the word "select" is being used as a verb, adjective, and noun respectively. A unigram tagger won't be able to model this. Even a bigram tagger can't handle it, because two of the cases share the same preceding word (i.e. "the"). You'd need a trigram tagger to handle this case correctly.
import nltk.tag, nltk.data
from nltk import word_tokenize
default_tagger = nltk.data.load(nltk.tag._POS_TAGGER)
def evaluate(tagger, sentences):
good,total = 0,0.
for sentence,func in sentences:
tags = tagger.tag(nltk.word_tokenize(sentence))
print tags
good += func(tags)
total += 1
print 'Accuracy:',good/total
sentences = [
('select the files', lambda tags: ('select', 'VB') in tags),
('use the select function on the sockets', lambda tags: ('select', 'JJ') in tags and ('use', 'VB') in tags),
('the select was good', lambda tags: ('select', 'NN') in tags),
]
train_sents = [
[('select', 'VB'), ('the', 'DT'), ('files', 'NNS')],
[('use', 'VB'), ('the', 'DT'), ('select', 'JJ'), ('function', 'NN'), ('on', 'IN'), ('the', 'DT'), ('sockets', 'NNS')],
[('the', 'DT'), ('select', 'NN'), ('files', 'NNS')],
]
tagger = nltk.TrigramTagger(train_sents, backoff=default_tagger)
evaluate(tagger, sentences)
#model = tagger._context_to_tag
Note, you can use NLTK's NgramTagger to train a tagger using an arbitrarily high number of n-grams, but typically you don't get much performance increase after trigrams.
See Jacob's answer.
In later versions (at least nltk 3.2) nltk.tag._POS_TAGGER
does not exist. The default taggers are usually downloaded into the nltk_data/taggers/ directory, e.g.:
>>> import nltk
>>> nltk.download('maxent_treebank_pos_tagger')
Usage is as follows.
>>> import nltk.tag, nltk.data
>>> tagger_path = '/path/to/nltk_data/taggers/maxent_treebank_pos_tagger/english.pickle'
>>> default_tagger = nltk.data.load(tagger_path)
>>> model = {'select': 'VB'}
>>> tagger = nltk.tag.UnigramTagger(model=model, backoff=default_tagger)
See also: How to do POS tagging using the NLTK POS tagger in Python.
Bud's answer is correct. Also, according to this link,
if your nltk_data packages were correctly installed, then NLTK knows where they are on your system, and you don't need to pass an absolute path.
Meaning, you can just say
tagger_path = '/path/to/nltk_data/taggers/maxent_treebank_pos_tagger/english.pickle'
default_tagger = nltk.data.load(tagger_path)
精彩评论