Is POS tagging deterministic?
I have been trying to wrap my head around why this is happening but am hoping someone can shed some light on this. I am trying to tag the following text:
ae0.475 X mod
ae0.842 X mod
ae0.842 X mod
ae0.775 X mod
using the following code:
import nltk

# Read one whitespace-separated token sequence per line and tag it
with open("test", "r") as file:
    for line in file:
        words = line.strip().split(' ')
        words = [word.strip() for word in words if word != '']
        tags = nltk.pos_tag(words)
        pos = [tag for (word, tag) in tags]  # keep just the tag from each (word, tag) pair
        key = ' '.join(pos)
        print(words, ":", key)
and am getting the following result:
['ae0.475', 'X', 'mod'] : NN NNP NN
['ae0.842', 'X', 'mod'] : -NONE- NNP NN
['ae0.842', 'X', 'mod'] : -NONE- NNP NN
['ae0.775', 'X', 'mod'] : NN NNP NN
And I don't get it. Does anyone know the reason for this inconsistency? I'm not too particular about tagging accuracy, since I'm only trying to extract some templates, but the tagger seems to assign different tags, at different instances, to words that look "almost" the same.
As a workaround, I replaced every digit with 1, which solved the problem:
['ae1.111', 'X', 'mod'] : NN NNP NN
['ae1.111', 'X', 'mod'] : NN NNP NN
['ae1.111', 'X', 'mod'] : NN NNP NN
['ae1.111', 'X', 'mod'] : NN NNP NN
but am curious why it tagged the instance with different tags in my first case. Any suggestions?
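The digit-replacement workaround above can be sketched as a small normalisation step run on each token before tagging (normalize_numbers is a hypothetical helper name, not part of nltk):

```python
import re

def normalize_numbers(word):
    # Replace every digit with '1' so near-identical tokens
    # map to the same string before tagging
    return re.sub(r'\d', '1', word)

print(normalize_numbers('ae0.842'))  # -> ae1.111
```

Applying this to every word in the line keeps the tagger from seeing each distinct number as a distinct unknown token.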
My best effort to understand uncovered this from someone not using the whole Brown corpus:

"Note that words that the tagger has not seen before, such as decried, receive a tag of None."

So, I guess something that looks like ae1.111 must appear in the corpus file, but nothing like ae0.842. That's kind of weird, but that's the reasoning for giving the -NONE- tag.
Edit: I got super-curious, downloaded the Brown corpus myself, and plain-text-searched inside it. The number 111 appears in it 34 times, while the number 842 appears only 4 times. 842 appears only in the middle of dollar amounts or as the last 3 digits of a year, whereas 111 appears many times on its own as a page number. 775 also appears once as a page number.
So, I'm going to make a conjecture, that because of Benford's Law, you will end up matching numbers that start with 1s, 2s, and 3s much more often than numbers that start with 8s or 9s, since these are more often the page numbers of a random page that would be cited in a book. I'd be really interested in finding out if that's true (but not interested enough to do it myself, of course!).
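For what it's worth, Benford's Law predicts a leading digit d with probability log10(1 + 1/d), so leading 1s-3s should indeed dominate leading 8s and 9s; a quick check of those proportions (pure math, no corpus needed):

```python
import math

# Benford's Law: P(leading digit = d) = log10(1 + 1/d)
benford = {d: math.log10(1 + 1.0 / d) for d in range(1, 10)}

low = benford[1] + benford[2] + benford[3]   # digits 1-3
high = benford[8] + benford[9]               # digits 8-9
print(round(low, 3), round(high, 3))  # roughly 0.602 vs 0.097
```

Of course, this only predicts what a large collection of naturally occurring numbers should look like; whether the Brown corpus's page-number citations actually follow it is the conjecture above.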
It is "deterministic" in the sense that the same sentence will be tagged the same way by the same algorithm every time. But since your words aren't in nltk's data (in fact, they aren't even real words in real sentences), the tagger falls back on heuristics to infer what the tags should be. That means the tagging can change when the words change, even if the change is just a different number as in your case, and the tags won't make much sense anyway.
Which makes me wonder why you're trying to use NLP for non-natural language constructs.