Is POS tagging deterministic?
I have been trying to wrap my head around why this is happening but am hoping someone can shed some light on this. I am trying to tag the following text:
ae0.475 X mod
ae0.842 X mod
ae0.842 X mod
ae0.775 X mod
using the following code:
import nltk

# Read one whitespace-separated token sequence per line and tag it
with open("test", "r") as file:
    for line in file:
        words = line.strip().split(' ')
        words = [word.strip() for word in words if word != '']
        tags = nltk.pos_tag(words)
        pos = [tag for (word, tag) in tags]  # keep just the tag from each (word, tag) pair
        key = ' '.join(pos)
        print(words, ":", key)
and am getting the following result:
['ae0.475', 'X', 'mod'] : NN NNP NN
['ae0.842', 'X', 'mod'] : -NONE- NNP NN
['ae0.842', 'X', 'mod'] : -NONE- NNP NN
['ae0.775', 'X', 'mod'] : NN NNP NN
And I don't get it. Does anyone know the reason for this inconsistency? I'm not too particular about tagging accuracy, since I'm only trying to extract some templates, but the tagger seems to assign different tags, at different instances, to words that look "almost" the same.
As a workaround, I replaced every digit with 1, which solved the problem:
['ae1.111', 'X', 'mod'] : NN NNP NN
['ae1.111', 'X', 'mod'] : NN NNP NN
['ae1.111', 'X', 'mod'] : NN NNP NN
['ae1.111', 'X', 'mod'] : NN NNP NN
but am curious why it tagged the instance with different tags in my first case. Any suggestions?
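The digit-replacement workaround above can be sketched as a small normalisation step run on each token before tagging (normalize_numbers is a hypothetical helper name, not part of nltk):

```python
import re

def normalize_numbers(word):
    # Replace every digit with '1' so near-identical tokens
    # map to the same string before tagging
    return re.sub(r'\d', '1', word)

print(normalize_numbers('ae0.842'))  # -> ae1.111
```

Applying this to every word in the line keeps the tagger from seeing each distinct number as a distinct unknown token.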
My best effort to understand uncovered this from someone not using the whole Brown corpus:

"Note that words that the tagger has not seen before, such as decried, receive a tag of None."

So, I guess something that looks like ae1.111 must appear in the corpus file, but nothing like ae0.842. That's kind of weird, but that's the reasoning for giving the -NONE- tag.
Edit: I got super-curious, downloaded the Brown corpus myself, and plain-text-searched inside it. The number 111 appears in it 34 times, while the number 842 appears only 4 times. 842 appears only in the middle of dollar amounts or as the last 3 digits of a year, whereas 111 appears many times on its own as a page number. 775 also appears once as a page number.
So, I'm going to make a conjecture, that because of Benford's Law, you will end up matching numbers that start with 1s, 2s, and 3s much more often than numbers that start with 8s or 9s, since these are more often the page numbers of a random page that would be cited in a book. I'd be really interested in finding out if that's true (but not interested enough to do it myself, of course!).
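For what it's worth, Benford's Law predicts a leading digit d with probability log10(1 + 1/d), so leading 1s-3s should indeed dominate leading 8s and 9s; a quick check of those proportions (pure math, no corpus needed):

```python
import math

# Benford's Law: P(leading digit = d) = log10(1 + 1/d)
benford = {d: math.log10(1 + 1.0 / d) for d in range(1, 10)}

low = benford[1] + benford[2] + benford[3]   # digits 1-3
high = benford[8] + benford[9]               # digits 8-9
print(round(low, 3), round(high, 3))  # roughly 0.602 vs 0.097
```

Of course, this only predicts what a large collection of naturally occurring numbers should look like; whether the Brown corpus's page-number citations actually follow it is the conjecture above.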
It is "deterministic" in the sense that the same sentence will be tagged the same way by the same algorithm every time. But since your words aren't in nltk's data (in fact, they aren't even real words in real sentences), the tagger falls back on heuristics to infer what the tags should be. That means the tagging can change when the words change, even if the change is just a different number as in your case, and the tags won't make much sense anyway.
Which makes me wonder why you're trying to use NLP for non-natural language constructs.