Stanford NER toolkit - lowercase entity recognition

I am a newbie to NLP and trying to figure out how a Named Entity Recognizer annotates named entities. I am experimenting with the Stanford NER toolkit. When I use the NER on standard, more formal datasets where naming conventions for named entities are followed, such as newswire or news blogs, the NER annotates the entities correctly. However, when I run the NER on informal datasets such as Twitter, where named entities might not be capitalized as they should be, the NER does not annotate the entities. The classifier I am using is the 3-class serialized CRF classifier. Can anybody let me know how I can make the NER recognize lowercase entities too? Any useful suggestions on how to hack the NER and where this improvement should be made are greatly appreciated. Thanks in advance for all your help.


I know it is an old thread, but I hope this helps someone. As Christopher Manning has replied, the way to get lowercase entities detected is to replace english.muc.7class.distsim.crf.ser.gz with english.muc.7class.caseless.distsim.crf.ser.gz, which you can get by unzipping the CoreNLP caseless models jar file.

For example, in my Python file I have kept everything the same except pointing to the new model file, and it works perfectly (well, most of the time):

st = NERTagger('/Users/username/stanford-corenlp-python/stanford-ner-2014-10-26/classifiers/english.muc.7class.caseless.distsim.crf.ser.gz', '/Users/username/stanford-corenlp-python/stanford-ner-2014-10-26/stanford-ner.jar')
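For reference, here is a minimal, self-contained version of that call. Note that newer NLTK releases rename the class to StanfordNERTagger (importable from nltk.tag); older releases use NERTagger from nltk.tag.stanford, and the paths below are placeholders for your own installation:

# Tag lowercase text with the caseless model via NLTK's Stanford interface.
# In newer NLTK versions use: from nltk.tag import StanfordNERTagger
from nltk.tag.stanford import NERTagger

st = NERTagger(
    '/Users/username/stanford-corenlp-python/stanford-ner-2014-10-26/classifiers/english.muc.7class.caseless.distsim.crf.ser.gz',
    '/Users/username/stanford-corenlp-python/stanford-ner-2014-10-26/stanford-ner.jar')

# tag() takes a list of tokens and returns (token, label) pairs.
print(st.tag('barack obama was born in hawaii'.split()))
# e.g. [('barack', 'PERSON'), ('obama', 'PERSON'), ..., ('hawaii', 'LOCATION')]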


I'm afraid there isn't an easy way to get the trained models we distribute to ignore case information at runtime. So, yes, they'll usually only label capitalized names. It would be possible to train a caseless model, which would work reasonably well, but not as well as a cased model on cased text, since case is a big clue in English (though not in German, Chinese, Arabic, etc.).


Along with other people's suggestions: if you're using a feature-based classifier, I would definitely add the 100-200 most common 3-4 letter substrings found in people's names, or make a gazetteer available as one recognized feature. There are certain patterns that are bound to show up quite a bit in personal names but don't show up very often in other types of words, like "eli."
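As a rough illustration of that idea, here is a short sketch that collects the most frequent 3-4 letter substrings from a list of personal names. The names.txt file and the top-200 cutoff are just assumptions for the example, not anything built into Stanford NER:

# Count the most common 3-4 letter substrings across a list of personal names.
# 'names.txt' (one name per line) and the 200 cutoff are placeholders.
from collections import Counter

counts = Counter()
with open('names.txt') as f:
    for line in f:
        name = line.strip().lower()
        for n in (3, 4):
            for i in range(len(name) - n + 1):
                counts[name[i:i + n]] += 1

common_substrings = [s for s, _ in counts.most_common(200)]
print(common_substrings[:20])

Those substrings (or a name gazetteer) could then be exposed as binary features to whatever classifier you train.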


I think Twitter is going to be very difficult for this application. Capital letters are a big clue which, as you say, are often missing on Twitter. A dictionary check to remove valid English words is of limited use because Twitter texts include a huge number of abbreviations and they're often unique.

Perhaps part-of-speech tagging and frequency analysis could both be used to help improve detection of proper nouns?
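For example, here is a very rough sketch of that combination, using NLTK's POS tagger plus a dictionary check as a stand-in for frequency analysis. The heuristic is only an illustration, and it requires the averaged_perceptron_tagger and words data from nltk.download():

# Flag candidate proper nouns in lowercased text: nouns that are not ordinary
# dictionary words. Requires nltk.download('averaged_perceptron_tagger') and
# nltk.download('words'); the heuristic itself is only a starting point.
import nltk
from nltk.corpus import words

english_vocab = set(w.lower() for w in words.words())

def candidate_proper_nouns(text):
    tagged = nltk.pos_tag(text.lower().split())
    return [token for token, tag in tagged
            if tag.startswith('NN') and token.isalpha()
            and token not in english_vocab]

print(candidate_proper_nouns('met barack obama at starbucks today'))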


The question is a bit old, but somebody else may be able to benefit from this idea.

One way to potentially train a classifier for lowercase text would be to run the cased classifier that you already have against a large dataset of properly cased English, then process that tagged text to remove case. You then have a tagged corpus that you can use to train a new classifier. This new classifier won't be perfect on Twitter because of the peculiarities of tweets, but it's a quick way to bootstrap one.
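Here is a rough sketch of that bootstrapping step, assuming NLTK's Stanford interface for the tagging pass and Stanford NER's usual tab-separated token/label training format. The file names, model path, and train.prop properties file are placeholders:

# Tag properly cased text with the standard (cased) model, then write out a
# lowercased, tab-separated token/label file to train a new CRF model on.
# The class name varies by NLTK version (StanfordNERTagger vs NERTagger).
from nltk.tag.stanford import StanfordNERTagger

cased_tagger = StanfordNERTagger(
    'classifiers/english.all.3class.distsim.crf.ser.gz',
    'stanford-ner.jar')

with open('newswire.txt') as src, open('caseless_train.tsv', 'w') as out:
    for sentence in src:
        tokens = sentence.split()
        if not tokens:
            continue
        for token, label in cased_tagger.tag(tokens):
            # Keep the label predicted on cased text, but drop the case.
            out.write('{0}\t{1}\n'.format(token.lower(), label))
        out.write('\n')  # blank line between sentences

The resulting caseless_train.tsv can then be referenced as the trainFile in a CRFClassifier properties file and trained with something like: java -cp stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier -prop train.prop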
