Stanford NER toolkit - lowercase entity recognition
I am a newbie to NLP and trying to figure out how a Named Entity Recognizer annotates named entities. I am experimenting with the Stanford NER toolkit. When I use the NER on standard, more formal datasets where naming conventions are followed, such as newswire or news blogs, it annotates the entities correctly. However, when I run it on informal datasets such as Twitter, where named entities may not be capitalized as they should be, the NER does not annotate them. The classifier I am using is the serialized 3-class CRF classifier. Can anybody let me know how I can make the NER recognize lowercase entities too? Any suggestions on how to modify the NER and where this improvement should be made are greatly appreciated. Thanks in advance for all your help.
I know it is an old thread, but I hope this helps someone. As Christopher Manning replied, the way to get lowercase entities detected is to replace english.muc.7class.distsim.crf.ser.gz with english.muc.7class.caseless.distsim.crf.ser.gz, which you can get by unzipping the CoreNLP caseless models jar file.
For example, in my Python file I kept everything the same except for switching to the new classifier file, and it works perfectly (well, most of the time):
st = NERTagger('/Users/username/stanford-corenlp-python/stanford-ner-2014-10-26/classifiers/english.muc.7class.caseless.distsim.crf.ser.gz', '/Users/username/stanford-corenlp-python/stanford-ner-2014-10-26/stanford-ner.jar')
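Here is a slightly fuller sketch of the same setup, assuming a recent NLTK where the wrapper class is called StanfordNERTagger (older versions call it NERTagger, as above); the paths are from my installation, so adjust them to yours:

from nltk.tag.stanford import StanfordNERTagger

st = StanfordNERTagger(
    '/Users/username/stanford-corenlp-python/stanford-ner-2014-10-26/classifiers/english.muc.7class.caseless.distsim.crf.ser.gz',
    '/Users/username/stanford-corenlp-python/stanford-ner-2014-10-26/stanford-ner.jar')

# With the caseless model, entities are labeled even without capitalization.
print(st.tag('barack obama was born in hawaii'.split()))
# e.g. [('barack', 'PERSON'), ('obama', 'PERSON'), ('was', 'O'), ...]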
I'm afraid there isn't an easy way to get the trained models we distribute to ignore case information at runtime. So, yes, they'll usually only label capitalized names. It would be possible to train a caseless model, which would work reasonably well, though not as well as a cased model on cased text, since case is a big clue in English (but not in German, Chinese, Arabic, etc.).
Along with other people's suggestions: if you're using a feature-based classifier, I would definitely add the 100-200 most common 3-4 letter substrings in people's names as features, or build them into a gazetteer exposed as a single feature (a rough sketch of collecting such substrings is below). There are certain patterns that are bound to show up quite a bit in personal names but not very often in other types of words, like "eli".
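A rough sketch of how you might collect those substrings in Python, assuming you have a list of first names in a file called names.txt (one per line); the filename and the cut-off of 200 are just placeholders:

from collections import Counter

def substrings(word, lengths=(3, 4)):
    # All 3- and 4-letter substrings of a lowercased name.
    word = word.lower()
    return [word[i:i + n] for n in lengths for i in range(len(word) - n + 1)]

counts = Counter()
with open('names.txt') as f:
    for line in f:
        name = line.strip()
        if name:
            counts.update(substrings(name))

# The most frequent substrings could then be added as gazetteer-style features.
common = [s for s, _ in counts.most_common(200)]
print(common[:20])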
I think Twitter is going to be very difficult for this application. Capital letters are a big clue which, as you say, are often missing on Twitter. A dictionary check to remove valid English words is of limited use because Twitter texts include a huge number of abbreviations and they're often unique.
Perhaps part-of-speech tagging and frequency analysis could both be used to help improve detection of proper nouns?
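A minimal sketch of the POS-tagging idea using NLTK's default tagger, treating NNP/NNPS tokens as candidate proper nouns; note that the default tagger also leans on capitalization, so on all-lowercase tweets it would need retraining or combining with other cues:

import nltk
# First run may require nltk.download('punkt') and nltk.download('averaged_perceptron_tagger').

tokens = nltk.word_tokenize("I met Barack Obama in New York yesterday")
tagged = nltk.pos_tag(tokens)
candidates = [word for word, tag in tagged if tag in ('NNP', 'NNPS')]
print(candidates)  # e.g. ['Barack', 'Obama', 'New', 'York']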
The question is a bit old, but somebody else may be able to benefit from this idea.
One way to potentially train a classifier for lower case would be to run the upper case classifier that you already have against a large data set of proper English, then process that tagged text to remove case. Then you have a tagged corpus that you can use to train a new classifier. This new classifier won't be perfect against Twitter because of the peculiarities of tweets, but it's a quick way to bootstrap it.
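A sketch of that bootstrapping step, assuming NLTK's StanfordNERTagger wrapper and a two-column (token, label) tab-separated output, which is the sort of format Stanford NER training can consume; the paths, file names, and toy sentences are placeholders:

from nltk.tag.stanford import StanfordNERTagger

# The existing cased model (adjust paths to your installation).
st = StanfordNERTagger(
    '/path/to/classifiers/english.muc.7class.distsim.crf.ser.gz',
    '/path/to/stanford-ner.jar')

# Well-formed, properly cased sentences, already tokenized.
sentences = [
    "Barack Obama visited New York last week .".split(),
    "Stanford University is in California .".split(),
]

with open('caseless_train.tsv', 'w') as out:
    for sent in sentences:
        for token, label in st.tag(sent):
            # Keep the label, drop the case information.
            out.write(token.lower() + "\t" + label + "\n")
        out.write("\n")  # blank line between sentences

The resulting file can then be used as the training corpus for a new, caseless CRF classifier.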