How do I train a Named Entity Recognizer in OpenNLP?

2023-03-25 06:14

OK, I have the following code to train the name finder (NER) from OpenNLP:

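import java.io.FileReader;
import java.util.Collections;
import opennlp.tools.namefind.*;
import opennlp.tools.util.*;

// Read the training file line by line; each line becomes one NameSample.
FileReader fileReader = new FileReader("train.txt");
ObjectStream<String> fileStream = new PlainTextByLineStream(fileReader);
ObjectStream<NameSample> sampleStream = new NameSampleDataStream(fileStream);
TokenNameFinderModel model = NameFinderME.train("pt-br", "train", sampleStream, Collections.<String, Object>emptyMap());
nfm = new NameFinderME(model);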

I don't know if I'm doing something wrong or if something is missing, but the classification is not working. I suspect that train.txt is wrong.

The problem is that all tokens are classified as the same single type.

My train.txt data looks something like the following example, but with many more entries and much more variation. Another thing is that I'm classifying the text one word at a time, not all the tokens of a sentence together.

<START:distance> 8000m <END>
<START:temperature> 100ºC <END>
<START:weight> 50kg <END>
<START:name> Renato <END>

Can somebody show me what I am doing wrong?

Your training data is not OK.

You should put all entities in a context inside a sentence:

At an altitude of <START:distance> 8000m <END> the temperature of boiling water is less than <START:temperature> 100ºC <END> .
The climber <START:name> Renato <END> is carrying <START:weight> 50kg <END> of equipment.
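To see why the context matters, it can help to look at what one training line contains once the markup is separated from the tokens. The following is a minimal sketch in plain Java, not OpenNLP code: the class AnnotatedLine and its parse method are hypothetical helpers written for this illustration, and the sketch assumes whitespace-separated tokens, as in the examples above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of OpenNLP: splits one line of the
// <START:type> ... <END> training format into plain tokens and entity spans.
class AnnotatedLine {
    final List<String> tokens = new ArrayList<>();
    // spans.get(i) = {startTokenIndex, endTokenIndexExclusive}; types.get(i) is its type.
    final List<int[]> spans = new ArrayList<>();
    final List<String> types = new ArrayList<>();

    static AnnotatedLine parse(String line) {
        AnnotatedLine result = new AnnotatedLine();
        int openStart = -1;      // token index where the currently open entity began
        String openType = null;  // type of the currently open entity, or null if none
        for (String part : line.trim().split("\\s+")) {
            if (part.isEmpty()) {
                continue; // blank input line
            }
            if (part.startsWith("<START:") && part.endsWith(">")) {
                if (openType != null) {
                    throw new IllegalArgumentException("nested <START> in: " + line);
                }
                openType = part.substring("<START:".length(), part.length() - 1);
                openStart = result.tokens.size();
            } else if (part.equals("<END>")) {
                if (openType == null) {
                    throw new IllegalArgumentException("<END> without <START> in: " + line);
                }
                result.spans.add(new int[] {openStart, result.tokens.size()});
                result.types.add(openType);
                openType = null;
            } else {
                result.tokens.add(part); // ordinary token, inside or outside an entity
            }
        }
        if (openType != null) {
            throw new IllegalArgumentException("unclosed <START> in: " + line);
        }
        return result;
    }

    public static void main(String[] args) {
        String[] lines = {
            "<START:name> Renato <END>",
            "The climber <START:name> Renato <END> is carrying <START:weight> 50kg <END> of equipment."
        };
        for (String line : lines) {
            AnnotatedLine s = parse(line);
            int entityTokens = 0;
            for (int[] span : s.spans) {
                entityTokens += span[1] - span[0];
            }
            System.out.println(s.tokens.size() + " tokens, "
                    + (s.tokens.size() - entityTokens) + " of them context, types " + s.types);
        }
    }
}
```

Run on a line from the original train.txt, the sketch finds one token and no context tokens at all, so every entity type looks the same from the outside. The sentence version gives each entity neighbouring words that a model can use as evidence.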

You will get better results if your training data comes from real-world sentences and has the same style as the sentences you are classifying. For example, you should train on a newspaper corpus if you are going to process news.

Also, you will need thousands of sentences to build your model! You can start with about a hundred to bootstrap, use the resulting weak model to help annotate more of your corpus, and then train your model again.
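One way to do the bootstrapping step is to write the weak model's predictions back out in the training format, correct them by hand, and append the corrected lines to train.txt. The sketch below covers only the write-back part and is an assumption about how you might organise it, not something OpenNLP provides: AnnotationWriter is a hypothetical class, and it takes spans as plain int pairs so that it runs without the library. With OpenNLP, the values would come from getStart(), getEnd() and getType() on the Span objects returned by NameFinderME.find(tokens).

```java
// Hypothetical helper, not part of OpenNLP: turns tokens plus predicted
// entity spans back into one line of the <START:type> ... <END> format.
class AnnotationWriter {

    // spans[i] = {startTokenIndex, endTokenIndexExclusive}, sorted by start
    // and non-overlapping; types[i] is the entity type of spans[i].
    static String annotate(String[] tokens, int[][] spans, String[] types) {
        StringBuilder sb = new StringBuilder();
        int next = 0; // index of the next span that has not been closed yet
        for (int i = 0; i < tokens.length; i++) {
            if (next < spans.length && spans[next][0] == i) {
                sb.append("<START:").append(types[next]).append("> ");
            }
            sb.append(tokens[i]);
            if (next < spans.length && spans[next][1] == i + 1) {
                sb.append(" <END>");
                next++;
            }
            if (i < tokens.length - 1) {
                sb.append(' ');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String[] tokens = {"The", "climber", "Renato", "is", "carrying", "50kg", "of", "equipment."};
        int[][] spans = {{2, 3}, {5, 6}};   // pretend these came from the weak model
        String[] types = {"name", "weight"};
        // Prints a line ready to be reviewed and appended to train.txt.
        System.out.println(annotate(tokens, spans, types));
    }
}
```

Reviewing and fixing pre-annotated lines is usually faster than tagging raw sentences from scratch, which is the point of starting with a small, poor model.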

And of course you should classify all the tokens of a sentence together; otherwise there is no context to decide the type of an entity.
