Build a Part-of-Speech Tagger (POS Tagger)

I need to build a POS tagger in Java and need to know how to get started. Are there code examples or other resources that help illustrate how POS taggers work?


Try Apache OpenNLP. It includes a POS tagger tool. You can download ready-to-use English models from the OpenNLP website.

The documentation provides details about how to use it from a Java application. Basically you need the following:

Load the POS model

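import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import opennlp.tools.postag.POSModel;
import opennlp.tools.postag.POSTaggerME;

// Declared outside the try block so the tagger can use it afterwards
POSModel model = null;

// try-with-resources closes the stream even if loading fails
try (InputStream modelIn = new FileInputStream("en-pos-maxent.bin")) {
  model = new POSModel(modelIn);
}
catch (IOException e) {
  // Model loading failed, handle the error
  e.printStackTrace();
}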

Instantiate the POS tagger

POSTaggerME tagger = new POSTaggerME(model);

Execute it

String[] sent = new String[]{"Most", "large", "cities", "in", "the", "US", "had", "morning", "and", "afternoon", "newspapers", "."};
String[] tags = tagger.tag(sent);

Note that the POS tagger expects a tokenized sentence, so the text must first be split into sentences and then into tokens. Apache OpenNLP also provides tools and models to help with these tasks.
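To make the "tokenized sentence" requirement concrete, here is a minimal sketch of a tokenizer in plain Java. It is not OpenNLP's tokenizer: the class name NaiveTokenizer and the regular expression are my own assumptions for illustration, and the rule is deliberately crude (a word is a run of letters or digits, and any other non-space character becomes its own token).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper for illustration; not part of OpenNLP. */
final class NaiveTokenizer {
    // A token is a run of letters/digits (allowing inner apostrophes or hyphens),
    // or else a single non-whitespace character such as "." or ",".
    private static final Pattern TOKEN =
        Pattern.compile("[\\p{L}\\p{N}]+(?:['-][\\p{L}\\p{N}]+)*|\\S");

    static String[] tokenize(String sentence) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(sentence);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens.toArray(new String[0]);
    }

    public static void main(String[] args) {
        String[] tokens = tokenize(
            "Most large cities in the US had morning and afternoon newspapers.");
        // Prints: [Most, large, cities, in, the, US, had, morning, and, afternoon, newspapers, .]
        System.out.println(Arrays.toString(tokens));
    }
}
```

For the sentence from the example above this produces the same twelve tokens that were passed to the tagger by hand. It keeps contractions such as "don't" as one token, which differs from the Penn Treebank convention that pre-trained English models are usually built on, so for real use a trained tokenizer is likely the better choice.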

If you have to train your own model, refer to the training section of the OpenNLP documentation.


You can examine existing tagger implementations.

For example, see the Stanford University POS tagger in Java (by Kristina Toutanova). It is available under the GNU General Public License (v2 or later), and its source code is well written and clearly documented:

http://nlp.stanford.edu/software/tagger.shtml

A good book to read about tagging is Speech and Language Processing (2nd Edition) by Daniel Jurafsky and James H. Martin.


There are a few POS/NER taggers used widely.

OpenNLP Maxent POS taggers: Using Apache OpenNLP.

OpenNLP is a Java NLP library from Apache. It provides various NLP tools, one of which is a part-of-speech (POS) tagger. POS taggers are typically used to uncover the grammatical structure of text. You start with a tagged dataset in which each word (part of a phrase) is labeled with a tag, build a model from this dataset, and then use the model to generate a tag for each word in new text.
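To illustrate the train-then-tag workflow itself, here is a toy sketch in plain Java. It is a most-frequent-tag baseline, not the maximum-entropy model that OpenNLP uses: it counts how often each word carries each tag in the training data and assigns the most common one. The class name BaselineTagger, the tiny training set, and the "NN" fallback for unseen words are all assumptions made for the example.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Toy most-frequent-tag baseline; illustration only, not OpenNLP's algorithm. */
final class BaselineTagger {
    // word -> (tag -> number of times the word was seen with that tag)
    private final Map<String, Map<String, Integer>> counts = new HashMap<>();
    private final String fallbackTag;

    BaselineTagger(String fallbackTag) {
        this.fallbackTag = fallbackTag;
    }

    /** "Training": record one tagged sentence. */
    void train(String[] words, String[] tags) {
        if (words.length != tags.length) {
            throw new IllegalArgumentException("words and tags must have equal length");
        }
        for (int i = 0; i < words.length; i++) {
            counts.computeIfAbsent(words[i].toLowerCase(Locale.ROOT), k -> new HashMap<>())
                  .merge(tags[i], 1, Integer::sum);
        }
    }

    /** "Tagging": pick each word's most frequent training tag, or the fallback. */
    String[] tag(String[] words) {
        String[] result = new String[words.length];
        for (int i = 0; i < words.length; i++) {
            Map<String, Integer> tagCounts = counts.get(words[i].toLowerCase(Locale.ROOT));
            result[i] = (tagCounts == null) ? fallbackTag : mostFrequent(tagCounts);
        }
        return result;
    }

    private static String mostFrequent(Map<String, Integer> tagCounts) {
        String best = null;
        int bestCount = -1;
        for (Map.Entry<String, Integer> e : tagCounts.entrySet()) {
            if (e.getValue() > bestCount) {
                best = e.getKey();
                bestCount = e.getValue();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        BaselineTagger tagger = new BaselineTagger("NN");
        tagger.train(new String[]{"the", "dog", "barks"}, new String[]{"DT", "NN", "VBZ"});
        tagger.train(new String[]{"the", "cat", "sleeps"}, new String[]{"DT", "NN", "VBZ"});
        String[] tags = tagger.tag(new String[]{"The", "cat", "barks"});
        System.out.println(String.join(" ", tags)); // prints "DT NN VBZ"
    }
}
```

A baseline like this ignores context entirely, so it cannot tell the noun "book" from the verb "book". Statistical taggers such as the ones discussed here improve on it by also looking at neighboring words and tags.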

Sample code for the OpenNLP tagger:

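// Requires: java.util.Arrays, java.util.List, opennlp.tools.postag.POSModel,
// opennlp.tools.postag.POSTaggerME, opennlp.tools.util.Sequence
public void doTagging(POSModel model, String input) {
    // Naive whitespace tokenization, done once and reused below
    String[] tokens = input.trim().split("\\s+");
    POSTaggerME tagger = new POSTaggerME(model);
    // topKSequences returns several top-scoring tag sequences for the sentence
    Sequence[] sequences = tagger.topKSequences(tokens);
    for (Sequence s : sequences) {
        List<String> tags = s.getOutcomes();
        System.out.println(Arrays.asList(tokens) + " => " + tags);
    }
}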

Detailed blog with the full code on how to use it:

https://dataturks.com/blog/opennlp-pos-tagger-training-java-example.php?s=so

Stanford CoreNLP based NER tagger:

Stanford CoreNLP is a mature, widely used NLP library and is often treated as a reference point for NLP accuracy. Among its other features, it supports named entity recognition (NER), which tags important entities in a piece of text, such as the names of people and places.

Sample code:

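// Requires: edu.stanford.nlp.ie.crf.CRFClassifier, edu.stanford.nlp.ling.CoreLabel
public void doTagging(CRFClassifier<CoreLabel> model, String input) {
  input = input.trim();
  // classifyToString returns the text with an entity label attached to each token
  System.out.println(input + " => " + model.classifyToString(input));
}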

Detailed blog with the full code on how to use it:

https://dataturks.com/blog/stanford-core-nlp-ner-training-java-example.php?s=so
