Is there a Java tokenizer that matches exactly what I want?

2023-03-23 10:41

I want to tokenize a text, but not by splitting only on whitespace.

There are some things, like proper names, that I want to keep as a single token (eg.: "Renato Dinhani Conceição"). Another case: a percentage ("60 %") should stay together rather than being split into two tokens.

What I want to know is whether there is a tokenizer in some library that offers a high degree of customization. If not, I will try to write my own; is there some interface or set of practices I should follow?

Not everything needs universal recognition. For example, I don't need to recognize Chinese characters.

My application is a college application and it is mainly aimed at the Portuguese language. Only a few things, such as names of people and places, will come from other languages.

I would try to go about it not from a tokenization perspective, but from a rules perspective. This will be the biggest challenge - creating a comprehensive rule set that will satisfy most of your cases.

Define in human terms which units should not be split on whitespace. The name example is one.
For each of those exceptions to the whitespace split, create a set of rules for how to identify it. For the name example: 2 or more consecutive capitalized words, with or without language-specific non-capitalized name words in between (like "de").
Implement each rule as its own class, which can be called as you loop.
Split the entire string on whitespace and then loop over the result, keeping track of which token came before and which is current, applying your rule classes to each token.

Example for rule isName:

Loop 1: (eg.: isName = false
Loop 2: "Renato isName = true
Loop 3: Dinhani isName = true
Loop 4: Conceição"). isName = true
Loop 5: Another isName = false

Leaving you with: (eg.:, "Renato Dinhani Conceição")., Another
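The steps above could be sketched roughly as follows. This is only an illustration, not a finished tokenizer: the names RuleTokenizer, MergeRule, NameRule and PercentRule are made up for this example, the name rule only handles consecutive capitalized words (connectors like "de" are left out), and it assumes that a word ending in punctuation closes a name.

```java
import java.util.ArrayList;
import java.util.List;

class RuleTokenizer {

    /** Hypothetical rule contract: should `current` be glued onto the token before it? */
    interface MergeRule {
        boolean joins(String previous, String current);
    }

    /** Joins runs of capitalized words, e.g. "Renato" + "Dinhani". */
    static class NameRule implements MergeRule {
        @Override
        public boolean joins(String previous, String current) {
            // `previous` may already be a merged token; only its last word matters.
            String lastWord = previous.substring(previous.lastIndexOf(' ') + 1);
            return isCapitalized(lastWord) && !endsClause(lastWord) && isCapitalized(current);
        }

        // True if the first letter is upper case; leading quotes or brackets are skipped.
        private static boolean isCapitalized(String word) {
            for (int i = 0; i < word.length(); i++) {
                char c = word.charAt(i);
                if (Character.isLetter(c)) {
                    return Character.isUpperCase(c);
                }
            }
            return false;
        }

        // Assumption: trailing punctuation closes the name, so a following
        // sentence-initial capital ("Another") is not swallowed.
        private static boolean endsClause(String word) {
            return ".,;:!?".indexOf(word.charAt(word.length() - 1)) >= 0;
        }
    }

    /** Joins a percent sign onto a preceding number, e.g. "60" + "%". */
    static class PercentRule implements MergeRule {
        @Override
        public boolean joins(String previous, String current) {
            return Character.isDigit(previous.charAt(previous.length() - 1))
                    && current.startsWith("%");
        }
    }

    static final List<MergeRule> RULES = List.of(new NameRule(), new PercentRule());

    /** Splits on whitespace, then merges a token into the previous one if any rule says so. */
    static List<String> tokenize(String text, List<MergeRule> rules) {
        List<String> tokens = new ArrayList<>();
        for (String current : text.trim().split("\\s+")) {
            if (current.isEmpty()) {
                continue; // only happens for blank input
            }
            boolean merge = false;
            if (!tokens.isEmpty()) {
                String previous = tokens.get(tokens.size() - 1);
                for (MergeRule rule : rules) {
                    if (rule.joins(previous, current)) {
                        merge = true;
                        break;
                    }
                }
            }
            if (merge) {
                int last = tokens.size() - 1;
                tokens.set(last, tokens.get(last) + " " + current);
            } else {
                tokens.add(current);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "(eg.: \"Renato Dinhani Conceição\"). Another case: (\"60 %\") and";
        tokenize(text, RULES).forEach(System.out::println);
    }
}
```

A known weakness of this sketch: a capitalized word at the start of a sentence followed by a name (as in "O José") gets merged into the name, so a real rule set would need more context than just the previous token.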

I think that a tokenizer is going to be too simplistic for what you want. One step up from a tokenizer would be a lexer, for example one built with a generator like JFlex. Such a lexer splits a stream of characters into separate tokens, like a tokenizer, but with much more flexible rules.

Even so, it seems like you're going to need some sort of natural language processing, as teaching a lexer the difference between a proper name and normal words might be tricky. You might be able to get pretty far by teaching it that a string of words that start with upper-case letters all belong together, numbers may be followed by units, etc. Good luck.
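To give an idea of how far those two heuristics go without a full lexer generator, here is a small sketch using only java.util.regex. It is not JFlex, and the class name RegexLexer and the exact token shapes are assumptions made for this example; it also assumes accented letters are precomposed (NFC) so that \p{L} matches them.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexLexer {

    // Alternatives are tried left to right at each position,
    // so the more specific token shapes come first.
    private static final Pattern TOKEN = Pattern.compile(
          "\\p{Lu}\\p{L}*(?:\\s+\\p{Lu}\\p{L}*)+"   // two or more capitalized words
        + "|\\d+(?:[.,]\\d+)?(?:\\s*%)?"            // a number, optionally followed by %
        + "|\\p{L}+"                                // any other word
        + "|[^\\s\\p{L}\\d]"                        // a single punctuation character
    );

    /** Returns the tokens in order of appearance; whitespace between tokens is dropped. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher matcher = TOKEN.matcher(text);
        while (matcher.find()) {
            tokens.add(matcher.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Renato Dinhani Conceição pagou 60 % da conta."));
    }
}
```

Unlike a plain whitespace split, this also separates punctuation from words, but it still cannot tell a proper name from a capitalized word at the start of a sentence.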

You should try Apache OpenNLP. It includes ready to use Sentence Detector and Tokenizer models for Portuguese.

Download Apache OpenNLP and extract it. Then download the Portuguese tokenizer model from http://opennlp.sourceforge.net/models-1.5/ and copy it to the OpenNLP folder.

Using it from command line:

bin/opennlp TokenizerME pt-token.bin
Loading Tokenizer model ... done (0,156s)
O José da Silva chegou, está na sua sala.
O José da Silva chegou , está na sua sala .

Using the API:

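import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import opennlp.tools.tokenize.Tokenizer;
import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;

// the statements below belong inside a method

// load the model; try-with-resources closes the stream even if loading fails
TokenizerModel model;
try (InputStream modelIn = new FileInputStream("pt-token.bin")) {
  model = new TokenizerModel(modelIn);
}
catch (IOException e) {
  // without a model there is nothing to tokenize with, so fail loudly
  throw new IllegalStateException("Could not load pt-token.bin", e);
}

// load the tokenizer
Tokenizer tokenizer = new TokenizerME(model);

// tokenize your sentence
String[] tokens = tokenizer.tokenize("O José da Silva chegou, está na sua sala.");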

StringTokenizer is a legacy class that is maintained only for backward compatibility. Its use is discouraged in new code.

You should use the String.split() function, which takes a regular expression as its argument. You can go further with the Pattern and Matcher classes: compile your Pattern objects once and then reuse them to match your various scenarios.
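As a small sketch of that idea, the split below keeps "60 %" together by refusing to split on whitespace that is followed by a percent sign. The class name SplitTokenizer and both patterns are assumptions made for this example; proper names would still need extra rules on top.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class SplitTokenizer {

    // Compiled once and reused. The possessive \s++ never gives whitespace back,
    // so "60  %" (two spaces) is not split between the spaces either.
    private static final Pattern SEPARATOR = Pattern.compile("\\s++(?!%)");

    // Recognizes tokens such as "60 %", "60%" or "3,5 %".
    private static final Pattern PERCENTAGE = Pattern.compile("\\d+(?:[.,]\\d+)?\\s*%");

    static String[] tokenize(String text) {
        return SEPARATOR.split(text.trim());
    }

    static boolean isPercentage(String token) {
        return PERCENTAGE.matcher(token).matches();
    }

    public static void main(String[] args) {
        for (String token : tokenize("Cerca de 60 % dos alunos")) {
            System.out.println(token + (isPercentage(token) ? "  <- percentage" : ""));
        }
    }
}
```

The same separator also works inline as text.split("\\s++(?!%)"), but compiling the Pattern once avoids recompiling it on every call.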
