Is there a tokenizer for Java that matches exactly what I want?
I want to tokenize a text, but not by splitting only on whitespace.
There are some things, like proper names, that I want to keep as a single token (e.g.: "Renato Dinhani Conceição"). Another case: percentages ("60 %") should not be split into two tokens.
What I want to know is whether there is a tokenizer in some library that provides high customization. If not, I will try to write my own, if there is some interface or set of practices to follow.
Not everything needs universal recognition. For example, I don't need to recognize the Chinese alphabet.
My application is a college application, and it is mainly aimed at the Portuguese language. Only some things like names, places, and similar items will be from other languages.
I would approach this not from a tokenization perspective, but from a rules perspective. The biggest challenge will be creating a comprehensive rule set that satisfies most of your cases.
- Define, in human terms, the units that should not be split up based on whitespace. The name example is one.
- For each of those exceptions to the whitespace split, create a set of rules for identifying it. For the name example: two or more consecutive capitalized words, with or without language-specific non-capitalized name words in between (like "de").
- Implement each rule as its own class, which can be called as you loop.
- Split the entire string on whitespace, then loop over it, keeping track of the previous token and the current one, applying your rule classes to each token.
Example for rule isName:
- Loop 1: `(eg.:` → isName = false
- Loop 2: `"Renato` → isName = true
- Loop 3: `Dinhani` → isName = true
- Loop 4: `Conceição").` → isName = true
- Loop 5: `Another` → isName = false

Leaving you with: `(eg.:`, `"Renato Dinhani Conceição").`, `Another`
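The loop described above can be sketched as follows. The `isNamePart` rule here is a hypothetical, simplified predicate (a capitalized word, or a Portuguese connective such as "de"/"da" that can occur inside proper names); a real rule set would need the previous-token tracking and additional rule classes described above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class RuleTokenizer {

    // Connectives that may appear inside Portuguese proper names.
    private static final Set<String> CONNECTIVES = Set.of("de", "da", "do", "dos", "das");

    // Hypothetical isName-style rule: accept capitalized words and connectives.
    static boolean isNamePart(String word) {
        if (word.isEmpty()) return false;
        if (CONNECTIVES.contains(word)) return true;
        return Character.isUpperCase(word.charAt(0));
    }

    // Split on whitespace, then merge consecutive tokens accepted by the rule.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String word : text.split("\\s+")) {
            if (isNamePart(word)) {
                if (current.length() > 0) current.append(' ');
                current.append(word);
            } else {
                if (current.length() > 0) {        // flush the merged name run
                    tokens.add(current.toString());
                    current.setLength(0);
                }
                tokens.add(word);
            }
        }
        if (current.length() > 0) tokens.add(current.toString());
        return tokens;
    }
}
```

Note that this naive rule also merges a single capitalized word at a sentence start into a run; distinguishing that case is exactly where the previous-token tracking earns its keep.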
I think that a tokenizer is going to be too simplistic for what you want. One step up from a tokenizer would be a lexer like JFlex. These will split up a stream of characters into separate tokens like a tokenizer, but with much more flexible rules.
Even so, it seems like you're going to need some sort of natural language processing, as teaching a lexer the difference between a proper name and normal words might be tricky. You might be able to get pretty far by teaching it that a string of words that start with upper-case letters all belong together, numbers may be followed by units, etc. Good luck.
You should try Apache OpenNLP. It includes ready-to-use sentence detector and tokenizer models for Portuguese.
Download Apache OpenNLP and extract it, then download the Portuguese tokenizer model from http://opennlp.sourceforge.net/models-1.5/ and copy it to the OpenNLP folder.
Using it from command line:
bin/opennlp TokenizerME pt-token.bin
Loading Tokenizer model ... done (0,156s)
O José da Silva chegou, está na sua sala.
O José da Silva chegou , está na sua sala .
Using the API:
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import opennlp.tools.tokenize.Tokenizer;
import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;

// load the model (declared outside the try block so it stays in scope)
TokenizerModel model = null;
InputStream modelIn = new FileInputStream("pt-token.bin");
try {
    model = new TokenizerModel(modelIn);
}
catch (IOException e) {
    e.printStackTrace();
}
finally {
    if (modelIn != null) {
        try {
            modelIn.close();
        }
        catch (IOException e) {
        }
    }
}

// create the tokenizer
Tokenizer tokenizer = new TokenizerME(model);

// tokenize your sentence
String[] tokens = tokenizer.tokenize("O José da Silva chegou, está na sua sala.");
StringTokenizer is a legacy class that is maintained only for backward compatibility. Its use is discouraged in new code.
You should use the String.split() function. The split function takes a regular expression as its argument. Additionally, you can enhance it by using the Pattern and Matcher classes. You can compile your pattern objects and then use them to match various scenarios.
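As a sketch of the Pattern/Matcher approach, the regex below is illustrative, not a complete rule set: it first tries to match a run of two or more capitalized words, then a number followed by a percent sign (covering "60 %"), and otherwise falls back to any run of non-space characters:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexTokenizer {

    // Order matters: the more specific alternatives must come first.
    private static final Pattern TOKEN = Pattern.compile(
        "\\p{Lu}\\p{L}*(?:\\s\\p{Lu}\\p{L}*)+"  // run of two or more capitalized words
        + "|\\d+(?:[.,]\\d+)?\\s?%"             // number with an optional space and percent sign
        + "|\\S+");                             // fallback: any non-space run

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }
}
```

The `\p{Lu}` and `\p{L}` Unicode properties make the capitalized-word rule work for accented Portuguese letters like "ç" and "ã", which a plain `[A-Z]` class would miss.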