How to extend WhitespaceTokenizer?
I need a tokenizer that splits words on whitespace, but that doesn't split when the whitespace is within double parentheses. Here is an example:
My input -> term1 term2 term3 ((term4 term5)) term6
should produce this list of tokens:
term1, term2, term3, ((term4 term5)), term6.
I think I can obtain this behaviour by extending Lucene's WhitespaceTokenizer. How can I perform this extension?
Are there any other solutions? Thanks in advance.
I haven't tried to extend the Tokenizer, but I have a solution here with a regular expression that I think is nice:
\w+|\(\([\w\s]*\)\)
And a method that splits a string by the groups matched by the regex, returning an array. Code example:
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class Regex_ComandLine {

    public static void main(String[] args) {
        String input = "term1 term2 term3 ((term4 term5)) term6"; // your input
        String[] parsedInput = splitByMatchedGroups(input, "\\w+|\\(\\([\\w\\s]*\\)\\)");
        for (String arg : parsedInput) {
            System.out.println(arg);
        }
    }

    static String[] splitByMatchedGroups(String string, String patternString) {
        List<String> matchList = new ArrayList<>();
        Matcher regexMatcher = Pattern.compile(patternString).matcher(string);
        // collect every match of the pattern rather than splitting on it
        while (regexMatcher.find()) {
            matchList.add(regexMatcher.group());
        }
        return matchList.toArray(new String[0]);
    }
}
The output:
term1
term2
term3
((term4 term5))
term6
Hope this helps you.
Please note that the following code, which uses the usual split():
String[] parsedInput = input.split("\\w+|\\(\\([\\w\\s]*\\)\\)");
will return nothing, or not what you want, because split() only checks for delimiters.
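To see why split() fails here, consider this small demonstration (class and method names are my own, chosen for illustration): split() discards the matched text as a delimiter, so every term we actually want is thrown away and only the whitespace between matches survives.

```java
import java.util.Arrays;

class SplitDemo {

    static String[] runSplit() {
        String input = "term1 term2 term3 ((term4 term5)) term6";
        // split() treats each match of the pattern as a delimiter,
        // so the terms themselves are removed and only the gaps remain
        return input.split("\\w+|\\(\\([\\w\\s]*\\)\\)");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(runSplit()));
    }
}
```

Every element of the returned array is blank (an empty leading string or a single space), which is why find() with group() is the right tool here, not split().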
You can do this by extending WhitespaceTokenizer, but I expect it will be easier to write a TokenFilter that reads from a WhitespaceTokenizer and pastes together consecutive tokens based on the number of parentheses. Overriding incrementToken is the main task when writing a Tokenizer-like class. I once did this myself; the result might serve as an example (though for technical reasons, I couldn't make my class a TokenFilter).
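The "paste together based on the number of parentheses" idea can be sketched in plain Java, independent of Lucene (a real TokenFilter would carry this state across incrementToken calls instead; the class and method names below are illustrative, not from any Lucene API): tokenize on whitespace first, then keep buffering tokens while the count of unmatched '(' is positive.

```java
import java.util.ArrayList;
import java.util.List;

class ParenMerger {

    static List<String> merge(String input) {
        List<String> result = new ArrayList<>();
        StringBuilder buffer = new StringBuilder();
        int depth = 0; // number of '(' not yet matched by ')'
        for (String token : input.split("\\s+")) {
            // update the parenthesis depth from this token's characters
            for (char c : token.toCharArray()) {
                if (c == '(') depth++;
                else if (c == ')') depth--;
            }
            if (buffer.length() > 0) buffer.append(' ');
            buffer.append(token);
            if (depth <= 0) { // parentheses balanced: emit the buffered token
                result.add(buffer.toString());
                buffer.setLength(0);
                depth = 0;
            }
        }
        if (buffer.length() > 0) result.add(buffer.toString()); // unbalanced tail
        return result;
    }

    public static void main(String[] args) {
        System.out.println(merge("term1 term2 term3 ((term4 term5)) term6"));
    }
}
```

For the example input this yields term1, term2, term3, ((term4 term5)), term6 as separate tokens. In an actual TokenFilter, incrementToken would pull tokens from the wrapped WhitespaceTokenizer and apply the same depth-counting logic before emitting a merged token.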