How to configure the indexer so that "word1.word2" is considered as two words
Suppose a file 'test.txt' is being indexed, and its content is:
word1.word2
What should I do to make Lucene treat "word1.word2" as two words, "word1" and "word2", rather than as the single word "word1.word2"?
When Lucene indexes text with an analyzer, the analyzer converts the text of each field into a stream of tokens (terms); the fields together form a document.
Basically, you can:
1) Create a StopAnalyzer and pass it a HashSet containing "." (period) as a stop word. This can have an adverse effect on indexing, since you must use the same analyzer for both searching and indexing.
2) Replace the "." with a space before indexing, so that the two words are indexed separately.
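Option 2 can be sketched in plain Java as a pre-processing step applied to the text before it is handed to the indexer. This is only an illustration under assumptions: the class `PeriodPreprocessor` and its methods are hypothetical names, not part of Lucene, and the whitespace split merely approximates what a whitespace-based analyzer would emit.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Lucene: rewrites text before it is indexed.
class PeriodPreprocessor {

    // Replace every period with a space so "word1.word2" becomes "word1 word2".
    static String replacePeriods(String text) {
        return text.replace('.', ' ');
    }

    // Split on runs of whitespace, roughly what a whitespace-based tokenizer would emit.
    static List<String> whitespaceTokens(String text) {
        List<String> tokens = new ArrayList<>();
        for (String part : text.trim().split("\\s+")) {
            if (!part.isEmpty()) {
                tokens.add(part);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String content = "word1.word2"; // content of test.txt from the question
        String prepared = replacePeriods(content);
        System.out.println(whitespaceTokens(prepared)); // prints [word1, word2]
    }
}
```

Query strings would need the same replacement applied, otherwise a search for "word1.word2" would not match the indexed terms. The replacement also breaks up other dotted text such as decimal numbers or host names.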
That depends on which Analyzer you are using. The short, generic answer is to use a SimpleAnalyzer, which uses a LetterTokenizer. The LetterTokenizer splits at any non-letter, which includes the dot character. Note that digits are non-letters as well, so the "1" and "2" in "word1.word2" are also treated as separators.
If you have more specific tokenization requirements, you must write a custom Analyzer class whose tokenStream method returns a custom TokenStream or Tokenizer object.
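To make the difference between tokenization rules concrete, here is a plain-Java emulation of a tokenizer that keeps maximal runs of accepted characters. It is a sketch, not Lucene code: `CharClassSplitter` and `split` are hypothetical names, and a real custom Tokenizer would apply the same kind of per-character test inside Lucene's own classes, whose exact signatures depend on the Lucene version.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

// Plain-Java emulation of a character-class tokenizer; it does not use Lucene's classes.
class CharClassSplitter {

    // Collect maximal runs of code points accepted by isTokenChar.
    static List<String> split(String text, IntPredicate isTokenChar) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (isTokenChar.test(cp)) {
                current.appendCodePoint(cp);
            } else if (current.length() > 0) {
                // A separator ends the current token.
                tokens.add(current.toString());
                current.setLength(0);
            }
            i += Character.charCount(cp);
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String content = "word1.word2";
        // Letters only: digits act as separators too.
        System.out.println(split(content, Character::isLetter));        // prints [word, word]
        // Letters and digits: only the period separates.
        System.out.println(split(content, Character::isLetterOrDigit)); // prints [word1, word2]
    }
}
```

With a letters-only rule the digits are lost, so getting exactly "word1" and "word2" calls for a rule that accepts letters and digits, which is the kind of requirement a custom tokenizer would cover.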