Searching hyphenated words with Lucene
I want Lucene to treat hyphenated words, e.g. energy-efficient or "energy-efficient", as one single word when searching.
Currently, if the input is energy-efficient, the tokenizer generates terms such as energy, efficient, energy efficient, or energy-efficient.
As a result, Lucene returns pages containing both "energy efficient" and "energy-efficient", but I want it to return only pages containing energy-efficient.
So the question is: how can I modify StandardTokenizer so that it treats energy-efficient as one whole word instead of breaking it into separate words?
Use WhitespaceAnalyzer instead of StandardAnalyzer.
That will generate tokens by splitting only on whitespace. But check what else changes as a result.
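To make the trade-off concrete, here is a minimal plain-Java sketch of the two splitting strategies. It does not call Lucene at all: HyphenTokenDemo, whitespaceTokens and wordTokens are hypothetical helpers, and wordTokens is only a rough stand-in for word-based tokenization with lowercasing, not the actual StandardAnalyzer logic.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Plain-Java illustration only; these helpers are hypothetical and are not Lucene classes.
class HyphenTokenDemo {

    // Whitespace-only splitting: the text between spaces is kept exactly as written.
    static List<String> whitespaceTokens(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return new ArrayList<>();
        }
        return Arrays.asList(trimmed.split("\\s+"));
    }

    // Rough stand-in for word-based tokenization: runs of letters/digits, lowercased.
    static List<String> wordTokens(String text) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile("[\\p{L}\\p{N}]+").matcher(text);
        while (m.find()) {
            out.add(m.group().toLowerCase(Locale.ROOT));
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "Buy Energy-efficient lamps, today.";
        System.out.println(whitespaceTokens(text));
        System.out.println(wordTokens(text));
    }
}
```

In this sketch the hyphenated word survives whitespace splitting, but so do the trailing comma, the full stop and the original capitalisation. Those are the kind of side effects to check for before switching analyzers.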
Here is my complete blog on Lucene and Hyphen
If you want StandardAnalyzer to support hyphens, you have to change StandardTokenizerImpl, which is responsible for tokenization.
StandardTokenizer breaks hyphenated words in two; for example, "energy-efficient" is tokenized as energy, efficient.
StandardTokenizerImpl.java is a class generated by JFlex from the input file StandardTokenizerImpl.jflex, so you have to add the following line to SUPPLEMENTARY.jflex-macro, which is included by StandardTokenizerImpl.jflex:
MidLetterSupp = ( [\u002D] )
After that, regenerate StandardTokenizerImpl.java using JFlex and rebuild the index.
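The intended effect of treating the hyphen (U+002D) as a mid-letter character can be sketched in plain Java. This is a simplified illustration under the assumption that a mid-letter character joins two words only when it sits directly between letters or digits. MidLetterDemo and its tokens method are hypothetical and are not the generated scanner or any Lucene API.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified illustration of a "mid-letter" joining rule; not Lucene code.
class MidLetterDemo {

    // midLetters: characters that may sit between two letters/digits without ending the token.
    static List<String> tokens(String text, String midLetters) {
        List<String> out = new ArrayList<>();
        StringBuilder cur = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            // A mid-letter joins only if a token is in progress and a letter/digit follows.
            boolean joins = midLetters.indexOf(c) >= 0
                    && cur.length() > 0
                    && i + 1 < text.length()
                    && Character.isLetterOrDigit(text.charAt(i + 1));
            if (Character.isLetterOrDigit(c) || joins) {
                cur.append(c);
            } else if (cur.length() > 0) {
                out.add(cur.toString());
                cur.setLength(0);
            }
        }
        if (cur.length() > 0) {
            out.add(cur.toString());
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "energy-efficient lamps - cheap";
        System.out.println(tokens(text, ""));        // hyphen always splits
        System.out.println(tokens(text, "\u002D"));  // hyphen joins letters
    }
}
```

Note that in this sketch a free-standing hyphen surrounded by spaces still produces no token of its own. Documents indexed before the change contain the old split terms, which is why the index has to be rebuilt.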