How to index a word with a hyphen in Lucene?
I have a StandardAnalyzer working which retrieves words and frequencies from a single document using a TermVectorMapper which is populating a HashMap.
But if I use the following text as a field in my document, i.e.
addDoc(w, "lucene Lawton-Browne Lucene");
The word frequencies returned in the HashMap are:
browne 1
lucene 2
lawton 1
The problem is the words ‘lawton’ and ‘browne’. If this is an actual ‘double-barreled’ name, can Lucene recognise it as ‘Lawton-Browne’ where the name is actually a single word?
I’ve tried combinations of:
addDoc(w, "lucene \"Lawton-Browne\" Lucene");
And single quotes but without success.
Thanks
Mr Morgan.
If you still want to be able to use a stop word list, I suggest you try the PatternAnalyzer. It accepts such a list and comes with a predefined whitespace pattern, so the text is split on whitespace only and a hyphenated name stays a single token.
Alternatively, wrap the WhitespaceAnalyzer and apply the stop filter yourself in tokenStream(String fieldName, Reader reader), something like this:
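public TokenStream tokenStream(String fieldName, Reader reader) {
    // Tokenize on whitespace only, so "Lawton-Browne" stays one token
    TokenStream stream = myWhitespaceAnalyzer.tokenStream(fieldName, reader);
    // Then drop stop words from the resulting stream
    stream = new StopFilter(stream, stopWords);
    return stream;
}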
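To see why splitting on whitespace alone keeps the name together, here is a minimal plain-Java sketch of the idea. It does not use the Lucene API: the class WhitespaceTermCount and the method termFrequencies are hypothetical names chosen for illustration. The lowercasing step is an assumption added to match the lowercase terms in the question; as far as I know WhitespaceAnalyzer does not lowercase on its own, so in Lucene you would probably add a LowerCaseFilter to the chain as well.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

final class WhitespaceTermCount {

    // Hypothetical helper (not part of Lucene): split on whitespace only,
    // lowercase each token, skip stop words, and count what is left.
    static Map<String, Integer> termFrequencies(String text, Set<String> stopWords) {
        Map<String, Integer> freq = new LinkedHashMap<>();
        for (String token : text.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue; // empty input yields a single empty token
            }
            String term = token.toLowerCase(Locale.ROOT);
            if (stopWords.contains(term)) {
                continue; // same role as the StopFilter above
            }
            freq.merge(term, 1, Integer::sum);
        }
        return freq;
    }

    public static void main(String[] args) {
        // Assumed stop word list, for illustration only
        Set<String> stopWords = new HashSet<>(Arrays.asList("the", "a", "of"));
        // prints {lucene=2, lawton-browne=1}
        System.out.println(termFrequencies("lucene Lawton-Browne Lucene", stopWords));
    }
}
```

Because the hyphen is never treated as a separator, 'lawton-browne' comes out as one term with a frequency of 1, instead of 'lawton' and 'browne' separately.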
Escape the special characters.
See the Lucene documentation here:
http://lucene.apache.org/java/2_4_0/queryparsersyntax.html#Escaping%20Special%20Characters
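As a rough illustration of what escaping means here, the sketch below puts a backslash in front of each character that the linked page lists as special in the query syntax. QueryEscaper and its escape method are hypothetical names written in plain Java for this example; Lucene's QueryParser has its own static escape(String) method, which is what you would normally call instead.

```java
final class QueryEscaper {

    // Characters treated as special by the query parser syntax page linked above:
    // + - && || ! ( ) { } [ ] ^ " ~ * ? : \
    private static final String SPECIAL = "\\+-!(){}[]^\"~*?:&|";

    // Hypothetical helper: prefix every special character with a backslash.
    static String escape(String s) {
        StringBuilder sb = new StringBuilder(s.length() * 2);
        for (char c : s.toCharArray()) {
            if (SPECIAL.indexOf(c) >= 0) {
                sb.append('\\');
            }
            sb.append(c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // prints Lawton\-Browne
        System.out.println(escape("Lawton-Browne"));
    }
}
```

Escaping only changes how the query string is parsed. The analyzer on the field still decides how the text is split into tokens at index time.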