How to index a word with a hyphen in Lucene?

I have a StandardAnalyzer set up, and I retrieve the words and their frequencies from a single document using a TermVectorMapper that populates a HashMap.

But if I use the following text as a field in my document, i.e.

addDoc(w, "lucene Lawton-Browne Lucene");

The word frequencies returned in the HashMap are:

browne 1
lucene 2
lawton 1

The problem is the words ‘lawton’ and ‘browne’. If this is an actual ‘double-barreled’ name, can Lucene recognise it as ‘Lawton-Browne’ where the name is actually a single word?

I’ve tried combinations of:

addDoc(w, "lucene \"Lawton-Browne\" Lucene");

and the same with single quotes, but without success.

Thanks

Mr Morgan.


If you still want to be able to use a stop word list, I suggest you try the PatternAnalyzer. It accepts such a list and comes with a predefined whitespace pattern.

Or you can wrap the WhitespaceAnalyzer and apply a StopFilter in tokenStream(String fieldName, Reader reader), something like this:

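public TokenStream tokenStream(String fieldName, Reader reader) {
  // Split on whitespace only, so "Lawton-Browne" stays a single token.
  TokenStream stream = myWhitespaceAnalyzer.tokenStream(fieldName, reader);
  // Then drop the stop words from the stream.
  stream = new StopFilter(stream, stopWords);
  return stream;
}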
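To see why splitting on whitespace alone keeps the hyphenated name together, here is a rough plain-Java sketch of the same idea. It uses no Lucene classes at all: HyphenTokenDemo and countTerms are hypothetical names made up for illustration, and the lower-casing step is an assumption added to mimic what StandardAnalyzer does (WhitespaceAnalyzer on its own does not lower-case).

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Plain-Java illustration only; none of these names are Lucene classes.
class HyphenTokenDemo {

    // Splits on whitespace only, lower-cases each token (an assumption, see above),
    // drops stop words, and counts how often each remaining token occurs.
    static Map<String, Integer> countTerms(String text, Set<String> stopWords) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String raw : text.trim().split("\\s+")) {
            if (raw.isEmpty()) {
                continue; // only happens when the input is blank
            }
            String token = raw.toLowerCase(Locale.ROOT);
            if (stopWords.contains(token)) {
                continue; // stop words are not counted
            }
            counts.merge(token, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        Set<String> stopWords = new HashSet<>(Arrays.asList("the", "a", "of"));
        // prints {lucene=2, lawton-browne=1}
        System.out.println(countTerms("lucene Lawton-Browne Lucene", stopWords));
    }
}
```

Because the hyphen is never treated as a separator, "lawton-browne" comes out as one term with a frequency of 1, which is the behaviour the question asks for.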


Escape the special characters.

See the Lucene documentation here:

http://lucene.apache.org/java/2_4_0/queryparsersyntax.html#Escaping%20Special%20Characters
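As a rough illustration of what escaping means here, the sketch below puts a backslash in front of each query-syntax special character listed on that page. QueryEscapeDemo and its escape method are hypothetical names for this example, not Lucene's API; Lucene's QueryParser has its own static escape(String) helper, which is probably the better choice in real code. Note that this kind of escaping affects how a query string is parsed, not how an analyzer splits text at index time.

```java
// Plain-Java illustration only; this class is not part of Lucene.
class QueryEscapeDemo {

    // Characters with special meaning in the query syntax: \ + - ! ( ) : ^ [ ] " { } ~ * ? | &
    private static final String SPECIAL = "\\+-!():^[]\"{}~*?|&";

    // Returns the input with a backslash inserted before every special character.
    static String escape(String input) {
        StringBuilder out = new StringBuilder(input.length() + 8);
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (SPECIAL.indexOf(c) >= 0) {
                out.append('\\');
            }
            out.append(c);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // prints Lawton\-Browne
        System.out.println(escape("Lawton-Browne"));
    }
}
```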
