How to perform a 'contains' search rather than 'starts with' using Lucene.Net
We use Lucene.NET to implement a full-text search on a client's website. The search itself already works, but we now want to modify it.
Currently every term gets a * appended, which leads Lucene to perform what I would classify as a StartsWith search.
In the future we would like the search to perform something like a Contains rather than a StartsWith.
We use
- Lucene.Net 2.9.2.2
- StandardAnalyzer
- default QueryParser
Samples:
(Title:Orch*)
matches: Orchestra
but:
(Title:rch*)
does not match: Orchestra
We want both the first and the second query to match Orchestra.
Basically I want the exact opposite of what was asked in this question. I'm not sure why, for that person, Lucene performed a Contains rather than a StartsWith by default.
How can we make this happen?
I have the feeling it has something to do with the Analyzer but I'm not sure.
First off, I assume you're using StandardAnalyzer, or something similar. The question you linked overlooks the fact that Lucene searches for terms: in that case a* matches "Fleet Africa" because the text is tokenized into "fleet" and "africa".
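To make the term-based matching concrete, here is a minimal sketch in plain Python. It is not Lucene code: `tokenize` is only a rough stand-in for what StandardAnalyzer does (lowercasing and splitting on non-alphanumeric characters), and `prefix_match` is a hypothetical helper that models a `prefix*` query being tested against each indexed term rather than against the whole field value.

```python
import re

def tokenize(text):
    """Rough stand-in for StandardAnalyzer: lowercase, split on non-alphanumerics."""
    return [t for t in re.split(r"[^0-9a-z]+", text.lower()) if t]

def prefix_match(text, prefix):
    """Model of a `prefix*` query: true if ANY term of the text starts with the prefix."""
    prefix = prefix.lower()
    return any(term.startswith(prefix) for term in tokenize(text))

print(tokenize("Fleet Africa"))           # ['fleet', 'africa']
print(prefix_match("Fleet Africa", "a"))  # True: the term "africa" starts with "a"
print(prefix_match("Orchestra", "orch"))  # True
print(prefix_match("Orchestra", "rch"))   # False: no term starts with "rch"
```

So a* only looks like a Contains search there because the match happens on the second term of the field, not in the middle of a term.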
You need to call QueryParser.SetAllowLeadingWildcard(true) to be able to write queries like field:*value*. Are you actually changing the string that's passed to QueryParser?
You could parse the query as usual, and then implement a QueryVisitor that rewrites every TermQuery into a WildcardQuery. That way you still support phrase searches.
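The rewriting idea can be sketched roughly as follows. This is plain Python, not the Lucene.Net API: the classes below only borrow Lucene's query names and are heavily simplified (a real BooleanQuery clause also carries an occurrence flag, for instance), and `to_contains` is a hypothetical name for the visitor.

```python
from dataclasses import dataclass
from typing import List

# Simplified stand-ins for the Lucene query types of the same name.
@dataclass
class TermQuery:
    field: str
    text: str

@dataclass
class WildcardQuery:
    field: str
    pattern: str

@dataclass
class PhraseQuery:
    field: str
    terms: List[str]

@dataclass
class BooleanQuery:
    clauses: list

def to_contains(query):
    """Walk the parsed query tree and turn each TermQuery into *text*."""
    if isinstance(query, TermQuery):
        return WildcardQuery(query.field, "*" + query.text + "*")
    if isinstance(query, BooleanQuery):
        return BooleanQuery([to_contains(c) for c in query.clauses])
    return query  # phrase queries and anything else pass through unchanged

parsed = BooleanQuery([
    TermQuery("Title", "rch"),
    PhraseQuery("Title", ["royal", "orchestra"]),
])
print(to_contains(parsed))
```

Because the rewrite happens on the parsed tree rather than on the query string, the phrase clause is left alone and only the single terms become wildcard queries.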
I see nothing good in rewriting queries into prefix or wildcard queries. An orc or a chest has very little in common with an Orchestra, yet both words would match it. Instead, give your customer an analyzer that supports stemming and synonyms, and provide a spell-correction feature to fix simple search mistakes.
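@Simon Svensson probably gave the better answer (i.e. you don't need this), but if you do, you should use a Shingle Filter (strictly speaking, Lucene's ShingleFilter builds n-grams of whole words; character-level n-grams like the ones described next come from an n-gram filter such as NGramTokenFilter).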
Note that this will make your index massively larger, since instead of just storing "orchestra", you will store "orc", "rch", "che", "hes"... But just having a plain term query with leading wildcards will be massively slow. It will essentially have to look through every single term in your corpus.
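A rough illustration of that trade-off, again in plain Python rather than Lucene: all function names here are made up for the sketch, and a fixed gram size of 3 is an assumption. `contains_scan` models the leading-wildcard approach (touch every term), while `contains_ngram` looks up only the grams of the search text in a larger, precomputed index.

```python
from collections import defaultdict

def ngrams(term, n=3):
    """All character n-grams of a term, e.g. 'orc', 'rch', 'che', ..."""
    return {term[i:i + n] for i in range(len(term) - n + 1)}

def build_index(terms, n=3):
    """Map each n-gram to the set of terms containing it (the bigger index)."""
    index = defaultdict(set)
    for term in terms:
        for gram in ngrams(term, n):
            index[gram].add(term)
    return index

def contains_scan(terms, needle):
    """Leading-wildcard model: every term in the corpus is inspected."""
    return {t for t in terms if needle in t}

def contains_ngram(index, needle, n=3):
    """N-gram model: intersect the postings of the needle's grams, then verify."""
    grams = ngrams(needle, n)
    if not grams:  # needle shorter than n cannot be looked up this way
        return set()
    candidates = set.intersection(*(index.get(g, set()) for g in grams))
    return {t for t in candidates if needle in t}

terms = ["orchestra", "orchard", "chest", "fleet", "africa"]
index = build_index(terms)

print(len(ngrams("orchestra")))               # 7 grams stored instead of 1 term
print(sorted(contains_scan(terms, "rch")))    # ['orchard', 'orchestra']
print(sorted(contains_ngram(index, "rch")))   # ['orchard', 'orchestra']
```

Both approaches return the same terms here; the difference is that the n-gram lookup pays with index size up front, while the scan pays at query time for every term in the corpus.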