Solr NGramTokenizerFactory and PatternReplaceCharFilterFactory - Analyzer results inconsistent with Query Results

2023-03-15 12:47

I am currently using what I (mistakenly) thought would be a fairly straightforward implementation of Solr's NGramTokenizerFactory, but I'm getting strange results that are inconsistent between the admin analyzer and actual query results, and I'm hoping for some guidance.

I am trying to get user inputs to match my NGram (minGramSize=2, maxGramSize=2) index. My schema for indexing and query time is below, in which

I strip all non-alphanumeric characters using PatternReplaceCharFilter.
I tokenize with NGramTokenizerFactory.
I lowercase using LowerCaseFilterFactory (which leaves non-letter tokens in place, so my numbers will remain).

Using the schema below, I would think that a search for "PCB-1260" (with a properly escaped dash) should match an indexed NGram tokenized and lowercased value of "Arochlor-1260" (i.e., the bigrams for 1260 are "12 26 60" in both the indexed value and the queried value).
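To make the expected overlap concrete, here is a minimal Python sketch that imitates the three analysis steps described above (strip non-alphanumerics, emit bigrams, lowercase). It is only an approximation written for illustration: the function name `analyze` is hypothetical, and it does not reproduce Solr's token positions or offsets, only the set of terms I would expect the analyzer to emit.

```python
import re


def analyze(text, min_gram=2, max_gram=2):
    """Approximate the char filter -> n-gram tokenizer -> lowercase chain."""
    # Step 1: same pattern as the charFilter, drop everything but A-Z, a-z, 0-9.
    stripped = re.sub(r"[^A-Za-z0-9]", "", text)
    # Step 2: emit every n-gram of each allowed size at each start offset.
    grams = []
    for start in range(len(stripped)):
        for size in range(min_gram, max_gram + 1):
            if start + size <= len(stripped):
                grams.append(stripped[start:start + size])
    # Step 3: lowercase each gram (digits are unaffected).
    return [g.lower() for g in grams]


query_terms = analyze("PCB-1260")
index_terms = analyze("Arochlor-1260")
shared = sorted(set(query_terms) & set(index_terms))
print(shared)  # ['12', '26', '60']
```

Under these assumptions the query and the indexed value share exactly the bigrams "12", "26" and "60", which is the match I expected to see at query time.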

Unfortunately, I get no results unless I delete the dash. [EDIT - even when I properly escape the dash and leave it in the query, I also get no results]. This seems odd because I'm doing a complete pattern replacement of all non-alphanumeric characters using PatternReplaceCharFilter - which I assume removes all whitespace and dashes.

The query analyzer in the admin page shows proper matching using the schema below - so I'm at a bit of a loss. Is there something fundamental about the PatternReplaceCharFilter or the NGramTokenizerFactory that I'm missing here?

I've checked the code and other posts, but can't seem to figure this one out. After a week of banging my head against the wall, I submit this one to the authority of the stack....

<fieldType name="tokentext" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
        <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="([^A-Za-z0-9])" replacement=""/>
        <tokenizer class="solr.NGramTokenizerFactory" minGramSize="2" maxGramSize="2"/>
        <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
    <analyzer type="query">
        <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="[^A-Za-z0-9]" replacement=""/>
        <tokenizer class="solr.NGramTokenizerFactory" minGramSize="2" maxGramSize="2"/>
        <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
</fieldType>

So - something is definitely odd with PatternReplaceCharFilter failing to remove dashes at query time. Ultimately, I just did some pre-query processing of the user input in PHP with preg_replace before sending it to Solr, and - voilà! - it worked like a charm with the expected results. Puzzling that the PatternReplaceCharFilter wasn't behaving...

Here's the pre-query PHP code that I used to get rid of the dashes, if anyone needs it.

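// Replace every dash in the raw user input with a space.
$pattern = '/([-])/';
$replacement = ' ';
$usrpar = preg_replace($pattern, $replacement, $raw_user_search_contents);
// Convert HTML special characters (including quotes) to entities.
$res = htmlentities($usrpar, ENT_QUOTES, 'utf-8');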

After that, I just passed $res to Solr...
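For anyone not on PHP, the same pre-query cleanup can be sketched in Python. This is only an illustration under assumptions: the function name `build_select_url`, the host, and the core name `mycore` are hypothetical placeholders, and the sketch URL-encodes the cleaned string rather than calling an HTML-entity encoder.

```python
import re
from urllib.parse import urlencode


def build_select_url(raw_input,
                     base_url="http://localhost:8983/solr/mycore/select"):
    """Replace dashes with spaces, then build a Solr select URL."""
    # Same idea as the preg_replace call: every dash becomes a space.
    cleaned = re.sub(r"-", " ", raw_input)
    # urlencode escapes the query string; spaces are sent as '+'.
    params = {"q": cleaned, "wt": "json"}
    return base_url + "?" + urlencode(params)


url = build_select_url("PCB-1260")
print(url)
```

The dash never reaches Solr this way, so the query no longer depends on how the char filter behaves at query time.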
