Solr: strip punctuation before index

I am having a problem stripping punctuation from the Solr index. When a punctuation mark immediately follows a word, that word is not indexed properly.

For example, if we index "hello, John", the asset won't be found by the keyword "hello", whereas there is no issue if we remove the comma after "hello".

Is there any FilterFactory that is supposed to strip punctuation? Any ideas?

Thanks, Bogdan.


You can use solr.PatternReplaceFilterFactory to strip leading and trailing punctuation from each token with this:

<filter class="solr.PatternReplaceFilterFactory"
    pattern="^\p{Punct}*(.*?)\p{Punct}*$"
    replacement="$1"/>
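
For context, here is a minimal sketch of where that filter might sit in schema.xml. The field type name text_nopunct and the choice of WhitespaceTokenizerFactory are assumptions made for illustration; the answer itself does not name a tokenizer. The filter runs on each token separately, so ^ and $ anchor to the start and end of a token, not of the whole field value.

```xml
<!-- Hypothetical field type: the name and the tokenizer are illustrative assumptions -->
<fieldType name="text_nopunct" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Splits on whitespace only, so "hello," reaches the filter as a single token -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- Drops leading and trailing punctuation: "hello," becomes "hello" -->
    <filter class="solr.PatternReplaceFilterFactory"
            pattern="^\p{Punct}*(.*?)\p{Punct}*$"
            replacement="$1"/>
  </analyzer>
</fieldType>
```

Because there is a single analyzer element, the same chain applies at index time and at query time, so a query for "hello," is normalized the same way as the indexed text. Documents that are already indexed have to be reindexed before the change takes effect.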

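And if you want to strip all punctuation at the beginning and end except (for example) a dollar sign in front of a word, you can use the pattern below. The && of the Java character-class intersection has to be written as &amp;&amp; because the pattern sits inside an XML attribute:

<filter class="solr.PatternReplaceFilterFactory"
    pattern="^[\p{Punct}&amp;&amp;[^$]]*(.*?)\p{Punct}*$"
    replacement="$1"/>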


This can also be done with WordDelimiterFilterFactory: set generateWordParts="1".

There is also the PatternTokenizerFactory that could be used, but I have never tried it.
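
A rough sketch of what those two alternatives might look like in schema.xml. The field type names, the WhitespaceTokenizerFactory, and the tokenizer pattern are assumptions for illustration only. All other WordDelimiterFilterFactory attributes are left at their defaults.

```xml
<!-- Alternative 1 (hypothetical name): let the word delimiter filter drop punctuation -->
<fieldType name="text_wdf" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- Splits tokens at non-alphanumeric characters and emits the word parts -->
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"/>
  </analyzer>
</fieldType>

<!-- Alternative 2 (hypothetical name): tokenize on runs of whitespace or punctuation -->
<fieldType name="text_pattern_tok" class="solr.TextField">
  <analyzer>
    <!-- group="-1" treats the pattern as a delimiter, like String.split -->
    <tokenizer class="solr.PatternTokenizerFactory"
               pattern="[\s\p{Punct}]+" group="-1"/>
  </analyzer>
</fieldType>
```

Both variants are more aggressive than the PatternReplaceFilterFactory approach because they also break tokens at punctuation inside a word (for example, an apostrophe or a hyphen), so it is worth checking the result on the Analysis screen of the Solr admin UI before reindexing.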


Use PatternReplaceFilterFactory:

<analyzer>
  <!-- tokenizer goes here (not shown in the original snippet) -->
  <!-- remove punctuation -->
  <filter class="solr.PatternReplaceFilterFactory" pattern="^(\p{Punct}*)(.*?)(\p{Punct}*)$" replacement="$2"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.StandardFilterFactory"/>
  <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>

...
