In Solr, why is 'built' not being stemmed to 'build' but 'building' is?
I'm trying to figure out two things in this posting:
1. Why is 'built' not being stemmed to 'build', even though the field type definition includes a stemmer, while 'building' is being stemmed to 'build'?
2. How can I use Luke to examine the index and see which words were stemmed, and to what? I wasn't able to see 'building' stemmed to 'build' in Luke. I know Lucene is stemming it, because searching for 'build' successfully retrieves the row containing 'building'.
This link was pretty helpful but didn't answer my questions.
For reference, here are the relevant portions of schema.xml.
<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<!-- in this example, we will only use synonyms at query time
<filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
-->
<!-- Case insensitive stop word removal.
add enablePositionIncrements=true in both the index and query
analyzers to leave a 'gap' for more accurate phrase queries.
-->
<filter class="solr.StopFilterFactory"
ignoreCase="true"
words="stopwords_en.txt"
enablePositionIncrements="true"
/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EnglishPossessiveFilterFactory"/>
<filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
<!-- Optionally you may want to use this less aggressive stemmer instead of PorterStemFilterFactory:
<filter class="solr.EnglishMinimalStemFilterFactory"/>
-->
<filter class="solr.PorterStemFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
<filter class="solr.StopFilterFactory"
ignoreCase="true"
words="stopwords_en.txt"
enablePositionIncrements="true"
/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EnglishPossessiveFilterFactory"/>
<filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
<!-- Optionally you may want to use this less aggressive stemmer instead of PorterStemFilterFactory:
<filter class="solr.EnglishMinimalStemFilterFactory"/>
-->
<filter class="solr.PorterStemFilterFactory"/>
</analyzer>
</fieldType>
and the field definition is
<field name="features" type="text_en" indexed="true" stored="true" multiValued="true"/>
The data set consists of multiple documents: one document has 'building' in the features field, one has 'built' in the same field, and one has 'Built-in' in the features field:
file hd.xml:
<field name="features">building NoiseGuard, SilentSeek technology, Fluid Dynamic Bearing (FDB) motor</field>
file ipod_video.xml:
<field name="features">Notes, Calendar, Phone book, Hold button, Date display, Photo wallet, Built-in games, JPEG photo playback, Upgradeable firmware, USB 2.0 compatibility, Playback speed control, Rechargeable capability, Battery level indication</field>
file sd500.xml:
<field name="features">built in flash, red-eye reduction</field>
Using Lukeall-3.3.0, this is the result I get from searching for 'features:build'. Notice that I get back 1 document instead of the expected 3.
Even within that one document, I don't see the stemming; that is, I only see the original word, 'building', as shown:
And, again in Luke, searching for 'features:built' returns two documents:
Selecting one of them shows the original 'built' but not 'build'.
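The Porter stemmer is a rule-based suffix stripper: it reduces 'building' to 'build' by removing the '-ing' ending, but it has no dictionary of irregular forms, so 'built' is left unchanged. For exceptional cases like this, you can override the stemmer's output for specific words with StemmerOverrideFilter (in schema.xml, solr.StemmerOverrideFilterFactory).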
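As a rough sketch of what that could look like in the text_en analyzer chain from the question: the fragment below assumes your Solr version ships solr.StemmerOverrideFilterFactory, and the dictionary file name stemdict.txt is a placeholder chosen for illustration, not something from the original schema. The same filter would need to go into both the index and query analyzers, and the documents would need to be re-indexed afterwards.

```xml
<!-- Sketch only: stemdict.txt is a hypothetical file name.
     Each line of that file holds a word and the stem it should map to,
     separated by a tab character, for example:
     built<TAB>build -->
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EnglishPossessiveFilterFactory"/>
<filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
<!-- Placed before the stemmer: tokens rewritten here are marked as keywords,
     so PorterStemFilterFactory leaves them untouched. -->
<filter class="solr.StemmerOverrideFilterFactory" dictionary="stemdict.txt" ignoreCase="true"/>
<filter class="solr.PorterStemFilterFactory"/>
```

Mapping 'built' to 'build' works here because the Porter stemmer also leaves 'build' as 'build', so the overridden token lines up with what 'building' and a query for 'build' produce.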