SOLR not searching on certain fields

2022-12-11 06:31

Just installed Solr, edited the schema.xml, and am now trying to index it and search on it with some test data.

In the XML file I'm sending to Solr, one of my fields looks like this:

<field name="PageContent"><![CDATA[<p>some text in a paragrah tag</p>]]></field>

There's HTML there, so I've wrapped it in CDATA.

In my Solr schema.xml, the definition for that field looks like this:

<field name="PageContent" type="text" indexed="true" stored="true"/>

When I ran the POSTing tool, everything went OK, but when I search for content that I know is inside the PageContent field, I get no results.


However, when I set the <defaultSearchField> node to PageContent, it works. But if I set it to any other field, it doesn't search in PageContent.

Am I doing something wrong? What's the issue?

To clarify on the error:

I've uploaded a "doc" with the following data:

<field name="PageID">928</field>
<field name="PageName">some name</field>
<field name="PageContent"><![CDATA[<p>html content</p>]]></field>

In my schema I've defined the fields as such:

<field name="PageID" type="integer" indexed="true" stored="true" required="true"/>
<field name="PageName" type="text" indexed="true" stored="true"/>
<field name="PageContent" type="text" indexed="true" stored="true"/>

And:

<uniqueKey>PageID</uniqueKey>
<defaultSearchField>PageName</defaultSearchField>

Now, when I use the Solr admin tool and search for "some name", I get a result. But if I search for "html content", "html", "content" or "928", I get no results.

Why?

You mentioned that your default search field is set to PageName, so I wouldn't expect a search for "content" to return anything.

You probably meant to put "PageContent:content" in the search box to find data in that field. If you want to search against multiple fields, take a look at http://wiki.apache.org/solr/DisMaxRequestHandler. The Solr admin console is not a great tool for experimenting with all the DisMax search options; you'll want to manipulate the URL directly for that.
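As a rough illustration of manipulating the URL directly, here is a minimal Python sketch that builds both kinds of query. The base URL http://localhost:8983/solr/select and the helper names are assumptions for illustration, and defType=dismax assumes a Solr version that supports that parameter (older setups select a named handler with qt instead).

```python
from urllib.parse import urlencode

# Assumed base URL of a local Solr instance; adjust host, port, and path to your setup.
SOLR_SELECT = "http://localhost:8983/solr/select"


def field_query_url(field, term):
    """Search one named field instead of the default search field."""
    return SOLR_SELECT + "?" + urlencode({"q": f"{field}:{term}"})


def dismax_query_url(term, fields):
    """Search several fields at once; qf is a space-separated field list."""
    # defType=dismax is an assumption about the Solr version in use.
    params = {"defType": "dismax", "q": term, "qf": " ".join(fields)}
    return SOLR_SELECT + "?" + urlencode(params)


print(field_query_url("PageContent", "content"))
print(dismax_query_url("content", ["PageName", "PageContent"]))
```

Pasting either printed URL into a browser is usually enough to see whether the field itself is searchable.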

Regardless, I agree with the previous poster: if your analysis isn't set up properly to deal with HTML, you are likely to get all sorts of unexpected search results. Strip the HTML out and index text only.

If you want your standard query handler to search against all your fields, you can change it in your solrconfig.xml (I always add a second query handler instead of modifying "standard"). The qf parameter is the list of fields you want to search against. It's a space-separated list.

<requestHandler name="standard" class="solr.DisMaxRequestHandler">
    <lst name="defaults">
        <str name="echoParams">all</str>
        <str name="hl">true</str>
        <str name="fl">*</str>
        <str name="qf">PageName PageContent</str>
    </lst>
</requestHandler>

You are making sure that your data has been committed before you attempt to search on it, right?
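If you are not sure whether a commit happened, one way to issue it explicitly is to POST a <commit/> message to the update handler. The sketch below only builds the request; the endpoint http://localhost:8983/solr/update and the helper name are assumptions about a default local install.

```python
import urllib.request

# Assumed update endpoint of a local Solr instance; adjust to your setup.
SOLR_UPDATE = "http://localhost:8983/solr/update"


def build_commit_request(update_url=SOLR_UPDATE):
    """Build (but do not send) a POST request carrying Solr's <commit/> message."""
    return urllib.request.Request(
        update_url,
        data=b"<commit/>",
        headers={"Content-Type": "text/xml; charset=utf-8"},
        method="POST",
    )


request = build_commit_request()
print(request.get_method(), request.full_url, request.data)
# To actually send it: urllib.request.urlopen(request)
```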

Also, if your content is raw HTML, it's probably best to strip the HTML before indexing it. You can do this in your application or with Solr's solr.HTMLStripWhitespaceTokenizerFactory, like:

<tokenizer class="solr.HTMLStripWhitespaceTokenizerFactory"/>

You declare this in the fieldtype definition for "text". You might want to create a new field type just for your HTML, maybe something like text_html, and use it like so:

<fieldtype name="text_html" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
        <tokenizer class="solr.HTMLStripWhitespaceTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    </analyzer>
    <analyzer type="query">
        <tokenizer class="solr.HTMLStripWhitespaceTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    </analyzer>
</fieldtype>
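For the other option mentioned above, stripping the HTML in your application before posting to Solr, here is a minimal sketch using only Python's standard library. The helper name strip_html is hypothetical, and this is not how Solr's tokenizer works internally; it simply drops tags and keeps the text.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML fragment and ignores the tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(markup):
    """Return the text of `markup` with tags removed and whitespace collapsed."""
    parser = _TextExtractor()
    parser.feed(markup)
    parser.close()
    # Text nodes are joined with a space, so a word split by an inline tag
    # (for example "<b>bo</b>ld") would come out as two words.
    return " ".join(" ".join(parser.parts).split())


print(strip_html("<p>html content</p>"))
```

The cleaned string can then be sent as the PageContent value, which also removes the need for the CDATA wrapper.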

I am not sure what you mean by:

However, when I set the <defaultSearchField> node to PageContent, it works. But if I set it to any other field, it doesn't search in PageContent.

Can you please elaborate?

fl is the list of fields returned by the query. qf is the parameter you meant to refer to, and it doesn't support wildcards.

The only way to search all fields without listing them is to have a copyField that catches all values (indexed only, not stored), then mimic searching against all fields by searching against that one field.

In my schema.xml I have something like the following, which copies the value of each field ending with _t into the text field.

<defaultSearchField>text</defaultSearchField>
<copyField source="*_t" dest="text" maxChars="3000"/>
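To make the effect of that copyField rule concrete, here is a small Python model of it. This is only an illustration of the behaviour described above, not Solr code: the function name and the sample field names title_t and body_t are made up, and maxChars is modelled as a simple per-value truncation.

```python
from fnmatch import fnmatchcase


def apply_copy_field(doc, source_pattern, dest, max_chars=None):
    """Model a copyField rule: copy every matching source value into `dest`."""
    copied = []
    for name, value in doc.items():
        if name != dest and fnmatchcase(name, source_pattern):
            text = str(value)
            # max_chars mimics the maxChars attribute by truncating each copied value.
            copied.append(text if max_chars is None else text[:max_chars])
    result = dict(doc)
    result[dest] = copied  # the destination behaves like a multi-valued field
    return result


# Hypothetical document: only the fields ending in _t are copied.
doc = {"title_t": "some name", "body_t": "html content", "id": 928}
indexed = apply_copy_field(doc, "*_t", "text", max_chars=3000)
print(indexed["text"])
```

Because the id field does not match *_t, a search against text would not find 928 in this model, which mirrors why fields left out of the pattern stay unsearchable through the catch-all field.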

The parameter fl does not specify the fields to query against, but the fields to return in the response.

You could just add to schema.xml:

<field name="fieldContainingEverything" type="text" indexed="true" stored="true" multiValued="true"/>

<defaultSearchField>fieldContainingEverything</defaultSearchField>

<copyField source="*" dest="fieldContainingEverything" maxChars="3000"/>

Now when indexing, every field is copied to fieldContainingEverything. The problem here is that you lose track of which field the content came from, which matters if you want to use that information later. I would be glad if someone had an idea about that.

I found a somewhat functional solution:

To describe the scenario in a bit more detail: I have a MySQL database table with a lot of fields to index, and I import every field without naming each one (SELECT * FROM...). I want to query the index against every field of the table and know which field matched the query. This is not possible out of the box, as the highlighter just tells you that the matching field is fieldContainingEverything. With the dismax query handler, I found that even though it is said to search in every field, I can't get it to search fields that are not listed in the qf parameter. The idea now is to additionally index every field by adding:

<dynamicField name="*"  type="string"  indexed="true"  stored="true"/>

to your schema.xml. Now, when you query Solr via dismax with hl=true&hl.fl=*, you add qf=fieldContainingEverything^1 to your parameter list. Solr now searches through every indexed field, but also highlights every field containing the query term. The obvious downside of this method is the increased index size, which I assume is not that relevant in most cases.
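Put together, the request described above could be built like this. It is a sketch only: the base URL and the helper name are assumptions, and defType=dismax stands in for however your setup selects the dismax handler.

```python
from urllib.parse import urlencode

# Assumed base URL of a local Solr instance; adjust to your setup.
SOLR_SELECT = "http://localhost:8983/solr/select"


def highlight_query_url(term, catch_all="fieldContainingEverything"):
    """Query the catch-all field via dismax and ask for highlights on all fields."""
    params = {
        "defType": "dismax",       # assumption: selects the dismax parser
        "q": term,
        "qf": f"{catch_all}^1",    # search only the catch-all field
        "hl": "true",              # turn highlighting on
        "hl.fl": "*",              # highlight every field that contains the term
    }
    return SOLR_SELECT + "?" + urlencode(params)


url = highlight_query_url("content")
print(url)
```

The highlighting section of the response should then name the individual fields that contain the term, which is the information the catch-all field alone cannot give you.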
