开发者

Indexing and Querying URLS in Solr

I have a database of URLs that I would like to search. Because URLs are not always written the same (may or may not have www), I am looking for the correct way to Index and Query urls. I've tried a few things, and I think I'm close but not sure why it doesn't work:

Here is my custom field type:

 <fieldType name="customUrlType" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="0" preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="sol开发者_运维问答r.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="0" preserveOriginal="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

For example:

http://www.twitter.com/AndersonCooper when indexed, will have the following words in different positions: http,www,twitter,com,andersoncooper

If I search for simply twitter.com/andersoncooper, I would like this query to match the record that was indexed, which is why I also use the WDF to split the search query, however the search query ends up being like so:

myfield:("twitter com andersoncooper") when really want it to match all records that have all of the following separate words: twitter com andersoncooper

Is there a different query filter or tokenizer I should be using?


If I understand this statement from your question

myfield:("twitter com andersoncooper") when really want it to match all records that have all of the following separate words: twitter com andersoncooper

You are trying to write a query that would match both:

http://www.twitter.com/AndersonCooper

and

http://www.andersoncooper.com/socialmedia/twitter

(both links contain all of the tokens), but not match either

http://www.facebook.com/AndersonCooper 

or

http://www.twitter.com/AliceCooper

If that is correct, your existing configuration should work just fine. Assuming that you are using the standard query parser and you are querying via curl or some other url based mechanism, you need the query parameter to look like this:

&q=myField:andersoncooper AND myField:twitter AND myField:com

One of the gotchas that may have been tripping you up is that the default query operator (between terms in a query) is "OR", which is why the AND's must be explicitly specified above. Alternately to save some space, you can change the default query operator to "AND" like this:

&q.op=AND&q=myField:(andersoncooper twitter com)


This should be the most simplest solution:

<field name="iconUrl" type="string" indexed="true" stored="true" />

But for you requirement you will need to make it multivalued and index it 1. no changes 2. without http 3. without www

or make the URL searchable via wildcards at the front (which is slower I guess)


You can try the keyword tokenizer

From the book Solr 1.4 Enterprise Search Server published by Packt

KeywordTokenizerFactory: This doesn't actually do any tokenization or anything at all for that matter! It returns the original text as one term. There are cases where you have a field that always gets one word, but you need to do some basic analysis like lowercasing. However, it is more likely that due to sorting or faceting requirements you will require an indexed field with no more than one term. Certainly a document's identifier field, if supplied and not a number, would use this.

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜