Indexing and Querying URLs in Solr
I have a database of URLs that I would like to search. Because URLs are not always written the same way (they may or may not include www, for example), I am looking for the correct way to index and query them. I've tried a few things, and I think I'm close, but I'm not sure why it doesn't work.
Here is my custom field type:
<fieldType name="customUrlType" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="0" preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="0" preserveOriginal="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
For example:
http://www.twitter.com/AndersonCooper when indexed, will have the following words in different positions: http,www,twitter,com,andersoncooper
If I search for just twitter.com/andersoncooper, I would like the query to match the record that was indexed, which is why I also use the WordDelimiterFilter (WDF) to split the search query. However, the parsed query ends up looking like this:
myfield:("twitter com andersoncooper") when I really want it to match all records that contain all of the following separate words: twitter com andersoncooper
Is there a different query filter or tokenizer I should be using?
If I understand this statement from your question correctly:
myfield:("twitter com andersoncooper") when really want it to match all records that have all of the following separate words: twitter com andersoncooper
then you are trying to write a query that would match both:
http://www.twitter.com/AndersonCooper
and
http://www.andersoncooper.com/socialmedia/twitter
(both links contain all of the tokens), but not match either
http://www.facebook.com/AndersonCooper
or
http://www.twitter.com/AliceCooper
If that is correct, your existing configuration should work just fine. Assuming that you are using the standard query parser and querying via curl or some other URL-based mechanism, the query parameter needs to look like this:
&q=myField:andersoncooper AND myField:twitter AND myField:com
One of the gotchas that may have been tripping you up is that the default query operator (between terms in a query) is "OR", which is why the ANDs must be specified explicitly above. Alternatively, to save some space, you can change the default query operator to "AND" like this:
&q.op=AND&q=myField:(andersoncooper twitter com)
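If you would rather not send q.op with every request, the same default can be set on the request handler in solrconfig.xml. This is only a minimal sketch: the handler name /select and the df default pointing at myField are assumptions, so adjust them to whatever your configuration actually uses.

```xml
<!-- solrconfig.xml: request handler with AND as the default operator -->
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <!-- same effect as appending &q.op=AND to each request -->
    <str name="q.op">AND</str>
    <!-- assumed default field, so q=andersoncooper twitter com needs no field prefix -->
    <str name="df">myField</str>
  </lst>
</requestHandler>
```

Values under defaults can still be overridden per request, so a client that needs OR semantics can pass &q.op=OR explicitly.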
This should be the simplest solution:
<field name="iconUrl" type="string" indexed="true" stored="true" />
But for your requirement you will need to make the field multivalued and index each URL three ways: (1) unchanged, (2) without http, and (3) without www.
Alternatively, make the URL searchable via leading wildcards (which is slower, I guess).
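A rough sketch of the multivalued approach, assuming the iconUrl field from above is reused and that the indexing client generates the URL variants itself (a string field does no analysis). The id field and its value are hypothetical placeholders.

```xml
<!-- schema.xml: same string field as above, now allowing several values per document -->
<field name="iconUrl" type="string" indexed="true" stored="true" multiValued="true" />

<!-- update message sent by the client: one value per URL variant -->
<add>
  <doc>
    <field name="id">1</field> <!-- hypothetical unique key -->
    <field name="iconUrl">http://www.twitter.com/AndersonCooper</field> <!-- unchanged -->
    <field name="iconUrl">www.twitter.com/AndersonCooper</field>        <!-- without http -->
    <field name="iconUrl">twitter.com/AndersonCooper</field>            <!-- without http and www -->
  </doc>
</add>
```

Because string fields match exactly, a query such as iconUrl:"twitter.com/AndersonCooper" only hits when its case and form match one of the stored variants.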
You can try the keyword tokenizer.
From the book Solr 1.4 Enterprise Search Server, published by Packt:
KeywordTokenizerFactory: This doesn't actually do any tokenization or anything at all for that matter! It returns the original text as one term. There are cases where you have a field that always gets one word, but you need to do some basic analysis like lowercasing. However, it is more likely that due to sorting or faceting requirements you will require an indexed field with no more than one term. Certainly a document's identifier field, if supplied and not a number, would use this.
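As an illustration, a field type built on the keyword tokenizer might look like the sketch below. The type name urlKeyword is hypothetical, and the PatternReplaceFilterFactory step that strips a leading scheme and www. is an added assumption, not something the book excerpt prescribes; drop it if you only want lowercasing.

```xml
<!-- schema.xml: treat the whole URL as a single lowercased term -->
<fieldType name="urlKeyword" class="solr.TextField">
  <analyzer>
    <!-- emits the entire input as one token -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- assumption: normalize by removing a leading http:// or https:// and www. -->
    <filter class="solr.PatternReplaceFilterFactory" pattern="^(https?://)?(www\.)?" replacement="" replace="first"/>
  </analyzer>
</fieldType>
```

With the same analyzer applied at index and query time, http://www.twitter.com/AndersonCooper and twitter.com/andersoncooper would both reduce to the term twitter.com/andersoncooper. The trade-off is that only whole-URL (or prefix) matches are possible, since the URL is never split into words.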