Indexing and Querying URLS in Solr

2023-02-04 07:14 问答作者：

I have a database of URLs that I would like to search. Because URLs are not always written the same (may or may not have www), I am looking for the correct way to Index and Query urls. I've tried a few things, and I think I'm close but not sure why it doesn't work:

Here is my custom field type:

 <fieldType name="customUrlType" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="0" preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="sol开发者_运维问答r.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="0" preserveOriginal="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

For example:

http://www.twitter.com/AndersonCooper when indexed, will have the following words in different positions: http,www,twitter,com,andersoncooper

If I search for simply twitter.com/andersoncooper, I would like this query to match the record that was indexed, which is why I also use the WDF to split the search query, however the search query ends up being like so:

myfield:("twitter com andersoncooper") when really want it to match all records that have all of the following separate words: twitter com andersoncooper

Is there a different query filter or tokenizer I should be using?

If I understand this statement from your question

myfield:("twitter com andersoncooper") when really want it to match all records that have all of the following separate words: twitter com andersoncooper

You are trying to write a query that would match both:

http://www.twitter.com/AndersonCooper

and

http://www.andersoncooper.com/socialmedia/twitter

(both links contain all of the tokens), but not match either

http://www.facebook.com/AndersonCooper

http://www.twitter.com/AliceCooper

If that is correct, your existing configuration should work just fine. Assuming that you are using the standard query parser and you are querying via curl or some other url based mechanism, you need the query parameter to look like this:

&q=myField:andersoncooper AND myField:twitter AND myField:com

One of the gotchas that may have been tripping you up is that the default query operator (between terms in a query) is "OR", which is why the AND's must be explicitly specified above. Alternately to save some space, you can change the default query operator to "AND" like this:

&q.op=AND&q=myField:(andersoncooper twitter com)

This should be the most simplest solution:

<field name="iconUrl" type="string" indexed="true" stored="true" />

But for you requirement you will need to make it multivalued and index it 1. no changes 2. without http 3. without www

or make the URL searchable via wildcards at the front (which is slower I guess)

You can try the keyword tokenizer

From the book Solr 1.4 Enterprise Search Server published by Packt

KeywordTokenizerFactory: This doesn't actually do any tokenization or anything at all for that matter! It returns the original text as one term. There are cases where you have a field that always gets one word, but you need to do some basic analysis like lowercasing. However, it is more likely that due to sorting or faceting requirements you will require an indexed field with no more than one term. Certainly a document's identifier field, if supplied and not a number, would use this.

继续阅读：indexing querying solr tokenize

Indexing and Querying URLS in Solr

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？