
Solr: search that excludes longer phrases

For example, I have three documents:

1. "dog cat a ball"

2. "dog the cat of balls"

3. "dog the cat, ball and elephant"

When I query "dog AND cat AND ball", I want to receive only the first two documents.

The main idea is that the results should contain only documents made up of the words I requested, and nothing else.

I'd appreciate any advice.

Thank you.


Well, if you store term vectors (pass TermVector.YES when creating the Field, before adding the Document to the index), it can be done by overriding a Collector. Here is a simple implementation (it returns only the matching document IDs, without scores):

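// Imports for the enclosing file (Lucene 3.x API).
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.TermFreqVector;
import org.apache.lucene.search.Collector;
import org.apache.lucene.search.Scorer;

private static class MyCollector extends Collector {
    private final IndexReader ir;        // assumed to be the top-level reader of the searcher
    private final int numberOfTerms;     // number of distinct terms an accepted document must have
    private final Set<Integer> set = new HashSet<Integer>();
    private int docBase;                 // offset of the current segment within the top-level reader

    public MyCollector(IndexReader ir, int numberOfTerms) {
        this.ir = ir;
        this.numberOfTerms = numberOfTerms;
    }

    @Override
    public void setScorer(Scorer scorer) throws IOException { } // we do not use a scorer in this example

    @Override
    public void setNextReader(IndexReader reader, int docBase) {
        // collect() receives segment-relative IDs, so remember the offset
        this.docBase = docBase;
    }

    @Override
    public void collect(int doc) throws IOException {
        int globalDoc = docBase + doc;
        // CONTENT_FIELD is the name of the field you are searching in
        TermFreqVector vector = ir.getTermFreqVector(globalDoc, CONTENT_FIELD);
        if (vector != null) {
            // keep the document only if it has exactly as many distinct terms as requested
            if (vector.getTerms().length == numberOfTerms) {
                set.add(globalDoc);
            }
        } else {
            set.add(globalDoc); // assume this doesn't happen, because you stored your term vectors
        }
    }

    @Override
    public boolean acceptsDocsOutOfOrder() {
        return true;
    }

    public Set<Integer> getSet() {
        return set;
    }
}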

Now pass it to IndexSearcher#search(Query, Collector).

The idea is this: you know how many distinct terms a document must contain to be accepted, so you check that count and collect only the documents that satisfy the rule. The query itself already guarantees that the requested terms are present. Of course the check can be made more complex (looking for a specific term in the vector, or at the order of words), but this is the general idea.
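To make the counting rule concrete, here is a minimal sketch in plain Java with no Lucene dependency. The class name TermCountRule, the stopword list, and the crude plural stripping are all assumptions made for illustration; they stand in for whatever analyzer the real index uses. Unlike the Collector above, it also checks that the query terms are present, because there is no query engine doing that part.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper illustrating the "count the distinct terms" rule.
class TermCountRule {
    // Assumed stopword list; a real index would use its analyzer's list.
    private static final Set<String> STOPWORDS =
            new HashSet<String>(Arrays.asList("a", "an", "and", "of", "the"));

    // Lowercase, split on non-letters, drop stopwords, normalize what is left.
    static Set<String> distinctTerms(String text) {
        Set<String> terms = new HashSet<String>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (token.isEmpty() || STOPWORDS.contains(token)) {
                continue;
            }
            terms.add(normalize(token));
        }
        return terms;
    }

    // Very crude stand-in for a stemmer: "balls" becomes "ball".
    static String normalize(String token) {
        return token.length() > 3 && token.endsWith("s")
                ? token.substring(0, token.length() - 1)
                : token;
    }

    // Accept a document only if it holds every query term and nothing else.
    static boolean accepts(String document, List<String> queryTerms) {
        Set<String> docTerms = distinctTerms(document);
        return docTerms.size() == queryTerms.size()
                && docTerms.containsAll(queryTerms);
    }
}
```

With the query terms dog, cat and ball, this sketch accepts the first two example documents and rejects the third, which carries the extra term "elephant".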

Actually, once you store term vectors you can do almost anything with them, so just try working with them.


You could implement a filter factory/tokenizer pair with hashing capabilities.

  1. Use the copyField directive to feed a second field
  2. Tokenize the terms
  3. Remove stopwords (as in your example)
  4. Sort the terms in alphanumeric order and store the hash
  5. Expand the query to also search for the hash, something like:

somestring:"dog AND cat AND ball" AND somehash:"dog AND cat AND ball"

The second part of the query will be hashed implicitly during query processing.

This returns only exact matches (with a very, very small probability of false positives).
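The hashing step could be sketched as follows, again in plain Java rather than as a real Solr filter factory. The class name TermSetHash, the stopword list, the choice of SHA-1, and the de-duplication of repeated terms are assumptions made for illustration, not part of the answer above. No stemming is applied here, so "balls" and "ball" would hash differently unless a stemmer runs first.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper: canonical, order-independent hash of a term set.
class TermSetHash {
    // Assumed stopword list; "and" also swallows the AND operators of the query.
    private static final Set<String> STOPWORDS =
            new HashSet<String>(Arrays.asList("a", "an", "and", "of", "the"));

    // Tokenize, drop stopwords, sort (and de-duplicate) via TreeSet, then join.
    static String canonicalForm(String text) {
        TreeSet<String> sorted = new TreeSet<String>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!token.isEmpty() && !STOPWORDS.contains(token)) {
                sorted.add(token);
            }
        }
        return String.join(" ", sorted);
    }

    // Hex-encoded SHA-1 of the canonical form.
    static String hash(String text) {
        try {
            MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
            byte[] digest = sha1.digest(
                    canonicalForm(text).getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-1 not available", e);
        }
    }
}
```

At index time the hash of the document text would be stored in the hash field; at query time the same function is applied to the query string, so only documents whose whole term set equals the query's term set match on that field.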

P.S. You don't need to store term vectors with this approach, which results in a noticeably smaller index.
