Solr: search excludes bigger phrases

2023-03-15 08:56

For example, I have three documents:

1. "dog cat a ball"

2. "dog the cat of balls"

3. "dog the cat, ball and elephant"

When I query "dog AND cat AND ball", I want to receive only the first two documents.

The main idea is that the results should include only documents whose words are all ones I requested.

I'd appreciate any advice.

Thank you.

Well, if you store term vectors (when creating a Field, before adding the Document to the index, use TermVector.YES), this can be done by overriding a Collector. Here is a simple implementation (it returns only the matching documents, without scores):

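// Imports for the enclosing source file (Lucene 3.x API).
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.TermFreqVector;
import org.apache.lucene.search.Collector;
import org.apache.lucene.search.Scorer;

private static class MyCollector extends Collector {
    private final IndexReader ir;          // top-level reader of the whole index
    private final int numberOfTerms;       // number of distinct terms an accepted document must have
    private final Set<Integer> set = new HashSet<Integer>();
    private int docBase;                   // offset of the current segment within the index

    public MyCollector(IndexReader ir, int numberOfTerms) {
        this.ir = ir;
        this.numberOfTerms = numberOfTerms;
    }

    @Override
    public void setScorer(Scorer scorer) throws IOException { } // we do not use a scorer in this example

    @Override
    public void setNextReader(IndexReader reader, int docBase) {
        // collect() receives segment-relative ids, so remember the offset of this segment
        this.docBase = docBase;
    }

    @Override
    public void collect(int doc) throws IOException {
        int globalDoc = docBase + doc; // id valid for the top-level reader
        // CONTENT_FIELD is the name of the field you are searching in,
        // defined as a constant in the enclosing class
        TermFreqVector vector = ir.getTermFreqVector(globalDoc, CONTENT_FIELD);
        if (vector != null) {
            // accept only documents with exactly the expected number of distinct terms
            if (vector.getTerms().length == numberOfTerms) {
                set.add(globalDoc);
            }
        } else {
            set.add(globalDoc); // should not happen, because you stored your term vectors
        }
    }

    @Override
    public boolean acceptsDocsOutOfOrder() {
        return true;
    }

    public Set<Integer> getSet() {
        return set;
    }
}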

Now pass the collector to IndexSearcher#search(Query, Collector) and read the accepted document ids from getSet().

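The idea is this: you know how many distinct terms a document must contain to be accepted, so you verify that count and collect only the documents that satisfy it. For the example documents this assumes the analyzer removes stop words and stems plurals, so that document 2 is indexed as dog, cat, ball. Of course the check can be more complex (looking for a specific term in the vector, or the order of words in the vector), but this is the general idea.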
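The acceptance rule can be tried out without an index. The following is only a minimal sketch in plain Java, not Lucene code: the class name, the stop-word list and the naive plural stripping are assumptions made for illustration, standing in for whatever analyzer the field really uses.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Illustration only: mimics the "number of distinct terms" check outside Lucene.
final class TermCountFilterSketch {

    // Hypothetical stop-word list, just large enough for the example documents.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "the", "of", "and"));

    // Toy analyzer (an assumption): lowercase, split on non-letters,
    // drop stop words, strip a trailing "s" from words longer than three letters.
    static Set<String> analyze(String text) {
        Set<String> terms = new LinkedHashSet<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (token.isEmpty() || STOP_WORDS.contains(token)) {
                continue;
            }
            if (token.length() > 3 && token.endsWith("s")) {
                token = token.substring(0, token.length() - 1);
            }
            terms.add(token);
        }
        return terms;
    }

    // Returns the positions of documents that contain every query term
    // and have exactly as many distinct terms as the query.
    static List<Integer> search(List<String> docs, String... queryTerms) {
        Set<String> wanted = new HashSet<>(Arrays.asList(queryTerms));
        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i < docs.size(); i++) {
            Set<String> docTerms = analyze(docs.get(i));
            if (docTerms.containsAll(wanted) && docTerms.size() == wanted.size()) {
                hits.add(i);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList(
                "dog cat a ball",
                "dog the cat of balls",
                "dog the cat, ball and elephant");
        System.out.println(search(docs, "dog", "cat", "ball"));
    }
}
```

With the three example documents, the sketch accepts the first two (positions 0 and 1) and rejects the third, because "elephant" adds a fourth distinct term.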

Actually, if you store term vectors, you can do almost anything, so just try working with them.

You may implement a filter factory/tokenizer pair with hashing capabilities:

1. Use the copyField directive to copy the text into a second field.
2. Tokenize the terms.
3. Remove stop words (needed for your example).
4. Sort the terms in alphanumeric order and save the hash.
5. Expand the query to also search for the hash, something like:

somestring:"dog AND cat AND ball" AND somehash:"dog AND cat AND ball"

The second part of the search query will be implicitly hashed during query processing.

This returns only exact matches (with a vanishingly small probability of false positives).

P.S. With this approach you don't need to store term vectors, which results in a noticeably smaller index.
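The hashing steps above can be sketched in plain Java as follows. This is not a Solr filter factory, only an illustration of what such a filter might compute. The class name, the stop-word list, the naive plural stripping and the choice of SHA-256 are all assumptions. The same function is applied to documents at index time and to the query string at query time.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;

// Illustration only: computes an order-independent hash of a text's term set.
final class TermSetHashSketch {

    // Hypothetical stop-word list; "and" also swallows the AND operators of the query.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "the", "of", "and"));

    // Tokenize, remove stop words, sort the distinct terms, then hash the joined terms.
    static String signature(String text) {
        TreeSet<String> terms = new TreeSet<>(); // keeps terms sorted and distinct
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (token.isEmpty() || STOP_WORDS.contains(token)) {
                continue;
            }
            // Naive plural stripping (an assumption) so that "balls" matches "ball".
            if (token.length() > 3 && token.endsWith("s")) {
                token = token.substring(0, token.length() - 1);
            }
            terms.add(token);
        }
        String joined = String.join(" ", terms);
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] digest = md.digest(joined.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required on every Java platform, so this is not expected.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String query = signature("dog AND cat AND ball");
        System.out.println(query.equals(signature("dog cat a ball")));
        System.out.println(query.equals(signature("dog the cat of balls")));
        System.out.println(query.equals(signature("dog the cat, ball and elephant")));
    }
}
```

Because the terms are sorted before hashing, word order does not matter, and any additional term in a document (such as "elephant") changes the hash, so the document no longer matches.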
