Lucene custom scoring for numeric fields

2023-03-04 20:22 问答作者：

I would like to have, in addition to standard term search with tf-idf similarity over text content field, scoring based on "similarity" of numeric fields. This similarity will be depending on distance between the value in query and in document (e.g. gaussian with m= [user input], s= 0.5)

I.e. let's say documents represent people, and pers开发者_如何学Pythonon document have two fields:

description (full text)
age (numeric).

I want to find documents like

description:(x y z) age:30

but age to be not the filter, but rather part of score (for person of age 30 multiplier will be 1.0, for 25-year-old person 0.8 etc.)

Can this be achieved in a sensible manner?

EDIT: Finally I found out this can be done by wrapping ValueSourceQuery and TermQuery with CustomScoreQuery. See my solution below.

EDIT 2: With fast-changing versions of Lucene, I just want to add that it was tested on Lucene 3.0 (Java).

Okay, so here's (a bit verbose) proof-of-concept as a full JUnit test. Haven't tested its efficiency yet for large index, but from what I've read probably after a warm-up it should perform well, providing there's enough RAM available to cache numeric fields.

  package tests;

  import org.apache.lucene.analysis.Analyzer;
  import org.apache.lucene.analysis.WhitespaceAnalyzer;
  import org.apache.lucene.document.Document;
  import org.apache.lucene.document.Field;
  import org.apache.lucene.document.NumericField;
  import org.apache.lucene.index.IndexWriter;
  import org.apache.lucene.queryParser.QueryParser;
  import org.apache.lucene.search.IndexSearcher;
  import org.apache.lucene.search.Query;
  import org.apache.lucene.search.ScoreDoc;
  import org.apache.lucene.search.TopDocs;
  import org.apache.lucene.search.function.CustomScoreQuery;
  import org.apache.lucene.search.function.IntFieldSource;
  import org.apache.lucene.search.function.ValueSourceQuery;
  import org.apache.lucene.store.Directory;
  import org.apache.lucene.store.RAMDirectory;
  import org.apache.lucene.util.Version;

  import junit.framework.TestCase;

  public class AgeAndContentScoreQueryTest extends TestCase
  {
     public class AgeAndContentScoreQuery extends CustomScoreQuery
     {
        protected float peakX;
        protected float sigma;

        public AgeAndContentScoreQuery(Query subQuery, ValueSourceQuery valSrcQuery, float peakX, float sigma) {
           super(subQuery, valSrcQuery);
           this.setStrict(true); // do not normalize score values from ValueSourceQuery!
           this.peakX = peakX;   // age for which the age-relevance is best
           this.sigma = sigma;
        }

        @Override
        public float customScore(int doc, float subQueryScore, float valSrcScore){
           // subQueryScore is td-idf score from content query
           float contentScore = subQueryScore;

           // valSrcScore is a value of date-of-birth field, represented as a float
           // let's convert age value to gaussian-like age relevance score
           float x = (2011 - valSrcScore); // age
           float ageScore = (float) Math.exp(-Math.pow(x - peakX, 2) / 2*sigma*sigma);

           float finalScore = ageScore * contentScore;

           System.out.println("#contentScore: " + contentScore);
           System.out.println("#ageValue:     " + (int)valSrcScore);
           System.out.println("#ageScore:     " + ageScore);
           System.out.println("#finalScore:   " + finalScore);
           System.out.println("+++++++++++++++++");

           return finalScore;
        }
     }

     protected Directory directory;
     protected Analyzer analyzer = new WhitespaceAnalyzer();
     protected String fieldNameContent = "content";
     protected String fieldNameDOB = "dob";

     protected void setUp() throws Exception
     {
        directory = new RAMDirectory();
        analyzer = new WhitespaceAnalyzer();

        // indexed documents
        String[] contents = {"foo baz1", "foo baz2 baz3", "baz4"};
        int[] dobs = {1991, 1981, 1987}; // date of birth

        IndexWriter writer = new IndexWriter(directory, analyzer, IndexWriter.MaxFieldLength.UNLIMITED);
        for (int i = 0; i < contents.length; i++) 
        {
           Document doc = new Document();
           doc.add(new Field(fieldNameContent, contents[i], Field.Store.YES, Field.Index.ANALYZED)); // store & index
           doc.add(new NumericField(fieldNameDOB, Field.Store.YES, true).setIntValue(dobs[i]));      // store & index
           writer.addDocument(doc);
        }
        writer.close();
     }

     public void testSearch() throws Exception
     {
        String inputTextQuery = "foo bar";
        float peak = 27.0f;
        float sigma = 0.1f;

        QueryParser parser = new QueryParser(Version.LUCENE_30, fieldNameContent, analyzer);
        Query contentQuery = parser.parse(inputTextQuery);

        ValueSourceQuery dobQuery = new ValueSourceQuery( new IntFieldSource(fieldNameDOB) );
         // or: FieldScoreQuery dobQuery = new FieldScoreQuery(fieldNameDOB,Type.INT);

        CustomScoreQuery finalQuery = new AgeAndContentScoreQuery(contentQuery, dobQuery, peak, sigma);

        IndexSearcher searcher = new IndexSearcher(directory);
        TopDocs docs = searcher.search(finalQuery, 10);

        System.out.println("\nDocuments found:\n");
        for(ScoreDoc match : docs.scoreDocs)
        {
           Document d = searcher.doc(match.doc);
           System.out.println("CONTENT: " + d.get(fieldNameContent) );
           System.out.println("D.O.B.:  " + d.get(fieldNameDOB) );
           System.out.println("SCORE:   " + match.score );
           System.out.println("-----------------");
        }
     }
  }

This can be achieved using Solr's FunctionQuery

继续阅读：lucene scoring tf-idf

Lucene custom scoring for numeric fields

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？