
JOINS in Lucene

Is there any way to implement JOINS in Lucene?


You can also use the new BlockJoinQuery; I described it in a blog post here:

http://blog.mikemccandless.com/2012/01/searching-relational-content-with.html


You can do a generic join by hand - run two searches, get all results (instead of top N), sort them on your join key and intersect two ordered lists. But that's gonna thrash your heap real hard (if the lists even fit in it).
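
The sort-and-intersect step can be sketched in plain Java, without any Lucene calls. This is only an illustration under assumptions: each search result is modelled as a hypothetical {joinKey, docId} pair of ints, and the class and method names are made up for the example. Both complete result sets are held in memory, which is exactly the heap cost mentioned above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper: a merge join over two fully materialised result sets.
final class SortedKeyJoin {

    /**
     * Each input row is {joinKey, docId}. Returns {leftDocId, rightDocId}
     * pairs for every combination of rows that share a join key.
     */
    static List<int[]> join(int[][] left, int[][] right) {
        int[][] a = left.clone();
        int[][] b = right.clone();
        // Sort both sides on the join key (column 0).
        Arrays.sort(a, Comparator.comparingInt(r -> r[0]));
        Arrays.sort(b, Comparator.comparingInt(r -> r[0]));

        List<int[]> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < a.length && j < b.length) {
            int key = a[i][0];
            if (key < b[j][0]) {
                i++;
            } else if (key > b[j][0]) {
                j++;
            } else {
                // Equal keys: pair every left row in this run with every right row.
                int runStart = j;
                while (i < a.length && a[i][0] == key) {
                    for (j = runStart; j < b.length && b[j][0] == key; j++) {
                        out.add(new int[] {a[i][1], b[j][1]});
                    }
                    i++;
                }
            }
        }
        return out;
    }
}
```

The merge itself is linear once both sides are sorted; the expensive part is collecting and sorting all hits rather than the top N.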

There are possible optimizations, but only under very specific conditions.
For example, if you do a self-join and only use (random access) Filters for filtering, no Queries, then you can manually iterate the terms of your two join fields in parallel, intersect the docId lists for each term, filter them, and there's your join.

There's an approach that handles the popular use case of simple parent-child relationships with a relatively small number of children per document - https://issues.apache.org/jira/browse/LUCENE-2454
Unlike the flattening method mentioned by @ntziolis, this approach correctly handles cases like: have a number of resumes, each with multiple work_experience children, and try finding someone who worked at company NNN in year YYY. If simply flattened, you'll get back resumes for people that worked for NNN in any year & worked somewhere in year YYY.
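
To make the false positive concrete, here is a small plain-Java sketch. It does not use Lucene; the Job class and both method names are hypothetical and only model the two matching semantics described above.

```java
import java.util.List;

// Hypothetical model of a resume with work_experience children.
final class ResumeMatch {

    /** One work_experience child: a company and a year. */
    static final class Job {
        final String company;
        final int year;

        Job(String company, int year) {
            this.company = company;
            this.year = year;
        }
    }

    /** Flattened semantics: company and year may come from different children. */
    static boolean matchesFlattened(List<Job> jobs, String company, int year) {
        boolean anyCompany = false;
        boolean anyYear = false;
        for (Job job : jobs) {
            anyCompany |= job.company.equals(company);
            anyYear |= job.year == year;
        }
        return anyCompany && anyYear;
    }

    /** Parent-child semantics: a single child must satisfy both clauses. */
    static boolean matchesPerChild(List<Job> jobs, String company, int year) {
        for (Job job : jobs) {
            if (job.company.equals(company) && job.year == year) {
                return true;
            }
        }
        return false;
    }
}
```

For a resume with jobs (NNN, 2005) and (Other, 2010), a search for NNN in 2010 matches under the flattened semantics but not under the per-child semantics.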

An alternative for handling simple parent-child cases is indeed to flatten your doc, but to ensure that values for different children are separated by a big posIncrement gap, and then use a SpanNear query to prevent your subqueries from matching across children. There was a LinkedIn presentation about this a few years ago, but I failed to find it.
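
The idea behind the gap can be sketched as position arithmetic in plain Java. This is a simplified model, not Lucene's actual analysis chain or SpanNear slop semantics: the gap of 1000, the class, and the method names are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model: token positions in one flattened field.
final class PositionGapSketch {

    static final int GAP = 1000; // assumed gap, far larger than any child

    /** Assigns a position to each token, leaving GAP unused positions between children. */
    static Map<String, List<Integer>> index(List<List<String>> children) {
        Map<String, List<Integer>> positions = new HashMap<>();
        int pos = 0;
        for (List<String> child : children) {
            for (String token : child) {
                positions.computeIfAbsent(token, k -> new ArrayList<>()).add(pos);
                pos++;
            }
            pos += GAP; // next child starts far away from this one
        }
        return positions;
    }

    /** True if some occurrence of a lies within slop positions of some occurrence of b. */
    static boolean near(Map<String, List<Integer>> positions, String a, String b, int slop) {
        List<Integer> none = Collections.emptyList();
        for (int pa : positions.getOrDefault(a, none)) {
            for (int pb : positions.getOrDefault(b, none)) {
                if (Math.abs(pa - pb) <= slop) {
                    return true;
                }
            }
        }
        return false;
    }
}
```

As long as the slop is smaller than the gap, two terms can only be "near" each other when they belong to the same child.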


Lucene does not support relationships between documents. A join, however, is nothing more than a specific combination of AND clauses within parentheses, provided you flatten the relationship first.

Sample (SQL => Lucene):

SQL:

SELECT Order.* FROM Order
JOIN Customer ON Order.CustomerID = Customer.ID
WHERE Customer.Name = 'SomeName'
AND Order.Nr = 400

Lucene:
Make sure you have all the necessary fields and their respective values on the document, like: Customer.Name => "Customer_Name" and
Order.Nr => "Order_Nr"

The query would then be:

( Customer_Name:"SomeName" AND Order_Nr:"400" )
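
The flattening step itself can be sketched in plain Java, with maps standing in for Lucene documents. The class and method names are hypothetical; only the Customer_Name and Order_Nr field names come from the sample above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical model: denormalise an order and its customer into one flat document.
final class FlattenSketch {

    /** Copies both records into one map, prefixing each field with its table name. */
    static Map<String, String> flatten(Map<String, String> order, Map<String, String> customer) {
        Map<String, String> doc = new LinkedHashMap<>();
        order.forEach((field, value) -> doc.put("Order_" + field, value));
        customer.forEach((field, value) -> doc.put("Customer_" + field, value));
        return doc;
    }

    /** AND semantics: every required field must be present with the required value. */
    static boolean matchesAll(Map<String, String> doc, Map<String, String> required) {
        for (Map.Entry<String, String> clause : required.entrySet()) {
            if (!clause.getValue().equals(doc.get(clause.getKey()))) {
                return false;
            }
        }
        return true;
    }
}
```

The cost of this approach is that the customer's fields are repeated on every order document, so a change to the customer means reindexing all of its orders.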


https://issues.apache.org/jira/browse/SOLR-2272


Use JoinUtil. It allows query-time joins.

See: http://lucene.apache.org/core/4_0_0/join/org/apache/lucene/search/join/JoinUtil.html


A little late, but you could use the package org.apache.lucene.search.join: https://lucene.apache.org/core/6_3_0/join/org/apache/lucene/search/join/package-summary.html

From their documentation:

The index-time joining support joins while searching, where joined documents are indexed as a single document block using IndexWriter.addDocuments().

   // Imports needed by this fragment:
   // import org.apache.lucene.index.Term;
   // import org.apache.lucene.search.Query;
   // import org.apache.lucene.search.TermQuery;
   // import org.apache.lucene.search.TopDocs;
   // import org.apache.lucene.search.join.JoinUtil;
   // import org.apache.lucene.search.join.ScoreMode;
   // searchTerm, fromSearcher and toSearcher are supplied by the caller.
   String fromField = "from"; // Name of the from field
   boolean multipleValuesPerDocument = false; // Set to true only when your fromField has multiple values per document in your index
   String toField = "to"; // Name of the to field
   ScoreMode scoreMode = ScoreMode.Max; // Defines how the scores are translated into the other side of the join.
   Query fromQuery = new TermQuery(new Term("content", searchTerm)); // Query executed to collect from values to join to the to values

   Query joinQuery = JoinUtil.createJoinQuery(fromField, multipleValuesPerDocument, toField, fromQuery, fromSearcher, scoreMode);
   TopDocs topDocs = toSearcher.search(joinQuery, 10); // Note: toSearcher can be the same as the fromSearcher
   // Render topDocs...


There are some implementations on top of Lucene that make this kind of join across several different indexes possible. Numere (http://numere.stela.org.br/) enables that and makes it possible to get results as an RDBMS result set.


Here is an example. Numere provides an easy way to extract analytical data from Lucene indexes:

select a.type, sum(a.value) as "sales", b.category, count(distinct b.product_id) as "total"
from a (index)
inner join b (index) on (a.seq_id = b.seq_id)
group by a.type, b.category
order by a.type asc, b.category asc


    Join join = RequestFactory.newJoin();

    // inner join a.seq_id = b.seq_id

    join.on("seq_id", Type.INTEGER).equal("seq_id", Type.INTEGER);

    // left
    {
        Request left = join.left();
        left.repository(UtilTest.getPath("indexes/md/master"));
        left.addColumn("type").textType().asc();
        left.addMeasure("value").alias("sales").intType().sum();
    }

    // right
    {
        Request right = join.right();
        right.repository(UtilTest.getPath("indexes/md/detail"));
        right.addColumn("category").textType().asc();
        right.addMeasure("product_id").intType().alias("total").count_distinct();
    }

    Processor processor = ProcessorFactory.newProcessor();
    try {
        ResultPacket result = processor.execute(join);
        System.out.println(result);
    } finally {
        processor.close();
    }

Result:

<?xml version='1.0' encoding='UTF-8' standalone='yes' ?>
<DATAPACKET Version="2.0">
  <METADATA>
    <FIELDS>
      <FIELD attrname="type" fieldtype="string" WIDTH="20" />
      <FIELD attrname="category" fieldtype="string" WIDTH="20" />
      <FIELD attrname="sales" fieldtype="i8" />
      <FIELD attrname="total" fieldtype="i4" />
    </FIELDS>
    <PARAMS />
  </METADATA>
  <ROWDATA>
    <ROW type="Book" category="stand" sales="127003304" total="2" />
    <ROW type="Computer" category="eletronic" sales="44765715835" total="896" />
    <ROW type="Meat" category="food" sales="3193526428" total="110" />

... continue
