How to increase position offsets in a lucene index to correspond to <p> tags?

2023-02-28 13:29

I am using Lucene 3.0.3. In preparation for using SpanQuery and PhraseQuery, I would like to mark paragraph boundaries in my index in a way that will discourage these queries from matching across paragraph boundaries. I understand that I need to increment the position by some suitably large value in the PositionIncrementAttribute when processing text to mark paragraph boundaries. Let's assume that in the source document, my paragraph boundaries are marked by <p>...</p> pairs.

How do I set up my token stream to detect the tags? Also, I don't actually want to index the tags themselves. For the purposes of indexing, I would rather increment the position of the next legitimate token, rather than emitting a token that corresponds to the tag, since I don't want it to affect search.

The easiest way to add gaps (= PositionIncrement > 1) is to provide a custom TokenStream. You do not need to change your Analyzer for that. However, HTML parsing should be done upstream (i.e., you should segment and clean your input text accordingly before feeding it to Lucene).
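To make the "upstream" step concrete, here is a minimal sketch of how the input could be segmented into paragraphs before Lucene ever sees it. It uses only the JDK; the class name ParagraphSplitter and its split method are made up for illustration, and the regular expressions assume simple, well-formed <p>...</p> pairs without nesting. For real-world HTML a proper parser would be the safer choice.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Lucene: splits markup into paragraph texts.
public class ParagraphSplitter {

    // Matches <p>...</p> (optionally with attributes), case-insensitively,
    // across line breaks. Assumes paragraphs are not nested.
    private static final Pattern PARAGRAPH =
            Pattern.compile("<p\\b[^>]*>(.*?)</p>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Matches any remaining tag inside a paragraph, e.g. <b> or <br/>.
    private static final Pattern TAG = Pattern.compile("<[^>]+>");

    /** Returns the text of each non-empty paragraph with inner markup stripped. */
    public static List<String> split(String html) {
        List<String> paragraphs = new ArrayList<>();
        Matcher m = PARAGRAPH.matcher(html);
        while (m.find()) {
            // Replace inner tags with a space, then collapse runs of whitespace.
            String text = TAG.matcher(m.group(1)).replaceAll(" ")
                    .replaceAll("\\s+", " ")
                    .trim();
            if (!text.isEmpty()) {
                paragraphs.add(text);
            }
        }
        return paragraphs;
    }

    public static void main(String[] args) {
        for (String p : split("<p>A B C</p>\n<p>D <b>E</b> F</p>")) {
            System.out.println(p);
        }
    }
}
```

Each returned string can then be indexed as its own field value, with a gap inserted between consecutive values, so the tags themselves never reach the index.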

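Here is a full, working example, written against the Lucene 4.10.1 API:

import java.io.IOException;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.SimpleAnalyzer;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class GapTest {

    public static void main(String[] args) throws Exception {
        final Directory dir = new RAMDirectory();
        final IndexWriterConfig iwConfig = new IndexWriterConfig(Version.LUCENE_4_10_1, new SimpleAnalyzer());
        final IndexWriter iw = new IndexWriter(dir, iwConfig);

        // One document, two "paragraphs" in the same field,
        // separated by a gaps-only TokenStream.
        Document doc = new Document();
        doc.add(new TextField("body", "A B C", Store.YES));
        doc.add(new TextField("body", new PositionIncrementTokenStream(10)));
        doc.add(new TextField("body", "D E F", Store.YES));

        System.out.println(doc);
        iw.addDocument(doc);
        iw.close();

        final IndexReader ir = DirectoryReader.open(dir);
        IndexSearcher is = new IndexSearcher(ir);

        QueryParser qp = new QueryParser("body", new SimpleAnalyzer());

        // Phrase queries within one paragraph and across the gap,
        // with and without slop; prints the hit count for each.
        for (String q : new String[] { "\"A B C\"", "\"A B C D\"",
                "\"A B C D\"~10", "\"A B C D E F\"~10",
                "\"A B C D F E\"~10", "\"A B C D F E\"~11" }) {
            Query query = qp.parse(q);
            TopDocs docs = is.search(query, 10);
            System.out.println(docs.totalHits + "\t" + q);
        }
        ir.close();
    }

    /**
     * A gaps-only TokenStream (uses {@link PositionIncrementAttribute}).
     *
     * @author Christian Kohlschuetter
     */
    private static final class PositionIncrementTokenStream extends TokenStream {
        private boolean first = true;
        private PositionIncrementAttribute attribute;
        private final int positionIncrement;

        public PositionIncrementTokenStream(final int positionIncrement) {
            super();
            this.positionIncrement = positionIncrement;
            attribute = addAttribute(PositionIncrementAttribute.class);
        }

        @Override
        public boolean incrementToken() throws IOException {
            // Emit exactly one token that carries only the position increment.
            if (first) {
                first = false;
                attribute.setPositionIncrement(positionIncrement);
                return true;
            } else {
                return false;
            }
        }

        @Override
        public void reset() throws IOException {
            super.reset();
            first = true;
        }
    }

}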
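
To see why the gap keeps exact phrases from matching across the boundary, it may help to work through the position arithmetic by hand. The sketch below is plain Java, not Lucene code; the class PositionSketch is made up for illustration. It assumes that each token's position is the previous position plus its increment (starting from -1), that the position increment gap between values of the same field is 0, and that the gaps-only stream counts as a single token.

```java
// Hypothetical illustration of position counting; does not use Lucene.
public class PositionSketch {

    /**
     * Computes token positions from position increments:
     * position = previous position + increment, starting from -1.
     */
    public static int[] positions(int[] increments) {
        int[] result = new int[increments.length];
        int position = -1;
        for (int i = 0; i < increments.length; i++) {
            position += increments[i];
            result[i] = position;
        }
        return result;
    }

    public static void main(String[] args) {
        // "A B C", then the gaps-only token (increment 10), then "D E F".
        String[] tokens = { "A", "B", "C", "(gap)", "D", "E", "F" };
        int[] increments = { 1, 1, 1, 10, 1, 1, 1 };
        int[] pos = positions(increments);
        for (int i = 0; i < tokens.length; i++) {
            System.out.println(tokens[i] + " -> " + pos[i]);
        }
    }
}
```

Under these assumptions C lands at position 2 and D at position 13, so D is 10 positions further away than an exact phrase "C D" expects. That would explain why the queries in the example probe slop values of 10 and 11.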
