
Counting sentences: Database (like H2) vs. Lucene vs.?

I am doing some linguistic research that depends on being able to query a corpus of 100 million sentences. The information I need from that corpus is along these lines: how many sentences have "john" as the first word, "went" as the second word, and "hospital" as the fifth word, etc.? I just need the count; I don't need to actually retrieve the sentences.

The idea I had was to split these sentences into words and store them in a database, where the columns are the word positions (word-1, word-2, word-3, etc.) and the sentences are the rows. So it looks like:

Word1     Word2     Word3  Word4   Word5  ...
Congress  approved  a      new     bill
John      went      to     school
...

My purpose would then be fulfilled by a query like SELECT COUNT(*) FROM sentences WHERE word1 = 'John' AND word4 = 'school'. But I am wondering: can this be better achieved using Lucene (or some other tool)?

The program I am writing (in Java) will run tens of thousands of such queries against that 100-million-sentence corpus, so lookup speed is important.

Thanks for any advice,

Anas


Assuming that the queries are as simple as you have indicated, a simple SQL db (Postgres, MySQL, possibly H2) would be perfect for this.
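A minimal sketch of that setup with H2 over JDBC. Table and column names here are illustrative, the H2 driver jar is assumed to be on the classpath, and the schema fixes a maximum number of word columns up front:

import java.sql.*;

public class SentenceCounts {
    public static void main(String[] args) throws SQLException {
        try (Connection conn = DriverManager.getConnection("jdbc:h2:./corpus")) {
            try (Statement st = conn.createStatement()) {
                // One column per word position; index the positions you query on.
                st.execute("CREATE TABLE IF NOT EXISTS sentences ("
                         + "word1 VARCHAR, word2 VARCHAR, word3 VARCHAR, "
                         + "word4 VARCHAR, word5 VARCHAR)");
                st.execute("CREATE INDEX IF NOT EXISTS idx_w1 ON sentences(word1)");
            }
            // The count query from the question, parameterized.
            try (PreparedStatement ps = conn.prepareStatement(
                    "SELECT COUNT(*) FROM sentences WHERE word1 = ? AND word4 = ?")) {
                ps.setString(1, "John");
                ps.setString(2, "school");
                try (ResultSet rs = ps.executeQuery()) {
                    rs.next();
                    System.out.println(rs.getInt(1));
                }
            }
        }
    }
}

With tens of thousands of queries, indexes on the positions you filter by most often will likely matter more than the choice of database.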


I suppose you already have the infrastructure to tokenize a given sentence. You can create a Lucene document with one field per word in the sentence, naming the fields field1, field2, and so on. Since Lucene doesn't have a fixed schema like a database, you can define as many fields as you wish, on the fly. You can add an extra identifier field if you want to know which sentences matched a query.
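A sketch of that indexing scheme, assuming a reasonably recent Lucene (4.x or later), an already-open IndexWriter named writer, and a String[] words holding the tokens of one sentence; StringField indexes a token verbatim, without analyzing or storing it:

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;

// One field per word position: field1, field2, ...
// Store.NO: the terms are indexed for matching but never stored.
Document doc = new Document();
for (int i = 0; i < words.length; i++) {
    doc.add(new StringField("field" + (i + 1), words[i], Field.Store.NO));
}
writer.addDocument(doc);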

While searching, your typical lucene query will be

+field1:John +field4:school

Since you are not bothered about relevance ranking, you can write a custom Collector that ignores scores. (That will return results significantly faster as well.)
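For pure counting you may not even need to write one yourself: recent Lucene versions ship a TotalHitCountCollector that counts matches without scoring or loading documents. A sketch, assuming an open IndexSearcher named searcher (API details shift a little between Lucene versions):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TotalHitCountCollector;

// Build +field1:john +field4:school programmatically.
BooleanQuery.Builder b = new BooleanQuery.Builder();
b.add(new TermQuery(new Term("field1", "john")), BooleanClause.Occur.MUST);
b.add(new TermQuery(new Term("field4", "school")), BooleanClause.Occur.MUST);

// Counts hits only: no scoring, no document retrieval.
TotalHitCountCollector collector = new TotalHitCountCollector();
searcher.search(b.build(), collector);
int count = collector.getTotalHits();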

Since you don't plan to retrieve the matching sentences or words, you should only index these fields, not store them. That should push performance up another notch.


Lucene span queries can implement positional search. Use SpanFirst to find a word in the first N positions of a document, and combine it with SpanNot to rule out the first N-1.

Your example query would look like this:

<BooleanQuery: +(+spanFirst(john, 1) +spanFirst(went, 2)) +spanNot(spanFirst(hospital, 5), spanFirst(hospital, 4))>

Lucene also, of course, allows getting the total hit count of a search without iterating over all the matching documents.
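Built programmatically, that query might look like the sketch below. It assumes the whole sentence was indexed into a single positional field, here called "sentence" (an illustrative name), and uses the span query classes from org.apache.lucene.search.spans:

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.spans.*;

// spanFirst(term, n) matches "term" within the first n positions.
SpanQuery johnFirst  = new SpanFirstQuery(new SpanTermQuery(new Term("sentence", "john")), 1);
SpanQuery wentSecond = new SpanFirstQuery(new SpanTermQuery(new Term("sentence", "went")), 2);

// "hospital" within the first 5 positions but not within the first 4,
// i.e. exactly at position 5.
SpanQuery hospitalFifth = new SpanNotQuery(
        new SpanFirstQuery(new SpanTermQuery(new Term("sentence", "hospital")), 5),
        new SpanFirstQuery(new SpanTermQuery(new Term("sentence", "hospital")), 4));

BooleanQuery.Builder b = new BooleanQuery.Builder();
b.add(johnFirst,     BooleanClause.Occur.MUST);
b.add(wentSecond,    BooleanClause.Occur.MUST);
b.add(hospitalFifth, BooleanClause.Occur.MUST);
Query query = b.build();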


  • I suggest you read Search Engine versus DBMS. From what I gather, you do need a database rather than a full-text search library.
  • In any case, I suggest you preprocess your text and replace every word/token with a number, using a dictionary. This turns every sentence into an array of word codes. I would then store each word position in a separate database column, which simplifies the counts and speeds them up (a sketch of this encoding follows the example below). For example:

A boy and a girl drank milk

translates into:

120 530 14 120 619 447 253

(the word codes are arbitrary), which leads to storing the row

120 530 14 120 619 447 253 0 0 0 0 0 0 0 ....

(padded with zeros until the number of words you allocate per sentence is exhausted).

This is a somewhat sparse matrix, so maybe this question will help.
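A minimal Java sketch of that encoding (all names here are illustrative; code 0 is reserved for the empty padding slots):

import java.util.HashMap;
import java.util.Map;

class SentenceEncoder {
    static final int MAX_WORDS = 20;            // columns allocated per sentence
    private final Map<String, Integer> dict = new HashMap<>();
    private int nextCode = 1;                   // 0 means "no word here"

    // Turn a sentence into a fixed-width, zero-padded row of word codes.
    int[] encode(String sentence) {
        String[] words = sentence.toLowerCase().split("\\s+");
        int[] row = new int[MAX_WORDS];         // int[] defaults to all zeros
        for (int i = 0; i < words.length && i < MAX_WORDS; i++) {
            row[i] = dict.computeIfAbsent(words[i], w -> nextCode++);
        }
        return row;
    }
}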


Look at Apache Hadoop and MapReduce. They were developed for jobs like this.


Or you can do it by hand, using only Java:

// Slide a window of three words over the corpus and print each
// word triple to a result file. inputFileWords holds the corpus
// tokens in order; resultFile is a PrintWriter.
List<String> triple = new ArrayList<>(3);
for (String word : inputFileWords) {
    if (triple.size() == 3) {
        resultFile.println(String.join(" ", triple));
        triple.remove(0);
    }
    triple.add(word);
}
if (triple.size() == 3) {   // don't lose the final window
    resultFile.println(String.join(" ", triple));
}

Then sort this file and sum the duplicate lines (by hand or with command-line utilities such as sort and uniq -c); that will be about as fast as it gets.

