
Counting sentences: Database (like H2) vs. Lucene vs.?

I am doing some linguistic research that depends on being able to query a corpus of 100 million sentences. The information I need from that corpus is along these lines: how many sentences have "john" as the first word, "went" as the second word, and "hospital" as the fifth word, etc.? I just need the count; I don't need to actually retrieve the sentences.

The idea I had was to split these sentences into words and store them in a database, where the columns are the word positions (word-1, word-2, word-3, etc.) and the sentences are the rows. So it looks like:

Word1     Word2     Word3  Word4   Word5  ...
Congress  approved  a      new     bill
John      went      to     school
...

My purpose would then be fulfilled by a query like SELECT COUNT(*) FROM sentences WHERE word1 = 'John' AND word4 = 'school'. But I am wondering: can this be better achieved using Lucene (or some other tool)?

The program I am writing (in Java) will run tens of thousands of such queries against that 100-million-sentence corpus, so lookup speed is important.

Thanks for any advice,

Anas


Assuming that the queries are as simple as you have indicated, a simple SQL db (Postgres, MySQL, possibly H2) would be perfect for this.
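A minimal sketch of that setup with H2 over JDBC. Table and column names here are illustrative, the H2 driver jar is assumed to be on the classpath, and the schema fixes a maximum number of word columns up front:

import java.sql.*;

public class SentenceCounts {
    public static void main(String[] args) throws SQLException {
        try (Connection conn = DriverManager.getConnection("jdbc:h2:./corpus")) {
            try (Statement st = conn.createStatement()) {
                // One column per word position; index the positions you query on.
                st.execute("CREATE TABLE IF NOT EXISTS sentences ("
                         + "word1 VARCHAR, word2 VARCHAR, word3 VARCHAR, "
                         + "word4 VARCHAR, word5 VARCHAR)");
                st.execute("CREATE INDEX IF NOT EXISTS idx_w1 ON sentences(word1)");
            }
            // The count query from the question, parameterized.
            try (PreparedStatement ps = conn.prepareStatement(
                    "SELECT COUNT(*) FROM sentences WHERE word1 = ? AND word4 = ?")) {
                ps.setString(1, "John");
                ps.setString(2, "school");
                try (ResultSet rs = ps.executeQuery()) {
                    rs.next();
                    System.out.println(rs.getInt(1));
                }
            }
        }
    }
}

With tens of thousands of queries, indexes on the positions you filter by most often will likely matter more than the choice of database.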


I suppose you already have the infrastructure to tokenize a given sentence. You can create a Lucene document with one field per word in the sentence, naming the fields field1, field2, and so on. Since Lucene doesn't have a fixed schema like a database, you can define as many fields as you wish, on the fly. You can add an extra identifier field if you want to know which sentences matched a query.
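A sketch of that indexing scheme, assuming a reasonably recent Lucene (4.x or later), an already-open IndexWriter named writer, and a String[] words holding the tokens of one sentence; StringField indexes a token verbatim, without analyzing or storing it:

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;

// One field per word position: field1, field2, ...
// Store.NO: the terms are indexed for matching but never stored.
Document doc = new Document();
for (int i = 0; i < words.length; i++) {
    doc.add(new StringField("field" + (i + 1), words[i], Field.Store.NO));
}
writer.addDocument(doc);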

While searching, your typical lucene query will be

+field1:John +field4:school

Since you are not bothered about relevance ranking, you can write a custom Collector that ignores scores. (That will return results significantly faster as well.)
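For pure counting you may not even need to write one yourself: recent Lucene versions ship a TotalHitCountCollector that counts matches without scoring or loading documents. A sketch, assuming an open IndexSearcher named searcher (API details shift a little between Lucene versions):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TotalHitCountCollector;

// Build +field1:john +field4:school programmatically.
BooleanQuery.Builder b = new BooleanQuery.Builder();
b.add(new TermQuery(new Term("field1", "john")), BooleanClause.Occur.MUST);
b.add(new TermQuery(new Term("field4", "school")), BooleanClause.Occur.MUST);

// Counts hits only: no scoring, no document retrieval.
TotalHitCountCollector collector = new TotalHitCountCollector();
searcher.search(b.build(), collector);
int count = collector.getTotalHits();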

Since you don't plan to retrieve the matching sentences or words, you should only index these fields, not store them. That should push performance up another notch.


Lucene span queries can implement positional search. Use SpanFirst to find a word in the first N positions of a document, and combine it with SpanNot to rule out the first N-1.

Your example query would look like this:

<BooleanQuery: +(+spanFirst(john, 1) +spanFirst(went, 2)) +spanNot(spanFirst(hospital, 5), spanFirst(hospital, 4))>

Lucene also, of course, allows getting the total hit count of a search without iterating over all the matching documents.
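Built programmatically, that query might look like the sketch below. It assumes the whole sentence was indexed into a single positional field, here called "sentence" (an illustrative name), and uses the span query classes from org.apache.lucene.search.spans:

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.spans.*;

// spanFirst(term, n) matches "term" within the first n positions.
SpanQuery johnFirst  = new SpanFirstQuery(new SpanTermQuery(new Term("sentence", "john")), 1);
SpanQuery wentSecond = new SpanFirstQuery(new SpanTermQuery(new Term("sentence", "went")), 2);

// "hospital" within the first 5 positions but not within the first 4,
// i.e. exactly at position 5.
SpanQuery hospitalFifth = new SpanNotQuery(
        new SpanFirstQuery(new SpanTermQuery(new Term("sentence", "hospital")), 5),
        new SpanFirstQuery(new SpanTermQuery(new Term("sentence", "hospital")), 4));

BooleanQuery.Builder b = new BooleanQuery.Builder();
b.add(johnFirst,     BooleanClause.Occur.MUST);
b.add(wentSecond,    BooleanClause.Occur.MUST);
b.add(hospitalFifth, BooleanClause.Occur.MUST);
Query query = b.build();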


  • I suggest you read Search Engine versus DBMS. From what I gather, you do need a database rather than a full-text search library.
  • In any case, I suggest you preprocess your text and replace every word/token with a number, using a dictionary. This turns every sentence into an array of word codes. I would then store each word position in a separate database column, which simplifies the counts and speeds them up (a sketch of this encoding follows the example below). For example:

A boy and a girl drank milk

translates into:

120 530 14 120 619 447 253

(the word codes are arbitrary), which leads to storing the row

120 530 14 120 619 447 253 0 0 0 0 0 0 0 ....

(padded with zeros until the number of words you allocate per sentence is exhausted).

This is a somewhat sparse matrix, so maybe this question will help.
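A minimal Java sketch of that encoding (all names here are illustrative; code 0 is reserved for the empty padding slots):

import java.util.HashMap;
import java.util.Map;

class SentenceEncoder {
    static final int MAX_WORDS = 20;            // columns allocated per sentence
    private final Map<String, Integer> dict = new HashMap<>();
    private int nextCode = 1;                   // 0 means "no word here"

    // Turn a sentence into a fixed-width, zero-padded row of word codes.
    int[] encode(String sentence) {
        String[] words = sentence.toLowerCase().split("\\s+");
        int[] row = new int[MAX_WORDS];         // int[] defaults to all zeros
        for (int i = 0; i < words.length && i < MAX_WORDS; i++) {
            row[i] = dict.computeIfAbsent(words[i], w -> nextCode++);
        }
        return row;
    }
}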


Look at Apache Hadoop and MapReduce. They were developed for jobs like this.


Or you can do it by hand, using only Java:

// Slide a window of three words over the corpus and print each
// word triple to a result file. inputFileWords holds the corpus
// tokens in order; resultFile is a PrintWriter.
List<String> triple = new ArrayList<>(3);
for (String word : inputFileWords) {
    if (triple.size() == 3) {
        resultFile.println(String.join(" ", triple));
        triple.remove(0);
    }
    triple.add(word);
}
if (triple.size() == 3) {   // don't lose the final window
    resultFile.println(String.join(" ", triple));
}

Then sort this file and sum the duplicate lines (by hand or with command-line utilities such as sort and uniq -c); that will be about as fast as it gets.

