开发者

Index of sentences

I have a few tens of thousands of short documents, consisting of 10 to 20 English sentences each (as well as some other non-sentence stuff like possibly HTML formatting or other "junk"). These documents are chopped out of other longer documents - in other words the shorter document "A1" might be sentence 10 through 20 of original document "A" and another shorter document "A2" might be sentence 11 through 25 of the same document original document "A" , and some of the o开发者_开发技巧riginal source documents might be summaries or copies of other original source documents, so that original source document "B" might also have sentences 10 through 20 of original source document "A", although not necessarily in the same location. And that same group of sentences might have been extracted from "B" into another short document "B3".

For each sentence, or at least each sentence over a certain length (say, > 3 words long), I'd like to produce a list of every short document that sentence occurs in. I'd like to scan the existing shorter documents and produce that index, and also update that index as I break up further longer original source documents into shorter documents.

I'm thinking what I need is some code to make an efficient hash code for a sentence which has a very low likelihood of producing the same hash code for two different sentences. Is the hash algorithm used in Java String.hashCode() a good choice for that? MD5 or other cryptographic hash seems like it would be too expensive and overkill for this purpose.


I recently evaluated hash algorithms with the requirement that in a few million inputs there should be virtually no possibility of hash collision, and the hashing must be very fast. CityHash was the winner, hands-down.

If you're interested in calculating the probability of a hash collision, that subject is sometimes referred to as the Birthday Problem. The math behind it is here:

https://sites.google.com/site/craigandera/craigs-stuff/odds-ends/the-birthday-problem-calculator


More broadly, you would probably benefit from reading this book. The structure you are describing is a classic inverted index: the book describes efficient algorithms for creating, updating and performing interesting queries on them.

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜