Problem: Need to look up a sentence in a database of millions of sentences?

I'll be storing millions of sentences in a database, each with an author. I need to be able to efficiently search for a sentence and return its author. I'd also like to be able to misspell a word, or forget a word or two in the sentence, and have the application still find the match (fuzzy-esque). Can anyone point me in the right direction? How does Google do this? For instance, I can search for lyrics on Google and it will return the song containing those lyrics. I'm looking to do the same thing.

Thanks all.

If fuzzy makes things too complicated, then I can deal with just an efficient sentence search.


If you're writing in Java, you can try Lucene.

Shouldn't it really be "document" and author instead of individual sentences?


For full-text search, look at the inverted index data structure.

This is how search engines do it.

samples of code
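To make the inverted index idea concrete, here is a minimal in-memory sketch in Python. It is an illustration only, not how any particular search engine or database implements it: the `InvertedIndex` class, its method names, and the sample sentences are all made up for this example. Each word maps to the set of sentence ids that contain it, and a query is ranked by how many of its distinct words each sentence shares, so a query with a few missing words can still find the right sentence.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


class InvertedIndex:
    """Hypothetical toy index: word -> ids of the sentences containing it."""

    def __init__(self):
        self.postings = defaultdict(set)  # word -> set of sentence ids
        self.records = []                 # sentence id -> (sentence, author)

    def add(self, sentence, author):
        doc_id = len(self.records)
        self.records.append((sentence, author))
        for word in set(tokenize(sentence)):
            self.postings[word].add(doc_id)

    def search(self, query, limit=3):
        """Rank sentences by how many distinct query words they contain."""
        hits = Counter()
        for word in set(tokenize(query)):
            for doc_id in self.postings.get(word, ()):
                hits[doc_id] += 1
        return [
            (self.records[doc_id][1], self.records[doc_id][0], score)
            for doc_id, score in hits.most_common(limit)
        ]


index = InvertedIndex()
index.add("To be, or not to be, that is the question", "Shakespeare")
index.add("I think, therefore I am", "Descartes")
index.add("The only thing we have to fear is fear itself", "Roosevelt")

# Two words are missing from the query, but the best hit is still correct.
author, sentence, score = index.search("to be or not that question")[0]
print(author)  # Shakespeare
```

Only sentences that share at least one word with the query are ever touched, which is what keeps lookups cheap as the collection grows. A real engine would add things this sketch leaves out, such as on-disk storage, stop-word handling, and relevance weighting.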

UPDATE: also, if you're working on a distributed system, check out Hadoop, an open-source alternative to Google's MapReduce.


Full-text indexing on SQL Server or Oracle will most likely be what you're after, right out of the box. They can go fuzzy, use word roots, and do other clever stuff. I can't comment on other DB engines, though a quick Google search suggests most have something similar. For some reason I expect them to be more limited in the fuzziness.


Indeed, fuzzy matching is not a simple thing to do. Some databases implement some kind of fuzzy search, but depending on the method used and your data, your results may vary. Here's a link that explains fuzzy searches in SQL Server:

http://msdn.microsoft.com/en-us/magazine/cc163731.aspx

As for the sentence search, most DB engines implement full-text search/indexing, which is worth a look. It comes with trade-offs in terms of performance and storage.
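As a rough illustration of what fuzzy matching means for this problem, here is a small Python sketch using the standard library's `difflib.SequenceMatcher`. This is not what SQL Server or any other database does internally; the `similarity` and `fuzzy_lookup` helpers, the `cutoff` value of 0.6, and the sample data are assumptions made for the example.

```python
import difflib

# Hypothetical sample data: (sentence, author) pairs.
SENTENCES = [
    ("To be, or not to be, that is the question", "Shakespeare"),
    ("I think, therefore I am", "Descartes"),
    ("The only thing we have to fear is fear itself", "Roosevelt"),
]


def similarity(a, b):
    """Return a similarity ratio between 0 and 1, ignoring case."""
    return difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio()


def fuzzy_lookup(query, records, cutoff=0.6):
    """Return (author, sentence, score) for the closest sentence, or None."""
    best = max(records, key=lambda rec: similarity(query, rec[0]))
    score = similarity(query, best[0])
    if score < cutoff:
        return None  # nothing is close enough to count as a match
    return best[1], best[0], score


# Two misspelled words and a missing comma still resolve to the right author.
result = fuzzy_lookup("I thnk therefor I am", SENTENCES)
print(result[0])  # Descartes
```

This compares the query against every stored sentence, so on its own it would be far too slow for millions of rows. One plausible arrangement is to let a full-text or inverted index narrow the search to a small candidate set first, and run the expensive similarity scoring only on those candidates.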


How does google do this?

Using inverted indexes. The details are proprietary, but you can bet your last dollar that there is a lot of replication, with the indexes kept in memory, so that they can handle the vast number of search requests they get per second.
