Search query and keyword phrase algorithm

I am looking for an algorithm that will efficiently separate a search string into an array of known search phrases. For instance, if I type "Los Angeles pizza", it needs to know I am looking for "Los Angeles" and "pizza", not "Los" and "Angeles pizza".

This is for a specialized search application. Assume I have a dictionary of all the phrases people will use.


The Google N-gram Corpus could be used to determine the most likely phrase divisions.

For reasonably short phrases, you could generate all the possible sets of n-grams that the phrase can be divided into (e.g. ["Los", "Angeles", "pizza"], ["Los Angeles", "pizza"], ["Los", "Angeles pizza"] and ["Los Angeles pizza"] for your example phrase), look them up in the corpus, and see which one(s) come out with the highest number of occurrences. (Considering the size of the corpus, you'll probably need to load it into a database rather than an in-memory hashtable.)
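The enumerate-and-look-up idea above can be sketched roughly as follows. This is only an illustration under stated assumptions: `COUNTS` and `TOTAL` are made-up toy numbers standing in for a real n-gram table, the function names are hypothetical, and scoring a segmentation by the product of each phrase's relative frequency is one plausible reading of "highest number of occurrences", not something the corpus prescribes.

```python
from itertools import combinations
from math import prod

# Hypothetical toy counts standing in for a real n-gram table.
COUNTS = {
    "los": 5000,
    "angeles": 4000,
    "pizza": 3000,
    "los angeles": 3500,
    "angeles pizza": 2,
    "los angeles pizza": 1,
}
TOTAL = 1_000_000  # assumed total number of n-grams observed


def segmentations(words):
    """Yield every way to split `words` into contiguous phrases."""
    n = len(words)
    for k in range(n):  # k = number of cut points
        for cuts in combinations(range(1, n), k):
            bounds = (0, *cuts, n)
            yield [" ".join(words[a:b]) for a, b in zip(bounds, bounds[1:])]


def best_segmentation(query, counts, total):
    """Return the segmentation with the highest product of relative frequencies."""
    words = query.lower().split()
    if not words:
        return []

    def score(seg):
        # Phrases missing from the table contribute 0, ruling the split out.
        return prod(counts.get(phrase, 0) / total for phrase in seg)

    return max(segmentations(words), key=score)


print(best_segmentation("Los Angeles pizza", COUNTS, TOTAL))
```

A query of n words has 2^(n-1) possible segmentations, so brute force like this is only reasonable for short queries. For longer ones, a dynamic-programming pass over the word positions would avoid enumerating every split.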

EDIT: By the looks of things, the Google corpus is not freely available. There may be similar resources you could use instead, though. If not, there are certainly corpora of web text that you can download and use to build your own lists of n-grams.
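If you do end up building your own n-gram lists, a minimal sketch of the counting step might look like this. The function name `ngram_counts`, the whitespace tokenization, and the `max_n` limit are all assumptions made for illustration. A real corpus would need proper tokenization and most likely a database-backed store rather than an in-memory `Counter`.

```python
from collections import Counter


def ngram_counts(lines, max_n=3):
    """Count every 1..max_n word n-gram in an iterable of text lines."""
    counts = Counter()
    for line in lines:
        words = line.lower().split()  # naive whitespace tokenization (assumption)
        for n in range(1, max_n + 1):
            for i in range(len(words) - n + 1):
                counts[" ".join(words[i:i + n])] += 1
    return counts


sample = ["Los Angeles pizza", "pizza in Los Angeles"]
counts = ngram_counts(sample, max_n=2)
print(counts["los angeles"], counts["angeles pizza"])
```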
