PHP MYSQL search engine using keywords

I am trying to implement a search engine based on keyword search. Can anyone tell me which is the best (fastest) algorithm for implementing a keyword search?

What I need is:

My keywords:

search, faster, profitable

Their synonyms:

search: grope, google, identify, search  
faster: smart, quick, faster  
profitable: gain, profit  

Now I need to search a database for all possible combinations of the above synonyms to identify the best-matching entries.


The best solution would be to use an existing search engine, such as Lucene or one of its alternatives (see "Which are the best alternatives to Lucene?").

Now, if you want to implement this yourself (it's a really great and exciting problem), you should have a look at the concept of an inverted index. That's what Google and other search engines use. Of course, they have a LOT of additional systems on top of it, but that's the basis.

The idea of an inverted index is that for each keyword (and its synonyms), you store the ids of the documents that contain the keyword. It is then very easy to look up the matching documents for a set of keywords, because you just compute the intersection (or the union, depending on what you want to do) of their lists in the inverted index. Example:

Let's assume this is your inverted index:

smart: [42,35]
gain: [42]
profit: [55]

Now if you have a query "smart, gain", your matching documents are the intersection (or the union) of [42, 35] and [42], that is, [42] (or [42, 35] for the union).
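
To make the lookup concrete, here is a minimal Python sketch of the idea (the question is about PHP/MySQL, but the structure carries over directly). The function names and the three sample documents are made up for illustration; the documents are chosen only so that the resulting index matches the example above.

```python
from collections import defaultdict

def build_inverted_index(documents):
    """Map each keyword to the set of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def match_all(index, keywords):
    """Ids of documents containing every keyword (intersection)."""
    postings = [index.get(word, set()) for word in keywords]
    return set.intersection(*postings) if postings else set()

def match_any(index, keywords):
    """Ids of documents containing at least one keyword (union)."""
    postings = [index.get(word, set()) for word in keywords]
    return set.union(*postings) if postings else set()

# Hypothetical documents, picked so the index equals the example:
# smart -> {35, 42}, gain -> {42}, profit -> {55}
documents = {
    35: "smart",
    42: "smart gain",
    55: "profit",
}
index = build_inverted_index(documents)

print(sorted(match_all(index, ["smart", "gain"])))  # [42]
print(sorted(match_any(index, ["smart", "gain"])))  # [35, 42]
```

In a MySQL setting the same structure would probably be a two-column table of (keyword, document id) with an index on the keyword column, but that layout is an assumption, not something the question specifies.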

To handle synonyms, you just need to extend your query to include all synonyms for the words in the initial query. Based on your example, the query "faster, profitable" would become "faster, smart, quick, profitable, gain, profit".
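
A possible sketch of that expansion step in Python, using the synonym lists from the question. The `SYNONYMS` table and the `expand_query` function are illustrative names, and holding the synonyms in an in-memory dictionary is an assumption; they could just as well come from a database table.

```python
# Synonym table copied from the question (illustrative in-memory form).
SYNONYMS = {
    "search": ["grope", "google", "identify"],
    "faster": ["smart", "quick"],
    "profitable": ["gain", "profit"],
}

def expand_query(keywords, synonyms=SYNONYMS):
    """Return the keywords plus their synonyms, in order, without duplicates."""
    expanded = []
    for word in keywords:
        # Words with no entry in the table are kept as they are.
        for term in [word, *synonyms.get(word, [])]:
            if term not in expanded:
                expanded.append(term)
    return expanded

print(expand_query(["faster", "profitable"]))
# ['faster', 'smart', 'quick', 'profitable', 'gain', 'profit']
```

Since a document is unlikely to contain every synonym of a word, the expanded terms would normally be combined with a union rather than an intersection.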

Once you've implemented that, a nice improvement is to add TF-IDF (term frequency, inverse document frequency) weighting to your keywords. That's basically a way to give rare words (such as "programming") more weight than common ones (such as "the").
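
As a rough illustration, the sketch below scores documents with one common TF-IDF variant: term frequency normalised by document length, multiplied by log(N / document frequency). Many other variants exist, and the function name and sample documents are invented for the example.

```python
import math
from collections import Counter

def tf_idf_scores(documents, keywords):
    """Rank documents by the summed TF-IDF weight of the query keywords."""
    tokenized = {doc_id: text.lower().split() for doc_id, text in documents.items()}
    n_docs = len(tokenized)
    # Document frequency: the number of documents each word appears in.
    df = Counter(word for words in tokenized.values() for word in set(words))

    scores = {}
    for doc_id, words in tokenized.items():
        counts = Counter(words)
        score = 0.0
        for word in keywords:
            if counts[word]:
                tf = counts[word] / len(words)      # how prominent the word is here
                idf = math.log(n_docs / df[word])   # how rare the word is overall
                score += tf * idf
        if score > 0:
            scores[doc_id] = score
    # Highest-scoring documents first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical documents: "the" is in all of them, "programming" in only one.
documents = {
    1: "the search for the profit",
    2: "the quick search",
    3: "the programming guide",
}
print(tf_idf_scores(documents, ["the", "programming"]))
```

Here "the" appears in every document, so its inverse document frequency is log(3/3) = 0 and it contributes nothing; only document 3, which contains the rare word "programming", gets a positive score.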

The other approach is to just go through all your documents and find the ones that contain your words (or their synonyms). The inverted index will be MUCH faster though, because you don't have to go through all your documents every time. The time-consuming operation is building the index, which only has to be done once and then kept up to date as documents are added or changed.
