开发者

Rails: A good search algorithm

I'm trying to return results more like the search

My curren algorithm is this

def search_conditions(column, q)
  vars  = []
  vars2 = []

  vars << q

  if q.size > 3
    (q.size-2).times do |i|
      vars2 << q[i..(i+2)]
      next if i == 0
      vars << q[i..-1]
      vars << q[0..(q.size-1-i)]
      vars << q[i % 2 == 0 ? (i/2)..(q.size-(i/2)) : (i/2)..(q.size-1-(i/2))] if i > 1
    end
  end

  query = "#{column} ILIKE ?"
  vars = (vars+vars2).uniq

  return [vars.map { query }.join(' OR ')] + vars.map { |x| "%#{x}%" }
end

If I search for "Ruby on Rails" it will make 4 search ways.

1) Removing the left letters "uby on Rails".."ils"

2) Removing the right letters "Ruby on Rail".."Rub"

3) Removing left and right letters "uby on Rails", "uby on Rail" ... "on "

4) Using only 3 letters "Rub", "uby", "by ", "y o", " on"开发者_JS百科 ... "ils"

Is good to use these 4 ways? There any more?


Why are you removing these letters? Are you trying to make sure that if someone searches for 'widgets', you will also match 'widget'?

If so, what you are trying to do is called 'stemming', and it is really much more complicated than removing leading and trailing letters. You may also be interested in removing 'stop words' from your query. These are those extremely common words that are necessary to form grammatically-correct sentences, but are not very useful for search, such as 'a', 'the', etc.

Getting search right is an immensely complex and difficult problem. I would suggest that you don't try to solve it yourself, and instead focus on the core purpose of your site. Perhaps you can leverage the search functionality from the Lucene project in your code. This link may also be helpful for using Lucene in Ruby on Rails.

I hope that helps; I realize that I sort of side-stepped your original question, but I really would not recommend trying to tackle this yourself.


As pkaeding says, stemming is far too complicated to try to implement yourself. However, if you want to search for similar (not exact) strings in MySQL, and your user search terms are very close to the full value of a database field (ie, you're not searching a large body of text for a word or phrase), you might want to try using the Levenshtein distance. Here is a MySQL implementation.

The Levenshtein algorithm will allow you to do "fuzzy" matching, give you a similarity score, and help you avoid installation and configuration of a search daemon, which is complicated. However, this is really only for a very specific case, not a general site search.


While, were all suggesting other possible solutions, check out:

Sphinx - How do you implement full-text search for that 10+ million row table, keep up with the load, and stay relevant? Sphinx is good at those kinds of riddles.

Thinking Sphinx - A Ruby connector between Sphinx and ActiveRecord.

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜