Detect similar sounding words in Ruby

2022-12-25 01:45

I'm aware of SOUNDEX and (double) Metaphone, but these don't let me test for the similarity of words as a whole - for example "Hi" sounds very similar to "Bye", but both of these methods will mark them as completely different.

Are there any libraries in Ruby, or any methods you know of, that are capable of determining the similarity between two words? (Either a boolean is/isn't similar, or numerical 40% similar)

edit: Extra bonus points if there is an easy method to 'drop in' a different dialect or language!

I think you're describing Levenshtein distance. And yes, there are gems for that. If you're into pure Ruby, go for the text gem.

$ gem install text

The docs have more details, but here's the crux of it:

Text::Levenshtein.distance('test', 'test')    # => 0
Text::Levenshtein.distance('test', 'tent')    # => 1

If you're ok with native extensions...

$ gem install levenshtein

Its usage is similar, and its performance is very good. (It handles ~1000 spelling corrections per minute on my systems.)

If you need a numeric measure of how similar two words are, divide the distance by the word length (for example, the length of the longer word).
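A minimal sketch of that ratio, assuming the distance is normalized by the longer word. The `levenshtein` method is a hand-rolled stand-in so the example runs without any gem (Text::Levenshtein.distance could be swapped in), and `similarity` is a hypothetical helper name, not part of either gem:

```ruby
# Hand-rolled Levenshtein distance, used here only so the sketch runs
# without a gem; it is a stand-in for Text::Levenshtein.distance.
def levenshtein(a, b)
  a = a.chars
  b = b.chars
  # previous[j] is the distance between the processed prefix of a and b[0, j].
  previous = (0..b.length).to_a
  a.each_with_index do |char_a, i|
    current = [i + 1]
    b.each_with_index do |char_b, j|
      cost = char_a == char_b ? 0 : 1
      current << [previous[j + 1] + 1, current[j] + 1, previous[j] + cost].min
    end
    previous = current
  end
  previous.last
end

# Hypothetical helper: 1.0 means identical, 0.0 means every character differs.
def similarity(a, b)
  longest = [a.length, b.length].max
  return 1.0 if longest.zero?

  1.0 - levenshtein(a, b).to_f / longest
end

puts similarity('test', 'tent') # 0.75
puts similarity('hi', 'bye')    # 0.0
```

Note that "hi" and "bye" score 0.0 here: the ratio compares spelling, not sound, so on its own it does not capture that the two words sound alike.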

If you want a simple similarity test, consider something like this (untested, but straightforward):

require 'text'

String.module_eval do
  # True when the edit distance to other is at most threshold.
  def similar?(other, threshold = 2)
    Text::Levenshtein.distance(self, other) <= threshold
  end
end

What you need is a pronunciation dictionary. The best free one is the CMU Pronouncing Dictionary.

Map the strings to their pronunciations, then do a bit of preprocessing (for example, you'll probably want to remove the numbers that cmudict uses to indicate stress). You can then apply one of the techniques others have suggested, such as Levenshtein distance, to the pronunciation strings instead of the input strings.

For an example of something similar, see dict/dict.rb in Rhyme Ninja.
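The pronunciation-based approach might be sketched as follows. This assumes dictionary lines in cmudict's "WORD  PHONEME PHONEME ..." layout; the three sample entries are only illustrative, and `load_pronunciations`, `sequence_distance`, and `phonetic_distance` are hypothetical helper names. The distance is computed over phoneme arrays rather than characters, and alternate pronunciations are not handled:

```ruby
# Parse cmudict-style text into { "word" => ["PH", "PH", ...] },
# dropping the stress digits (AY1 -> AY). Lines starting with ";;;"
# are treated as comments.
def load_pronunciations(dict_text)
  dict_text.each_line.each_with_object({}) do |line, table|
    next if line.strip.empty? || line.start_with?(';;;')

    word, *phonemes = line.split
    table[word.downcase] = phonemes.map { |phoneme| phoneme.delete('0-9') }
  end
end

# Levenshtein distance over two arrays (here: arrays of phonemes).
def sequence_distance(a, b)
  previous = (0..b.length).to_a
  a.each_with_index do |item_a, i|
    current = [i + 1]
    b.each_with_index do |item_b, j|
      cost = item_a == item_b ? 0 : 1
      current << [previous[j + 1] + 1, current[j] + 1, previous[j] + cost].min
    end
    previous = current
  end
  previous.last
end

# Raises KeyError if either word is missing from the dictionary.
def phonetic_distance(dict, word_a, word_b)
  sequence_distance(dict.fetch(word_a.downcase), dict.fetch(word_b.downcase))
end

# Illustrative sample; a real program would read the dictionary file instead.
sample = <<~DICT
  BYE  B AY1
  HELLO  HH AH0 L OW1
  HI  HH AY1
DICT

dict = load_pronunciations(sample)
puts phonetic_distance(dict, 'hi', 'bye')   # 1
puts phonetic_distance(dict, 'hi', 'hello') # 3
```

With these entries, "hi" and "bye" differ by a single phoneme, which matches the intuition in the question that they sound alike.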

You might first preprocess the words using a thesaurus database, which will convert words with similar meaning to the same word. There are various thesaurus databases out there; unfortunately, I couldn't find a decent free one for English. The one I found is http://www.gutenberg.org/etext/3202, but it doesn't show what relation the specific words have (similar, opposite, alternate meaning, etc.), so all words on the same line are related in some way, but you won't know what that relation is.

For Hungarian, for example, there is a good free thesaurus database, but you don't have soundex/metaphone for Hungarian texts...

If you have the database, writing a program that preprocesses the texts isn't too hard (ultimately it's a simple search-and-replace, but you might want to preprocess the thesaurus database using soundex or metaphone too).
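A rough sketch of that search-and-replace step, assuming a thesaurus file where each line is a comma-separated list of related words (the sample lines and the helper names `build_canonical_map` and `canonicalize` are hypothetical). Every word on a line is mapped to the first word of that line:

```ruby
# Map each word to the first word on its thesaurus line.
# If a word appears on several lines, the first line wins.
def build_canonical_map(lines)
  lines.each_with_object({}) do |line, map|
    words = line.split(',').map { |word| word.strip.downcase }
    words.each { |word| map[word] ||= words.first }
  end
end

# Replace every known word in text with its canonical form;
# unknown words are left as they are.
def canonicalize(text, map)
  text.gsub(/\w+/) { |word| map.fetch(word.downcase, word) }
end

# Hypothetical sample data in the assumed one-group-per-line layout.
lines = [
  'hello,hi,greetings',
  'bye,goodbye,farewell'
]

map = build_canonical_map(lines)
puts canonicalize('Hi there, goodbye', map) # hello there, bye
```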
