Ruby: How does one test for similarity between two blocks of text?

2023-04-11 07:56 问答作者：

So, lets say I have these texts:

Text1:

absolute obedience to the zerg collective sentience known as the Overmind. The Overmind directed the actions of every zerg creature in the Swarm, functioning through a hierarchy of lesser sentients.

Text2:

zerg creature in the Swarm, functioning through a hierarchy of lesser sentients. Although the Overmind was primarily driven by its desire to consume and assimilate

Text 3

When the zerg first arrived in the Koprulu sector, they were unified by their absolute obedience to the zerg collective sentience known as the Overmind. The Overmind directed the actions of every zerg creature in the Swarm, functioning through a hierarchy of lesser sentients. Although the Overmind was primarily driven by its desire to consume and assimilate the advanced protoss race, it found useful but undeveloped material in humanity.

Now, The end of Text1 and the beginning of text2 overlap, so we'd say the text blocks aren't unique. Similarly, with Text3, Text1 can be found inside (as well as Text2) so this is also not unique, due to the overlap.

So, my question:

How do I go about writing something that can look at consecutive letters or words and determine uniqueness? Ideally, I'd want such a method to return some value, representing the amount of similarity--maybe the number of matched words over the average of the two text blocks' size. When it returns 0, both texts tested should be completely unique.

Some problem's I've run into when playing around with Ruby's string methods.

First, I started trying to find the intersection of two strings.

>> a = "nt version, there are no ch"  
>> b = "he current versi"  
>> (a.chars.to_a & b.chars.to_a).join  
=> "nt versihc"

problem with the above method is that it just appends letters that are in common to the end of the result (we lose the order of characters), which would make it hard to test uniqueness. But I don't think intersection is the best way to start this similarity comparison. Any number of combinations of开发者_StackOverflow words could be present in both texts that are being compared. So maybe if I made an array of consecutive similarities... but that would require us to traverse one of the texts for as many times as we try phrase lengths.

I guess I really just don't know where to start, and in a way that is efficient and not O(n^too_high).

I believe you're looking for is the Longest Common Substring problem, i.e. the problem of finding, given two strings, of the longest substring they both have in common. The link is to the Wikipedia page which will help you understand the domain and contains a pseudocode example of an algorithm that runs in O(nm) time.

Further, Wikibooks' Algorithm Implementation book has an implementation in Ruby. It includes an lcs_size method that may be all you need. In short, if lcs_size(text1, text2) returns 4 that means text1 and text2 have very little consecutive text in common, probably just a single word, but if it returns, say, 40, they might have an entire sentence in common.

Hope that's helpful!

Here's a Ruby implementation of the Levenshtein distance algorithm. After installing the gem, you can use it like that:

require 'rubygems'
require 'Text'

t1 = "absolute obedience to the zerg collective sentience known as the Overmind. The Overmind directed the actions of every zerg creature in the Swarm, functioning through a hierarchy of lesser sentients."

t2 = "zerg creature in the Swarm, functioning through a hierarchy of lesser sentients. Although the Overmind was primarily driven by its desire to consume and assimilate"

puts Text::Levenshtein.distance(t1,t2)

Your problem is not Ruby. It's the algorithm. You could split each text into words, then run a minimum distance algorithm (http://en.wikipedia.org/wiki/Levenshtein_distance) to get that.

The smaller the number, the more similar the texts are.

This could be improved a lot but it's an idea:

txt1 = "absolute obedience to the zerg collective sentience known as the Overmind. The Overmind directed the actions of every zerg creature in the Swarm, functioning through a hierarchy of lesser sentients."
txt2 = "zerg creature in the Swarm, functioning through a hierarchy of lesser sentients. Although the Overmind was primarily driven by its desire to consume and assimilate"

def txt_to_ary(txt)
    txt.gsub(/\.|,/, ' ').downcase.split(/\s+/)
end

def longest_match(txt1, txt2)
    longest = 0
    txt1.each_with_index do |w1, i|
        txt2.each_with_index do |w2, j|
            next unless w1 == w2
            k = 0
            k += 1 while txt1[i+k] == txt2[j+k]
            longest = k if k > longest          
        end
    end
    longest
end

txt1 = txt_to_ary( txt1 )
txt2 = txt_to_ary( txt2 )

puts longest_match(txt1, txt2) #=>12

The amatch gem is perfect for string comparisons.

继续阅读：comparison performance ruby search text

Ruby: How does one test for similarity between two blocks of text?

Text1:

Text2:

Text 3

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？

Text1:

Text2:

Text 3

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集 河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？