
Algorithm for discrete similarity metric

Given that I have two lists that each contain a separate subset of a common superset, is there an algorithm to give me a similarity measurement?

Example:

A = { John, Mary, Kate, Peter } and B = { Peter, James, Mary, Kate }

How similar are these two lists? Note that I do not know all elements of the common superset.

Update: I was unclear and I have probably used the word 'set' in a sloppy fashion. My apologies. Clarification: Order is important. If identical elements occupy the same position in the list, we have the highest similarity for that element. The similarity decreases the farther apart the identical elements are. The similarity is even lower if the element only exists in one of the lists.

I could even add the extra dimension that lower indices are of greater value, so a[1] == b[1] is worth more than a[9] == b[9], but that is mainly because I am curious.


The Jaccard Index (aka Tanimoto coefficient) is used precisely for the use case recited in the OP's question.

The Tanimoto coeff, tau, is equal to Nc divided by Na + Nb - Nc, or

tau = Nc / (Na + Nb - Nc)
  • Na, number of items in the first set

  • Nb, number of items in the second set

  • Nc, intersection of the two sets, or the number of unique items common to both a and b

Here's Tanimoto coded as a Python function:

def tanimoto(x, y):
    # w holds the items common to both lists (the intersection)
    w = [ns for ns in x if ns in y]
    # float() before dividing avoids integer truncation under Python 2
    return float(len(w)) / (len(x) + len(y) - len(w))
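
For example, applied to the OP's lists, three of the five distinct names are shared:

A = ["John", "Mary", "Kate", "Peter"]
B = ["Peter", "James", "Mary", "Kate"]
print(tanimoto(A, B))    # 3 / (4 + 4 - 3) = 0.6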


I would explore two strategies (both sketched below):

  1. Treat the lists as sets and apply set ops (intersection, difference)
  2. Treat the lists as strings of symbols and apply the Levenshtein algorithm
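
A rough sketch of both, using only the Python standard library (difflib.SequenceMatcher stands in for a proper edit distance here; its ratio is not Levenshtein, but it is similarly order-sensitive):

from difflib import SequenceMatcher

A = ["John", "Mary", "Kate", "Peter"]
B = ["Peter", "James", "Mary", "Kate"]

# Strategy 1: set operations (order is ignored)
shared = set(A) & set(B)
union = set(A) | set(B)
print(len(shared) / float(len(union)))        # 3 / 5 = 0.6

# Strategy 2: sequence comparison (order matters)
# SequenceMatcher accepts any sequences of hashable items, not just strings.
print(SequenceMatcher(None, A, B).ratio())    # 2 * 2 matched items / 8 = 0.5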


If you truly have sets (i.e., an element is simply either present or absent, with no count attached) and only two of them, just adding the number of shared elements and dividing by the total number of elements is probably about as good as it gets.

If you have (or can get) counts and/or more than two of them, you can do a bit better than that with something like cosine similarity or TFIDF (term frequency * inverse document frequency).

The latter attempts to give lower weighting to words that appear in all (or nearly all) the "documents" -- i.e., sets of words.
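
A minimal sketch of cosine similarity over element counts (with two duplicate-free lists this reduces to plain set overlap, so it only pays off once you have real counts or more than two lists to compare):

import math
from collections import Counter

def cosine_similarity(x, y):
    # Turn each list into a bag of counts.
    cx, cy = Counter(x), Counter(y)
    # Dot product over the elements the two bags share.
    dot = sum(cx[k] * cy[k] for k in cx if k in cy)
    norm_x = math.sqrt(sum(v * v for v in cx.values()))
    norm_y = math.sqrt(sum(v * v for v in cy.values()))
    return dot / (norm_x * norm_y)

A = ["John", "Mary", "Kate", "Peter"]
B = ["Peter", "James", "Mary", "Kate"]
print(cosine_similarity(A, B))    # 3 / (2 * 2) = 0.75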


What is your definition of "similarity measurement?" If all you want is how many items the two sets have in common, you can find the cardinalities of A and B, add them together, and subtract the cardinality of the union of A and B from that sum.
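
In other words, |A ∩ B| = |A| + |B| - |A ∪ B|. A quick check in Python, assuming the lists can be treated as sets:

A = {"John", "Mary", "Kate", "Peter"}
B = {"Peter", "James", "Mary", "Kate"}
print(len(A) + len(B) - len(A | B))    # 3 (Mary, Kate, Peter)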


If order matters you can use the Levenshtein distance or another kind of edit distance.
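
A minimal sketch of the standard dynamic-programming Levenshtein distance, applied to lists of names rather than strings (dividing by the longer length to get a similarity score is just one possible normalization, not part of the algorithm itself):

def levenshtein(a, b):
    # Row of edit distances between the empty prefix of a and each prefix of b.
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # substitute x -> y
        prev = curr
    return prev[-1]

A = ["John", "Mary", "Kate", "Peter"]
B = ["Peter", "James", "Mary", "Kate"]
dist = levenshtein(A, B)                        # 3 edits
print(1.0 - float(dist) / max(len(A), len(B)))  # order-sensitive similarity: 0.25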
