
How do I convert between a measure of similarity and a measure of difference (distance)?

Is there a general way to convert between a measure of similarity and a measure of distance?

Consider a similarity measure like the number of 2-grams that two strings have in common.

2-grams('beta', 'delta') = 1
2-grams('apple', 'dappled') = 4
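For concreteness, here is a minimal sketch of how such a shared 2-gram count could be computed. The helper names `bigrams` and `shared_bigrams` are made up for illustration, and counting repeated 2-grams with multiplicity is an assumption; the question does not say how duplicates should be handled.

```python
from collections import Counter

def bigrams(s):
    """Return a Counter of the character 2-grams in s."""
    return Counter(s[i:i + 2] for i in range(len(s) - 1))

def shared_bigrams(a, b):
    """Count the 2-grams common to both strings (multiset intersection)."""
    return sum((bigrams(a) & bigrams(b)).values())

print(shared_bigrams('beta', 'delta'))     # 1
print(shared_bigrams('apple', 'dappled'))  # 4
```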

What if I need to feed this to an optimization algorithm that expects a measure of difference, like Levenshtein distance?

This is just an example... I'm looking for a general solution, if one exists. For instance, how would I go from Levenshtein distance to a measure of similarity?

I appreciate any guidance you may offer.


Let d denote distance and s denote similarity. To convert a distance measure to a similarity measure, first normalize d to [0, 1] using d_norm = d/max(d), where max(d) is the largest distance in your data. The similarity measure is then given by:

s = 1 - d_norm

where s is in the range [0, 1]: 1 denotes the highest similarity (the items being compared are identical) and 0 denotes the lowest similarity (largest distance).
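A small sketch of this normalization, assuming the distances are non-negative and held in a plain list (the function name `distances_to_similarities` is hypothetical):

```python
def distances_to_similarities(distances):
    """Map non-negative distances to similarities in [0, 1] via s = 1 - d / max(d)."""
    d_max = max(distances)
    if d_max == 0:
        # Every distance is zero, so all items are identical.
        return [1.0 for _ in distances]
    return [1.0 - d / d_max for d in distances]

print(distances_to_similarities([0, 2, 5, 10]))  # [1.0, 0.8, 0.5, 0.0]
```

Note that the result depends on max(d), so similarities computed from different data sets are not directly comparable.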


If your similarity measure (s) is between 0 and 1, you can use one of these (the last two require s > 0):

1-s
sqrt(1-s)
-log(s)
(1/s)-1
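As a rough illustration, the four conversions can be compared side by side. This sketch assumes 0 < s <= 1 so that the logarithm and the reciprocal are defined; the function name is made up for the example.

```python
import math

def similarity_to_distances(s):
    """Apply the four conversions above to a similarity s with 0 < s <= 1."""
    if not 0 < s <= 1:
        raise ValueError("s must satisfy 0 < s <= 1")
    return {
        '1-s': 1 - s,
        'sqrt(1-s)': math.sqrt(1 - s),
        '-log(s)': -math.log(s),
        '(1/s)-1': 1 / s - 1,
    }

for name, d in similarity_to_distances(0.5).items():
    print(name, round(d, 4))
```

All four give 0 at s = 1; the first two stay within [0, 1], while the last two grow without bound as s approaches 0.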


Taking 1/similarity does not preserve the shape of the distribution.

A better way is distance(a->b) = highest similarity - similarity(a->b), where the highest similarity is the largest similarity value you observe. This flips your distribution: the highest similarity becomes distance 0, and so on.
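A minimal sketch of this flip, assuming the similarities are collected in a list (`flip_similarities` is a hypothetical name):

```python
def flip_similarities(similarities):
    """Turn similarities into distances by subtracting each from the largest value."""
    highest = max(similarities)
    return [highest - s for s in similarities]

# Shared 2-gram counts as similarities: the most similar pairs get distance 0.
print(flip_similarities([1, 4, 2, 4]))  # [3, 0, 2, 0]
```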


Yes, there is a very general way to convert between similarity and distance: a strictly monotone decreasing function f(x).

That is, with f(x) you can set similarity = f(distance) or distance = f(similarity). It works in both directions. Such a function works because similarity and distance are inversely related: one decreases when the other increases.

Examples:

These are some well-known strictly monotone decreasing candidates that work for non-negative similarities or distances:

  • f(x) = 1 / (a + x)
  • f(x) = exp(- x^a)
  • f(x) = arccot(ax)

You can choose parameter a>0 (e.g., a=1)
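The three candidates could be written as follows. This is only a sketch: the function names are made up, and arccot is expressed as pi/2 - atan(ax), which matches the usual definition for non-negative arguments.

```python
import math

def reciprocal(x, a=1.0):
    """f(x) = 1 / (a + x)"""
    return 1.0 / (a + x)

def exponential(x, a=1.0):
    """f(x) = exp(-x^a)"""
    return math.exp(-(x ** a))

def arccot(x, a=1.0):
    """f(x) = arccot(a x), written as pi/2 - atan(a x)"""
    return math.pi / 2 - math.atan(a * x)

for x in [0, 0.5, 1, 2, 10]:
    print(x, round(reciprocal(x), 4), round(exponential(x), 4), round(arccot(x), 4))
```

Each column shrinks as x grows, which is the only property the conversion relies on.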

Edit 2021-08

A very practical approach is to use the function sim2diss, available in the statistical software R (in the smacof package). This function provides up to 13 methods to compute dissimilarities from similarities. Sadly, the methods are not explained at all: you have to look into the code :-\


similarity = 1/difference

and watch out for difference = 0


According to scikit learn:

Kernels are measures of similarity, i.e. s(a, b) > s(a, c) if objects a and b are considered “more similar” than objects a and c. A kernel must also be positive semi-definite.

There are a number of ways to convert between a distance metric and a similarity measure, such as a kernel. Let D be the distance, and S be the kernel:

  • S = np.exp(-D * gamma), where one heuristic for choosing gamma is 1 / num_features
  • S = 1. / (D / np.max(D))
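To see what the two formulas do, here is a dependency-free sketch that applies them element-wise to a small distance matrix stored as nested lists, in place of the NumPy arrays used above. The function names and the example matrix are invented for illustration, and gamma = 1/3 assumes three features.

```python
import math

def exp_kernel(D, gamma):
    """Element-wise S = exp(-gamma * D)."""
    return [[math.exp(-gamma * d) for d in row] for row in D]

def inverse_scaled(D):
    """Element-wise S = 1 / (D / max(D)); zero distances map to infinity."""
    d_max = max(max(row) for row in D)
    return [[math.inf if d == 0 else 1.0 / (d / d_max) for d in row] for row in D]

D = [[0.0, 1.0, 4.0],
     [1.0, 0.0, 2.0],
     [4.0, 2.0, 0.0]]

S = exp_kernel(D, gamma=1.0 / 3)
print([round(v, 4) for v in S[0]])
print(inverse_scaled(D)[0])  # [inf, 4.0, 1.0]
```

The exponential form maps a zero distance to similarity 1, whereas the second form is undefined there, which is why the sketch substitutes infinity.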


In the case of Levenshtein distance, you could increase the sim score by 1 for every time the sequences match; that is, 1 for every time you didn't need a deletion, insertion or substitution. That way the metric would be a linear measure of how many characters the two strings have in common.
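One common way to put this into practice is to compute the edit distance and rescale it by the length of the longer string. Note this is a swapped-in normalization, not the exact match-counting scheme described above; the function names are made up for the sketch.

```python
def levenshtein(a, b):
    """Edit distance with unit-cost insertion, deletion and substitution."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution or match
        prev = curr
    return prev[-1]

def levenshtein_similarity(a, b):
    """Similarity in [0, 1]: 1 for identical strings, 0 when nothing lines up."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are identical
    return 1.0 - levenshtein(a, b) / longest

print(levenshtein('kitten', 'sitting'))  # 3
print(levenshtein('beta', 'delta'))      # 2
```

Because the edit distance never exceeds the length of the longer string, the rescaled value always stays between 0 and 1.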


In one of my projects (based on collaborative filtering) I had to convert a correlation (the cosine between vectors), which ranges from -1 to 1 (closer to 1 is more similar, closer to -1 is more diverse), to a normalized distance (close to 0 means a smaller distance, close to 1 a bigger one).

In this case: distance ~ diversity

My formula was: dist = 1 - (cor + 1)/2

If you are converting between similarity and diversity and the domain is [0,1] in both cases, the simplest way is:

dist = 1 - sim

sim = 1 - dist
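A quick sketch of the correlation formula (the function name is hypothetical):

```python
def correlation_to_distance(cor):
    """Map a correlation in [-1, 1] to a distance in [0, 1]."""
    if not -1.0 <= cor <= 1.0:
        raise ValueError("cor must lie in [-1, 1]")
    return 1.0 - (cor + 1.0) / 2.0

print(correlation_to_distance(1.0))   # 0.0
print(correlation_to_distance(0.0))   # 0.5
print(correlation_to_distance(-1.0))  # 1.0
```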


Cosine similarity is widely used for n-gram count or TFIDF vectors.

from math import pi, acos
def similarity(x, y):
    return sum(x[k] * y[k] for k in x if k in y) / sum(v**2 for v in x.values())**.5 / sum(v**2 for v in y.values())**.5

Cosine similarity can be used to compute a formal distance metric, the angular distance, according to wikipedia. It obeys all the properties of a distance that you would expect (symmetry, nonnegativity, etc):

def distance_metric(x, y):
    # Angular distance: 0 for vectors pointing the same way, 1 for orthogonal ones.
    # min() guards against similarities like 1.0000000000000002 caused by rounding.
    return 2 * acos(min(1.0, similarity(x, y))) / pi

For non-negative vectors such as n-gram counts, both of these measures range between 0 and 1. (1 - distance_metric(x, y) is the corresponding angular similarity.)

If you have a tokenizer that produces N-grams from a string you could use these measures like this (Tokenizer here stands in for whatever tokenizer you have):

>>> import Tokenizer  # placeholder for your own n-gram tokenizer
>>> tokenizer = Tokenizer(ngrams=2, lower=True, nonwords_set=set(['hello', 'and']))

>>> from collections import Counter
>>> list(tokenizer('Hello World again and again?'))
['world', 'again', 'again', 'world again', 'again again']
>>> Counter(tokenizer('Hello World again and again?'))
Counter({'again': 2, 'world': 1, 'again again': 1, 'world again': 1})
>>> x = _
>>> Counter(tokenizer('Hi world once again.'))
Counter({'again': 1, 'world once': 1, 'hi': 1, 'once again': 1, 'world': 1, 'hi world': 1, 'once': 1})
>>> y = _
>>> sum(x[k]*y[k] for k in x if k in y) / sum(v**2 for v in x.values())**.5 / sum(v**2 for v in y.values())**.5
0.42857142857142855
>>> round(distance_metric(x, y), 4)
0.718

I found the elegant inner product of Counter in this SO answer
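If you do not have such a tokenizer at hand, a self-contained variant can use character 2-grams instead. This is a sketch under assumptions: the helper names are invented, and the distance is taken to be the angular distance 2·acos(similarity)/pi for non-negative count vectors.

```python
from collections import Counter
from math import acos, pi, sqrt

def char_bigrams(s):
    """Counter of character 2-grams; assumes s has at least two characters."""
    return Counter(s[i:i + 2] for i in range(len(s) - 1))

def cosine_similarity(x, y):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(x[k] * y[k] for k in x if k in y)
    norms = sqrt(sum(v * v for v in x.values())) * sqrt(sum(v * v for v in y.values()))
    return dot / norms

def angular_distance(x, y):
    """0 for identical direction, 1 for orthogonal non-negative vectors."""
    # min() guards against values marginally above 1 caused by rounding.
    return 2 * acos(min(1.0, cosine_similarity(x, y))) / pi

x = char_bigrams('apple')
y = char_bigrams('dappled')
print(round(cosine_similarity(x, y), 4))  # 0.8165
print(round(angular_distance(x, y), 4))   # 0.3918
```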
