Comparing the distance between arrays?
How to compare the similarity between two arrays? Say I have:
Base Array: [.5,0,0,0,.25,0,0,.25,0,0,0,0]
Array 1: [1,0,0,0,1,0,0,1,0,0,0,0]
Array 2: [0,0,1,0,0,0,1,0,0,1,0,0]
Array 3: [1,0,0,0,0,0,0,0,0,0,0,0]
Regarding the arrays above, the answer should be Array 1. The answer is Array 1 because, the array elements are 'closer' in structure to the array elements of the base array. Differing from Array 3, .25 is closer to 1 than 0. Another example:
Base Array: [.75,0,0,0,0,0,0,0,.25,0,0,0]
Array 1: [1,0,0,0,1,0,0,1,0,0,0,0]
Array 2: [0,0,1,0,0,0,1,0,0,1,0,0]
Array 3: [1,0,0,0,0,0,0,0,0,0,0,0]
Which in this case, Array 3 should then be the answer.
However, using my current algo (which I will give later), the answer becomes Array 3. Here is what I am using:
for (int i = 0; i < basearray.Length; i++)
{
temp = (basearray[i] - arrayX[i]);
dist += temp * temp;
}
So, I think there is something wrong with my algo? Or maybe, I need to use a 'different' kind of algorithm and not distance (since essentially, .25 IS closer to 0 than 1, but开发者_高级运维 what I want is otherwise).
Thanks!
UPDATE:
I found the answer! Thanks for all those for the help. Here it is:
float[] pbaseArrX = new float[3];
float[] pcompArrX = new float[3];
float dist1 = 0, dist2 = 0;
for (int i = 0; i < baseArrX.Count; i++)
{
pbaseArrX[i] = baseArrX[i] / (baseArrX[0] + baseArrX[1] + baseArrX[2]);
}
//Do the following for both compArr1 and compArr2;
for (int i = 0; i < compArrX.Count; i++)
{
pcompArrX[i] = pcompArrX[i] / (pcompArrX[0] + pcompArrX[1] + pcompArr[2]);
}
//Get distance for both
for (int i = 0; i < pcompArrX.Count; i++)
{
distX = distX + ((pcompArrX[i] - pbaseArrX[i])^2);
}
//Then just use conditional to determine which is 'closer'
It seems like you want to compare the arrays as rays (just direction), but you're comparing them as vectors (direction and magnitude). I'd suggest comparing the arrays with cosine similarity, which is just the cosine of the angle between the vectors and thus comparison of only their directions. For the arrays presented, the cosine similarity between the base array and array 1 is 0.94 while that with array 2 is 0.82, matching your expectations.
Array 3 is the correct answer. The algorithm you're using is giving you the right result.
Basically, for me Array 3 is more similar to the base Array than Array1. What's the pattern that you're looking for? You say Array1 should be the result... why?
Distance is just a way to compare two arrays by an arbitrary mathematical asumption, there has no real "logic" behind it, but that we give it to it.
If you want the result to be Array1 then:
- Define WHY Array1 shall be the result from logical terms.
- Translate WHY Array1 shall be the result to a mathematical formulation
- Implement that formulation
The problem here is that your concept of "similarity" is not clearly defined. Depending on the use case of the data, there are infinitely many ways to define similarity. Leaving your array aside there is an simple example for that:
- Glasses and a Binocular are similar, because you use both of them to look at things.
- Glasses and a Bicycle are similar, because both consist of two circles linked with each other
- Glasses and Grass are similary, because both start with "G" and end with "S"
As you can see unless you define exactly what you need anything can be similar to anything. Humans are good to use the right kind of similarity for the right task, but a computer wont be able to do that, unless you tell it explicitly what you want.
Leaving this point aside, there is one common case of similarity, that is quite often used for sequence data in data mining. This is called the cosine distance, and it is not that different from what you are using. It is called the cosine distance. Here is the algorithm:
for (int i = 0; i < basearray.Length; i++)
{
temp += (basearray[i] * arrayX[i]);
f_base += (basearray[i] * basearray[i]);
f_array += (array[i] * array[i]);
}
dist = 1 - (temp / sqrt( f_base * f_array ));
This is basically just computing the "Angle" between both array depicted as points in n-Dimensional space. Works just fine in most cased and can be adopted easily to other needs (when other kinds of similarity are needed).
Mathematically, each array is a point and the distance measure is called a norm. You are using a version of the Euclidean norm which is our standard measure of spatial distance in three dimensions. It's just missing the square root out because all you're interested in which one is closest as opposed to measuring actual distance so it would still work for you.
In your example, the third array is definitely closest in Euclidean distance, because your base array is a heck of a lot closer to a zero array than your first array is. They may have "similar structure" but you're looking at it in the wrong way. Your distance measure is interested in numerical distance, and 0 (in array 3) is a lot closer to 0.25 than 1 is (in array 1).
If you're looking at "structure" it means you think 0 is a lot more significant than any other number. i.e. you want to reward a matching array for having non-zero's in the same place, rather than being numerically close to 0.
I'm not sure what sort of norm you want for that and, to be honest, this gives me the impression we're missing out on what you need to achieve at the end of the day - it's a bit difficult to make suggestions on what we know so far.
精彩评论