Comparing arrays of strings for similarity
I have available to me hundreds of JSON strings. Each of these contains an array of 15-20 words sorted by some predetermined weight. This weight, if it's worth noting, is the amount of times these words are found in some chunk of text. What's the best way of finding similarity between arrays of words that are structured like this?
First idea that came to my head was to create a numerical hash of all the words together and basically compare these values to determine similarity. I wasn't very successful with this, since the resulting hash values of very similar strings were not very close. After some research regarding string comparison algorithms, I come to Stackoverflow in hopes of receiving more guidance. Thanks in advance, and please let me know if you need more details of the problem.开发者_如何学Python
Edit 1: Clarifying what I'm trying to do: I want to determine how similar two arrays are according to the words each of these have. I would also like to take into consideration the weight each word carries in each array. For example:
var array1 = [{"word":"hill","count":5},{"word":"head","count":5}];
var array2 = [{"word":"valley","count":7},{"word":"head","count":5}];
var array3 = [{"word":"head", "count": 6}, {"word": "valley", "count": 5}];
var array4 = [{"word": "valley", "count": 7}, {"word":"head", "count": 5}];
In that example, array 4 and array 2 are more similar than array 2 and array 3 because, even though both have the same words, the weight is the same for both of them in array 4 and 2. I hope that makes it a little bit easier to understand. Thanks in advance.
I think that what you want is "cosine similarity", and you might also want to look at vector space models. If you are coding In Java, you can use the open source S-space package.
(added on 31 Oct) Each element of the vector is the count of one particular string. You just need to transform your arrays of strings into such vectors. In your example, you have three words - "hill", "head", "valley". If your vector is in that order, the vectors corresponding to the arrays would be
// array: #hill, #head, #valley
array1: {5, 5, 0}
array2: {0, 5, 7}
array3: {0, 6, 5}
array4: {0, 5, 7}
Given that each array has to be compared to every other array, you are looking at a serious amount of processing along the lines of ∑(n-1) times the average number of "words" in each array. You'll need to store the score for each comparison, then make some sense of it.
e.g.
var array1 = [{"word":"hill","count":5},{"word":"head","count":5}];
var array2 = [{"word":"valley","count":7},{"word":"head","count":5}];
var array3 = [{"word":"head", "count": 6}, {"word": "valley", "count": 5}];
var array4 = [{"word": "valley", "count": 7}, {"word":"head", "count": 5}];
// Comparison score is summed product of matching word counts
function compareThings() {
var a, b, i = arguments.length,
j, m, mLen, n, nLen;
var word, score, result = [];
if (i < 2) return;
// For each array
while (i--) {
a = arguments[i];
j = i;
// Compare with every other array
while (j--) {
b = arguments[j];
score = 0;
// For each word in array
for (m=0, mLen = b.length; m<mLen; m++) {
word = b[m].word
// Compare with each word in other array
for (n=0, nLen=a.length; n<nLen; n++) {
// Add to score
if (a[n].word == word) {
score += a[n].count * b[m].count;
}
}
}
// Put score in result
result.push(i + '-' + j + ':' + score);
}
}
return result;
}
var results = compareThings(array1, array2, array3, array4);
alert('Raw results:\n' + results.join('\n'));
/*
Raw results:
3-2:65
3-1:74
3-0:25
2-1:65
2-0:30
1-0:25
*/
results.sort(function(a, b) {
a = a.split(':')[1];
b = b.split(':')[1];
return b - a;
});
alert('Sorted results:\n' + results.join('\n'));
/*
Sorted results:
3-1:74
3-2:65
2-1:65
2-0:30
3-0:25
1-0:25
*/
So 3-1 (array4 and array2) have the highest score. Fortunately the comparison need only be one way, you don't have to compare a to b and b to a.
Here is an attempt. The algorithm is not very smart (a difference > 20 is the same as not having the same words), but could be a useful start:
var wordArrays = [
[{"word":"hill","count":5},{"word":"head","count":5}]
, [{"word":"valley","count":7},{"word":"head","count":5}]
, [{"word":"head", "count": 6}, {"word": "valley", "count": 5}]
, [{"word": "valley", "count": 7}, {"word":"head", "count": 5}]
]
function getSimilarTo(index){
var src = wordArrays[index]
, values
if (!src) return null;
// compare with other arrays
weighted = wordArrays.map(function(arr, i){
var diff = 0
src.forEach(function(item){
arr.forEach(function(other){
if (other.word === item.word){
// add the absolute distance in count
diff += Math.abs(item.count - other.count)
} else {
// mismatches
diff += 20
}
})
})
return {
arr : JSON.stringify(arr)
, index : i
, diff : diff
}
})
return weighted.sort(function(a,b){
if (a.diff > b.diff) return 1
if (a.diff < b.diff) return -1
return 0
})
}
/*
getSimilarTo(3)
[ { arr: '[{"word":"valley","count":7},{"word":"head","count":5}]',
index: 1,
diff: 100 },
{ arr: '[{"word":"valley","count":7},{"word":"head","count":5}]',
index: 3,
diff: 100 },
{ arr: '[{"word":"head","count":6},{"word":"valley","count":5}]',
index: 2,
diff: 103 },
{ arr: '[{"word":"hill","count":5},{"word":"head","count":5}]',
index: 0,
diff: 150 } ]
*/
Sort the arrays by word before attempting comparison. Once this is complete, comparing two arrays will require exactly 1 pass through each array.
After sorting the arrays, here is a compare algorithm (psuedo-java):
int compare(array1, array2)
{
returnValue = 0;
array1Index = 0
array2Index = 0;
while (array1Index < array1.length)
{
if (array2Index < array2.length)
{
if (array1[array1Index].word == array2[array2Index].word) // words match.
{
returnValue += abs(array1[array1Index].count - array2[array2Index].count);
++array1Index;
++array2Index;
}
else // account for the unmatched array2 word.
{
// 100 is just a number to give xtra weight to unmatched numbers.
returnValue += 100 + array2[array2Index].count;
++array2Index;
}
}
else // array2 empty and array1 is not empty.
{
// 100 is just a number to give xtra weight to unmatched numbers.
returnValue += 100 + array1[array1Index].count;
}
}
// account for any extra unmatched array 2 values.
while (array2Index < array2.length)
{
// 100 is just a number to give xtra weight to unmatched numbers.
returnValue += 100 + array2[array2Index].count;
}
return returnValue;
}
精彩评论