Measuring Error Rates Between Rank-Order Lists
I'm trying to measure the agreement between two different systems of classification (one of them based on machine-learning algorithms, and the other based on human ground-truthing), and I'm looking for input from someone who's implemented a similar sort of system.
The classification schema allows each item to be classified into multiple different nodes in a category taxonomy, where each classification carries a weight coefficient. For example, if some item can be classified into four different taxonomy nodes, the result might look like this for the algorithmic and ground-truth classifiers:
ALGO TRUTH
CATEGORY A: 0.35 0.50
CATEGORY B: 0.30 0.30
CATEGORY C: 0.25 0.15
CATEGORY D: 0.10 0.05
The weights across all selected category nodes will always add up to exactly 1.0. (The classification taxonomy itself contains about 200 nodes.)
In the example above, it's important to note that both lists agree about the rank ordering (ABCD), so they should be scored as being in strong agreement with one another (even though there are some differences in the weights assigned to each category). By contrast, in the next example, the two classifications are in complete disagreement with respect to rank order:
ALGO TRUTH
CATEGORY A: 0.40 0.10
CATEGORY B: 0.35 0.15
CATEGORY C: 0.15 0.35
CATEGORY D: 0.10 0.40
So a result like this should get a very low score.
One final example demonstrates a common case where the human-generated ground-truth contains duplicate weight values:
ALGO TRUTH
CATEGORY A: 0.40 0.50
CATEGORY B: 0.35 0.50
CATEGORY C: 0.15 0.00
CATEGORY D: 0.10 0.00
So it's important that the metric tolerates lists without a perfect rank ordering (since the ground truth could be validly interpreted as ABCD, ABDC, BACD, or BADC).
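For concreteness, here's roughly how a single classification like the ones above could be represented in code. This is just an illustrative sketch (a plain map from taxonomy node to weight, with the weights summing to 1.0); elsewhere in this post I just use parallel double[] arrays holding the weights in matching category order:

import java.util.LinkedHashMap;
import java.util.Map;

// One classification: taxonomy node -> weight, with the weights summing to 1.0.
Map<String, Double> algo = new LinkedHashMap<String, Double>();
algo.put("CATEGORY A", 0.35);
algo.put("CATEGORY B", 0.30);
algo.put("CATEGORY C", 0.25);
algo.put("CATEGORY D", 0.10);

Map<String, Double> truth = new LinkedHashMap<String, Double>();
truth.put("CATEGORY A", 0.50);
truth.put("CATEGORY B", 0.30);
truth.put("CATEGORY C", 0.15);
truth.put("CATEGORY D", 0.05);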
Stuff I've tried so far:
Root Mean Squared Error (RMSE): Very problematic. It doesn't account for rank-order agreement, which means that gross disagreements between categories at the top of the list are swept under the rug by agreement about categories at the bottom of the list.
Spearman's Rank Correlation: Although it accounts for differences in rank, it gives equal weight to rank agreements at the top of the list and those at the bottom of the list. I don't really care much about low-level discrepancies, as long as the high-level discrepancies contribute to the error metric. It also doesn't handle cases where multiple categories have tied ranks.
Kendall Tau Rank Correlation Coefficient: Has the same basic properties and limitations as Spearman's Rank Correlation, as far as I can tell.
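Both of those rank-correlation baselines are easy to reproduce off the shelf. Here's a minimal sketch using Apache Commons Math (just one convenient implementation, not necessarily the one I used), fed with the rank-reversed example from above:

import org.apache.commons.math3.stat.correlation.KendallsCorrelation;
import org.apache.commons.math3.stat.correlation.SpearmansCorrelation;

// Weights for the same four categories, in the same order, from both classifiers.
double[] algo  = { 0.40, 0.35, 0.15, 0.10 };
double[] truth = { 0.10, 0.15, 0.35, 0.40 };

// Both coefficients live in [-1, 1]; a perfectly rank-reversed pair like this scores -1.
double spearman = new SpearmansCorrelation().correlation(algo, truth);
double kendall  = new KendallsCorrelation().correlation(algo, truth);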
I've been thinking about rolling my own ad-hoc metrics, but I'm no mathematician, so I'd be suspicious of whether my own little metric would provide much rigorous value. If there's some standard methodology for this kind of thing, I'd rather use that.
Any ideas?
Okay, I've decided to implement a weighted RMSE. It doesn't directly account for rank-ordering relationships, but the weighting system automatically emphasizes those entries at the top of the list.
Just for review (for anyone not familiar with RMSE), the equation looks like this, assuming two different classifiers A and B, whose results are contained in arrays of the same name:
RMSE = SQRT( SUM( (Ai - Bi)^2 ) / N )
(RMSE equation image: http://benjismith.net/images/rmse.png)
In Java, the implementation looks like this:
double[] A = getAFromSomewhere();
double[] B = getBFromSomewhere();
// Assumes that A and B have the same length. If not, your classifier is broken.
int count = A.length;
double sumSquaredError = 0;
// Accumulate the squared differences across all categories.
for (int i = 0; i < count; i++) {
double aElement = A[i];
double bElement = B[i];
double error = aElement - bElement;
double squaredError = error * error;
sumSquaredError += squaredError;
}
double meanSquaredError = sumSquaredError / count;
double rootMeanSquaredError = Math.sqrt(meanSquaredError);
That's the starting point for my modified implementation. I needed to come up with a weighting system that accounts for the combined magnitude of the two values (from both classifiers), so I'll multiply each squared-error value by SQRT(Ai^2 + Bi^2), which is just the plain-old Euclidean distance of the point (Ai, Bi) from the origin.
Of course, since I use a weighted error in the numerator, I need to also use the sum of all weights in the denominator, so that my results are renormalized back into the (0.0, 1.0) range.
I call the new metric "RMWSE", since it's a Root Mean Weighted Squared Error. Here's what the new equation looks like:
RMWSE = SQRT( SUM( Wi * (Ai - Bi)^2 ) / SUM( Wi ) ), where Wi = SQRT(Ai^2 + Bi^2)
(RMWSE equation image: http://benjismith.net/images/rmwse.png)
And here's what it looks like in Java:
double[] A = getAFromSomewhere();
double[] B = getBFromSomewhere();
// Assumes that A and B have the same length. If not, your classifier is broken.
int count = A.length;
double sumWeightedSquaredError = 0;
double sumWeights = 0;
// Accumulate the weighted squared errors, along with the total weight used for normalization.
for (int i = 0; i < count; i++) {
double aElement = A[i];
double bElement = B[i];
double error = aElement - bElement;
double squaredError = error * error;
double weight = Math.sqrt((aElement * aElement) + (bElement * bElement));
double weightedSquaredError = weight * squaredError;
sumWeightedSquaredError += weightedSquaredError;
sumWeights += weight;
}
double meanWeightedSquaredError = sumWeightedSquaredError / sumWeights;
double rootMeanWeightedSquaredError = Math.sqrt(meanWeightedSquaredError);
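As a quick sanity check, here's the same loop wrapped in a helper method (rmwse is just a name I'm using here, not anything in my codebase), run against the raw weights from the first two examples above, before any of the renormalization I describe below:

// Hypothetical helper that wraps the loop above.
static double rmwse(double[] A, double[] B) {
    double sumWeightedSquaredError = 0;
    double sumWeights = 0;
    for (int i = 0; i < A.length; i++) {
        double error = A[i] - B[i];
        double weight = Math.sqrt((A[i] * A[i]) + (B[i] * B[i]));
        sumWeightedSquaredError += weight * error * error;
        sumWeights += weight;
    }
    return Math.sqrt(sumWeightedSquaredError / sumWeights);
}

// Ranks agree (first example): a modest score.
double agreeing = rmwse(new double[] { 0.35, 0.30, 0.25, 0.10 },
                        new double[] { 0.50, 0.30, 0.15, 0.05 });

// Ranks reversed (second example): a substantially larger score.
double reversed = rmwse(new double[] { 0.40, 0.35, 0.15, 0.10 },
                        new double[] { 0.10, 0.15, 0.35, 0.40 });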
To give you an idea of how this weight works in practice, let's say my two classifiers produce 0.95 and 0.85 for some category. The error between those two values is 0.10, but the weight is 1.2748 (which I arrived at using SQRT(0.95^2 + 0.85^2)), so the weighted error contribution (weight times error) is 0.12748.
Likewise, if the classifiers produce 0.45 and 0.35 for some other category, the error is still just 0.10, but the weight is only 0.5701, and the weighted error is therefore only 0.05701.
So any category with high values from both classifiers will be more heavily weighted than categories with a high value from only a single classifier, or categories with low values from both classifiers.
This works best when my classification values are renormalized so that the maximum value in both A and B is 1.0, with all the other values scaled up proportionately. Consequently, the weights no longer sum to 1.0 for any given classifier, but that doesn't really matter anyhow, since I wasn't exploiting that property for anything useful.
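In case that renormalization step isn't obvious, it's just a division by the largest weight in each list (a throwaway sketch, not production code):

// Scale a classification so its largest weight becomes 1.0 and the other
// weights scale up proportionately. Illustrative only.
static double[] renormalize(double[] values) {
    double max = 0;
    for (double v : values) {
        max = Math.max(max, v);
    }
    if (max == 0) {
        return values.clone();  // all-zero list: nothing to scale
    }
    double[] scaled = new double[values.length];
    for (int i = 0; i < values.length; i++) {
        scaled[i] = values[i] / max;
    }
    return scaled;
}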
Anecdotally, I'm pretty happy with the results this is giving in my dataset, but if anyone has any other ideas for improvement, I'd be totally open to suggestions!
I don't think you need to worry about rigour to this extent. If you want to weight certain types of agreement more than others, that is perfectly legitimate.
For example, you could calculate Spearman's only for the top k categories; you should still get perfectly reasonable answers.
You can also do a z-transform etc. to map everything to [0,1] while preserving what you consider to be the "important" pieces of your data set (variance, difference etc.) Then you can take advantage of the large number of hypothesis testing functions available.
(As a side note, you can modify Spearman's to account for ties. See Wikipedia.)
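For instance, a tie-aware Spearman restricted to the top k ground-truth categories could look roughly like this (a rough sketch only; topKSpearman, averageRanks, and pearson are names I'm making up for illustration):

// Spearman's rho, with average ranks for ties, computed over the k categories
// that the ground truth weights most heavily.
static double topKSpearman(double[] algo, double[] truth, int k) {
    int kk = Math.min(k, truth.length);
    double[] a = new double[kk];
    double[] t = new double[kk];
    boolean[] used = new boolean[truth.length];

    // Select the kk categories with the largest ground-truth weights.
    for (int pick = 0; pick < kk; pick++) {
        int best = -1;
        for (int i = 0; i < truth.length; i++) {
            if (!used[i] && (best < 0 || truth[i] > truth[best])) best = i;
        }
        used[best] = true;
        a[pick] = algo[best];
        t[pick] = truth[best];
    }

    // Tie-corrected Spearman is just the Pearson correlation of the average ranks.
    return pearson(averageRanks(a), averageRanks(t));
}

// Average ("fractional") ranks: tied values share the mean of the rank positions they occupy.
static double[] averageRanks(double[] values) {
    int n = values.length;
    double[] ranks = new double[n];
    for (int i = 0; i < n; i++) {
        int below = 0, equal = 0;
        for (int j = 0; j < n; j++) {
            if (values[j] < values[i]) below++;
            else if (values[j] == values[i]) equal++;
        }
        ranks[i] = below + (equal + 1) / 2.0;  // ranks run from 1 to n
    }
    return ranks;
}

static double pearson(double[] x, double[] y) {
    int n = x.length;
    double meanX = 0, meanY = 0;
    for (int i = 0; i < n; i++) { meanX += x[i]; meanY += y[i]; }
    meanX /= n;
    meanY /= n;
    double sxy = 0, sxx = 0, syy = 0;
    for (int i = 0; i < n; i++) {
        sxy += (x[i] - meanX) * (y[i] - meanY);
        sxx += (x[i] - meanX) * (x[i] - meanX);
        syy += (y[i] - meanY) * (y[i] - meanY);
    }
    return sxy / Math.sqrt(sxx * syy);  // NaN if either list is constant (all ties)
}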