Find and fix wrong values in a simple, linear data set
This is probably a simple question yet I could not find a good approach.
I've got a limited number of ordered int values that are supposed to be of similar distance to each other, e.g.: 32, 42, 52, 62, 72, 82.
In reality though, some values are wrong. We might end up with 32, 51, 62, 66, 71, 83.
How can I find the obviously wrong value (in this case: 66) and move it to the correct position (42)?
- It can be assumed that most data are still valid so it is still possible to calculate a good guess of the correct distance between points (here: 10).
- The number of points is known and correct (i.e., we just need to move but not add or remove points).
- The data boundaries to the left and to the right are unknown; behavior in edge cases can be defined freely.
While writing the question I thought of something. An idea might be to extract a function f(x) = a + x * b (that's easy) and iterate over the known number of points. The datum with the largest distance to an iterated point is removed and inserted at the iterated position which has the largest distance to an original point.
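A minimal Python sketch of that idea, for a single move. It assumes the first and last values are valid so they can define a and b; the function name `move_worst_point` is made up for this illustration.

```python
def move_worst_point(values):
    """Move the datum that fits the line worst into the emptiest slot."""
    n = len(values)
    a = values[0]                            # assumption: first value is valid
    b = (values[-1] - values[0]) / (n - 1)   # assumption: last value is valid
    grid = [a + x * b for x in range(n)]     # f(x) = a + x * b

    # Index of the datum whose nearest grid point is farthest away.
    worst = max(range(n), key=lambda i: min(abs(values[i] - g) for g in grid))
    # Grid point whose nearest datum is farthest away.
    slot = max(grid, key=lambda g: min(abs(g - v) for v in values))

    moved = values[:worst] + values[worst + 1:] + [round(slot)]
    return sorted(moved)


print(move_worst_point([32, 51, 62, 66, 71, 83]))  # [32, 42, 51, 62, 71, 83]
```

With several wrong values the step would have to be repeated, and a bad first or last value would skew the line, so this is only a starting point.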
You can use robust regression, which is nothing more than a fancy term for "fitting a straight line to a bunch of points in such a way that points that don't fit well are gracefully removed".
If you don't want to write the non-linear optimization code, you can use iteratively reweighted least squares to leverage any existing weighted linear regression code you have lying around.
The idea is that you do weighted least squares to fit a straight line to your points. You then assign a weight to each point that measures whether you think it's an outlier, i.e. whether it deviates from the regression line too much (e.g. via the Huber loss function). You then redo the regression with the weights. You'll get a new line and can therefore compute a new set of weights. Repeat until convergence (or a maximum number of iterations). You'll be left with weights that tell you which points are bad, and a line that nicely fits the remaining points and which can be used to replace the outliers.
I think the implementation isn't vastly longer than the text description above.
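To give a rough idea of the size, here is a sketch of iteratively reweighted least squares with Huber weights in plain Python. The names `weighted_line_fit` and `irls_line`, the band width `delta = 3.0`, the fixed 20 iterations and the 0.5 weight cutoff are all assumptions chosen for illustration. It also assumes each value's index is known, so the sample data is a hypothetical series with one corrupted entry rather than the sorted list from the question, where a misplaced value shifts the indices of its neighbours.

```python
def weighted_line_fit(xs, ys, ws):
    """Weighted least-squares fit of y = a + b * x; returns (a, b)."""
    sw = sum(ws)
    mx = sum(w * x for w, x in zip(ws, xs)) / sw
    my = sum(w * y for w, y in zip(ws, ys)) / sw
    sxx = sum(w * (x - mx) ** 2 for w, x in zip(ws, xs))
    sxy = sum(w * (x - mx) * (y - my) for w, x, y in zip(ws, xs, ys))
    b = sxy / sxx
    return my - b * mx, b


def irls_line(ys, delta=3.0, iterations=20):
    """Fit a line to (index, value) pairs with Huber-weighted IRLS."""
    xs = list(range(len(ys)))
    ws = [1.0] * len(ys)
    for _ in range(iterations):
        a, b = weighted_line_fit(xs, ys, ws)
        residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
        # Huber weights: 1 inside the band, delta / |r| outside it.
        ws = [1.0 if abs(r) <= delta else delta / abs(r) for r in residuals]
    return a, b, ws


# Hypothetical data: the value at index 2 is corrupted (true line: 32 + 10 * x).
ys = [32, 42, 95, 62, 72, 82]
a, b, ws = irls_line(ys)
# Replace strongly down-weighted points with the value predicted by the line.
repaired = [y if w > 0.5 else round(a + b * x)
            for x, (y, w) in enumerate(zip(ys, ws))]
print(ws)
print(repaired)
```

The Huber fit is still pulled slightly towards the outlier, so the replacement value lands near, not exactly on, the original 52. A loss that rejects outliers completely would avoid that at the cost of a harder optimization.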
If only one datum is wrong, and assuming increasing values (as in your example), you can try the code below. The data goes in DATA and DATA_SIZE, and THRESHOLD is the deviation allowed:
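#include <stdio.h>
#include <stdlib.h>  /* abs */

#define THRESHOLD 3
#define DATA 32, 51, 62, 66, 71, 83
#define DATA_SIZE 6

int main(void)
{
    int data[] = {DATA};
    int size = DATA_SIZE;
    int skip, diffs, curDif, maxDif, lastItem, item, dif, maxPos;
    int maxDiffs = 10000, location = 0, newPosition = 0, newValue = 0;

    /* Try leaving out each element in turn. */
    for (skip = 0; skip < size; skip++)
    {
        diffs = 0;    /* how often the step changes by more than THRESHOLD */
        curDif = 0;   /* step size currently considered normal */
        maxDif = 0;   /* largest step seen so far */
        maxPos = -1;  /* index at which the largest step ends */
        lastItem = (skip == 0);  /* start at index 1 if index 0 is skipped */
        for (item = lastItem + 1; item < size; item++)
        {
            if (item == skip) continue;
            dif = data[item] - data[lastItem];
            if (abs(dif - curDif) > THRESHOLD)
            {
                curDif = dif;
                diffs++;
                if (curDif > maxDif)
                {
                    maxDif = curDif;
                    maxPos = item;
                }
            }
            lastItem = item;
        }
        /* The element whose removal leaves the most regular steps is the wrong one. */
        if (diffs < maxDiffs)
        {
            maxDiffs = diffs;
            location = skip;
            newPosition = maxPos;
            /* Put the new value in the middle of the largest gap. */
            newValue = data[maxPos - 1] + (maxDif >> 1);
        }
    }
    printf("Found... \nindex %d\nValue: %d\nGoes in:%d\nNew value:%d\n", location, data[location], newPosition, newValue);
    return 0;
}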
I experimented with a lot of different approaches; this is what I ended up with. The basic idea is to assign good, valid values to the array of expected values. Values that could not be assigned get fixed by using the missing expected values instead.
Given is a list of actual data, peaks.
Build a list of expected data:
var expected = Enumerable
    // 19 is the known number of values
    .Range (0, 19)
    // simply interpolate over the actual data
    .Select (x => peaks.First () + x * (peaks.Last () - peaks.First ()) / 18)
    .ToList ();
Build a matrix of the distances of all points:
var distances = expected.SelectMany (dst => peaks.Select (src => new {
    Expected = dst,
    Original = src,
    Distance = Math.Abs (dst - src)
}));
Repeat the following: select the best distance, quit if no good assignment was found, otherwise store the match.
for (;;)
{
    // Select the best distance
    var best = distances
        // ignore really bad values
        .Where (x => x.Distance < dAvgAll * 0.3)
        .OrderBy (x => x.Distance).FirstOrDefault ();
    // If no good assignment was found, quit
    if (best == null) {
        break;
    }
    // Else, store the match
    expected.Remove (best.Expected);
    peaks.Remove (best.Original);
}
All valid entries in our source have been identified and removed. We simply use the left-over values in the expected set and ignore the left-over original values to finish our final data set.
Other attempted approaches, including a version adapted from gusbro's, worked less well and often displayed bad behavior for me.
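For readers who want to try the matching procedure on the numbers from the question, here is a rough Python sketch. It is not a line-by-line port: the function name `repair` is made up, `dAvgAll` is assumed to mean the average spacing between points, and matched points are assumed to keep their original values.

```python
def repair(peaks, count):
    """Greedily match measured values to an evenly spaced expected grid."""
    peaks = list(peaks)
    first, last = peaks[0], peaks[-1]
    step = (last - first) / (count - 1)  # assumed meaning of dAvgAll
    expected = [first + x * step for x in range(count)]
    kept = []
    while True:
        # All (distance, expected, original) pairs that are close enough.
        pairs = [(abs(e - p), e, p)
                 for e in expected for p in peaks
                 if abs(e - p) < step * 0.3]
        if not pairs:
            break  # no good assignment left
        _, e, p = min(pairs)  # best remaining match
        expected.remove(e)
        peaks.remove(p)
        kept.append(p)
    # Unmatched expected values stand in for the unmatched originals.
    return sorted(kept + [round(e) for e in expected])


print(repair([32, 51, 62, 66, 71, 83], 6))  # [32, 42, 51, 62, 71, 83]
```

Here 66 is the only value farther than 0.3 * 10.2 from every grid point, so it is dropped and the unmatched grid point 42.2 takes its place.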
I will try to outline an algorithm (I don't know if it would give a correct result for every input sequence, so think of it as an idea):
Input for the algorithm is the ordered sequence R. For example { 32, 51, 62, 66, 71, 83 }.
- Find the distance d between points. I'm thinking of:
  - Sort the differences between the elements and take the median.
    Sorted differences = { 4, 5, 11, 12, 19 } --> Median = 11
  - Or calculate the mean value of the differences.
    Mean value = 10.2 --> Rounded mean value = 10
- Build the mean value m of the elements of R.
  In our example (32 + 51 + 62 + 66 + 71 + 83) / 6 ≈ 60.8
- Build a comparative sequence S where the first element S_0 has the value m - (n / 2) * d (where n is the number of elements) and any further element S_i has the value S_0 + i * d.
  In our example S_0 = 60.8 - 3 * 10 = 30.8, rounded down to 30, so S = { 30, 40, 50, 60, 70, 80 }
- Because the elements in the input sequence could have moved to another position, build every permutation of R.
- Find the permutation where the number of outliers is minimal (an outlier is an element whose difference from the corresponding element of S is greater than 0.3 * d).
  S = { 30, 40, 50, 60, 70, 80 }
  permutation x of R = { 32, 51, 62, 66, 71, 83 } three outliers
  permutation y of R = { 32, 66, 51, 62, 71, 83 } one outlier
  permutation z of R = ...
The result of the algorithm in this example would be permutation y, and with it the correct position of the element 66 is found.
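The steps above could be sketched in Python roughly like this. The function name `best_permutation` is made up, d is taken as the rounded mean of the differences, and S_0 is truncated to an integer to match the example; both choices are assumptions.

```python
from itertools import permutations
from statistics import mean


def best_permutation(r):
    """Return (S, permutation of r with the fewest outliers against S)."""
    n = len(r)
    diffs = [b - a for a, b in zip(r, r[1:])]
    d = round(mean(diffs))            # rounded mean spacing
    s0 = int(mean(r) - (n / 2) * d)   # truncated to an integer (assumption)
    s = [s0 + i * d for i in range(n)]

    def outliers(p):
        # |p_i - S_i| > 0.3 * d, written with integers to avoid float rounding.
        return sum(abs(x - y) * 10 > 3 * d for x, y in zip(p, s))

    return s, min(permutations(r), key=outliers)


s, p = best_permutation([32, 51, 62, 66, 71, 83])
print(s)  # [30, 40, 50, 60, 70, 80]
print(p)  # (32, 66, 51, 62, 71, 83)
```

Trying every permutation means n! candidates (720 for six values), so this is only practical for short sequences.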