Find and fix wrong values in a simple, linear data set

2023-03-08 18:31

This is probably a simple question yet I could not find a good approach.

I've got a limited number of ordered int values that are supposed to be of similar distance to each other, e.g.: 32, 42, 52, 62, 72, 82.

In reality though, some values are wrong. We might end up with 32, 51, 62, 66, 71, 83.

How can I find the obviously wrong value (in this case: 66) and move it to the correct position (42)?

It can be assumed that most data are still valid so it is still possible to calculate a good guess of the correct distance between points (here: 10).
The number of points is known and correct (i.e., we just need to move but not add or remove points).
The data boundaries to the left and to the right are unknown, behavior in edge cases can be defined freely.

While writing the question I thought of something. An idea might be to extract a function f(x) = a + x * b (that's easy) and iterate over the known number of points. The datum with the largest distance to an iterated point is removed and inserted at the iterated position which has the largest distance to an original point.

You can use robust regression, which is nothing more than a fancy term for "fitting a straight line to a bunch of points in such a way that points that don't fit well are gracefully removed".

If you don't want to write the non-linear optimization code, you can use iteratively reweighted least squares to leverage any existing weighted linear regression code you have lying around.

The idea is that you do weighted least squares to fit a straight line to your points, starting with equal weights. You then assign a weight to each point that measures whether you think it's an outlier, i.e. whether it deviates too much from the regression line (e.g. via the Huber loss function). You then redo the regression with the new weights. You'll get a new line and can therefore compute a new set of weights. Repeat until convergence (or a maximum number of iterations). You'll be left with weights that tell you which points are bad, and a line that nicely fits the remaining points and can be used to replace the outliers.

I think the implementation isn't vastly longer than the text description above.
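A minimal sketch in C of the reweighting loop described above, not a definitive implementation. It assumes that the x coordinate of each value is simply its index, and it uses Huber weights (1 inside a band of half-width delta around the line, delta / |residual| outside). The names robust_line_fit, weighted_line_fit, delta and max_iter are made up for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Weighted least-squares fit of y = a + b * x, where x is the index 0..n-1. */
static void weighted_line_fit(const double *y, const double *w, size_t n,
                              double *a, double *b)
{
    double sw = 0, sx = 0, sy = 0, sxx = 0, sxy = 0;
    for (size_t i = 0; i < n; i++) {
        double x = (double)i;
        sw  += w[i];
        sx  += w[i] * x;
        sy  += w[i] * y[i];
        sxx += w[i] * x * x;
        sxy += w[i] * x * y[i];
    }
    double det = sw * sxx - sx * sx;
    assert(det != 0.0);
    *b = (sw * sxy - sx * sy) / det;
    *a = (sy - *b * sx) / sw;
}

/* Iteratively reweighted least squares with Huber weights.
 * On return, w[i] is 1 for points close to the line and smaller
 * the further y[i] lies from it; a and b describe the fitted line. */
void robust_line_fit(const double *y, double *w, size_t n, double delta,
                     int max_iter, double *a, double *b)
{
    assert(n >= 3 && delta > 0.0 && max_iter >= 1);
    for (size_t i = 0; i < n; i++)
        w[i] = 1.0;                      /* first pass: ordinary least squares */
    for (int it = 0; it < max_iter; it++) {
        weighted_line_fit(y, w, n, a, b);
        for (size_t i = 0; i < n; i++) {
            double r = y[i] - (*a + *b * (double)i);
            if (r < 0) r = -r;
            w[i] = (r <= delta) ? 1.0 : delta / r;   /* Huber weight */
        }
    }
}
```

Called with y = { 32, 42, 52, 90, 72, 82 } and delta = 3, the weight of the fourth value ends up far below 1 while the others stay at 1. Note that a misplaced value in a sorted list (like 66 in the question) also shifts the indices of its neighbours, so a fit against the index may not single it out.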

If only one datum is wrong, and assuming increasing values (as in your example): The data goes in DATA and DATA_SIZE, and THRESHOLD is the deviation allowed

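#include <stdio.h>
#include <stdlib.h>  /* abs */

#define THRESHOLD 3  /* allowed deviation between neighbouring steps */

#define DATA 32, 51, 62, 66, 71, 83
#define DATA_SIZE 6

int main(void)
{
    int data[] = {DATA};
    int size = DATA_SIZE;
    int skip, item, lastItem, dif, curDif, diffs, maxDif, maxPos, maxPrev;
    int maxDiffs = 10000, location = -1, newPosition = -1, newValue = 0;

    /* Pretend each datum in turn is the wrong one and leave it out. */
    for (skip = 0; skip < size; skip++)
    {
        diffs = 0;    /* how often the step size changes */
        curDif = 0;   /* step size currently considered normal */
        maxDif = 0;   /* largest gap seen so far */
        maxPos = -1;  /* index of the item right after that gap */
        maxPrev = -1; /* index of the item right before that gap */
        lastItem = (skip == 0);  /* first item that is not skipped */
        for (item = lastItem + 1; item < size; item++)
        {
            if (item == skip) continue;
            dif = data[item] - data[lastItem];
            if (abs(dif - curDif) > THRESHOLD)
            {
                curDif = dif;
                diffs++;
                if (curDif > maxDif)
                {
                    maxDif = curDif;
                    maxPos = item;
                    maxPrev = lastItem;
                }
            }
            lastItem = item;
        }

        /* The skip that leaves the most regular sequence marks the wrong datum. */
        if (diffs < maxDiffs && maxPos >= 0)
        {
            maxDiffs = diffs;
            location = skip;
            newPosition = maxPos;
            /* Put the moved value in the middle of the largest gap.
               maxPrev is used instead of maxPos-1, which may be the skipped item. */
            newValue = data[maxPrev] + maxDif / 2;
        }
    }
    if (location < 0)
    {
        printf("No wrong value found\n");
        return 0;
    }
    printf("Found... \nindex %d\nValue: %d\nGoes in:%d\nNew value:%d\n", location, data[location], newPosition, newValue);
    return 0;
}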

I experimented with a lot of different approaches; this is what I ended up with. The basic idea is to assign good, valid values to the array of expected values. Values that could not be assigned get fixed by using the missing expected values instead.

Given is a list of actual data, peaks.

Build a list of expected data

var expected = Enumerable
    // 19 is the known number of values
    .Range (0, 19)
    // simply interpolate over the actual data
    .Select (x => peaks.First () + x * (peaks.Last () - peaks.First ()) / 18)
    .ToList ();

Build a matrix of the distances of all points

var distances = expected.SelectMany (dst => peaks.Select (src => new {
    Expected = dst,
    Original = src,
    Distance = Math.Abs (dst - src)
}));

Repeat

for (;;)
{

Select the best distance

var best = distances
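    // ignore really bad values
    // (dAvgAll is not defined in this snippet; presumably the average distance between neighbouring points)
    .Where (x => x.Distance < dAvgAll * 0.3)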
    .OrderBy (x => x.Distance).FirstOrDefault ();

If no good assignment was found, quit

if (best == null) {
    break;
}

Else, store the match

expected.Remove (best.Expected);
peaks.Remove (best.Original);

}

All valid entries in our source have been identified and removed. We simply use the left-over values in the expected set and ignore the left-over original values to finish our final data set.
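The matching loop above can be sketched in C roughly as follows. This is an illustration under assumptions, not the code I actually run: the expected values are interpolated between the first and last value as above, the threshold is passed in as max_dist (standing in for dAvgAll * 0.3), matched slots keep the original value, and the names fix_by_matching, max_dist and MAX_POINTS are made up for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define MAX_POINTS 64  /* arbitrary cap for this sketch */

/* Greedily pair each expected slot with the closest unused original value.
 * result[i] receives the matched original value, or the expected value
 * itself if no original value was closer than max_dist. */
void fix_by_matching(const int *peaks, size_t n, int max_dist, int *result)
{
    int expected[MAX_POINTS];
    int slot_used[MAX_POINTS] = {0};
    int peak_used[MAX_POINTS] = {0};

    assert(n >= 2 && n <= MAX_POINTS);

    /* Build the expected values by interpolating over the actual data. */
    for (size_t i = 0; i < n; i++) {
        expected[i] = peaks[0] + (int)i * (peaks[n - 1] - peaks[0]) / (int)(n - 1);
        result[i] = expected[i];
    }

    for (;;) {
        int best = max_dist;              /* only distances below this count */
        size_t best_slot = n, best_peak = n;

        /* Select the best remaining (expected, original) pair. */
        for (size_t s = 0; s < n; s++) {
            if (slot_used[s]) continue;
            for (size_t p = 0; p < n; p++) {
                if (peak_used[p]) continue;
                int d = abs(expected[s] - peaks[p]);
                if (d < best) { best = d; best_slot = s; best_peak = p; }
            }
        }
        if (best_slot == n) break;        /* no good assignment left */

        /* Store the match and take both values out of the pool. */
        slot_used[best_slot] = 1;
        peak_used[best_peak] = 1;
        result[best_slot] = peaks[best_peak];
    }
}
```

For the question's sample { 32, 51, 62, 66, 71, 83 } with max_dist = 3 this yields { 32, 42, 51, 62, 71, 83 }: 66 finds no slot, and the expected value 42 of the one free slot is used instead.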

Other attempted approaches, including a version adapted from gusbro's, worked less well and often displayed bad behavior for me.

I will try to outline an algorithm (I don't know if it would give a correct result for every input sequence, so think of it as an idea):

Input for the algorithm is the ordered sequence R. For Example { 32, 51, 62, 66, 71, 83 }

Find the distance d between points. I'm thinking of:
- Sort the differences between the elements and take the median.
  Sorted differences = { 4, 5, 11, 12, 19 } --> Median = 11
- Or calculate the mean value of the differences.
  Mean Value = 10.2 --> Rounded Mean Value = 10
Build the mean value m of the elements of R.
In our example (32 + 51 + 62 + 66 + 71 + 83) / 6 = 60.8
Rounded = 61
Build a comparative sequence S where the first element S_0 has the value m - (n / 2) * d (where n is the number of elements) and any further element S_i has the value S_0 + i * d.
In our example, with d = 10, S_0 = 61 - 3 * 10 = 31 and S = { 31, 41, 51, 61, 71, 81 }
Because the elements in the input sequence could have moved to another position, build every permutation of R.
Find the permutation where the number of outliers is minimal (an outlier is an element whose difference to the corresponding element of S is greater than 0.3 * d).

                     S = { 31, 41, 51, 61, 71, 81 }
    permutation x of R = { 32, 51, 62, 66, 71, 83 } three outliers
    permutation y of R = { 32, 66, 51, 62, 71, 83 } one outlier
    permutation z of R = ...

The result of the algorithm in this example would be permutation y and with it the correct position of the element 66 is found.
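Two of the building blocks can be sketched in C like this: the median of the neighbour differences, and the outlier count of one candidate ordering against the comparison sequence S. This is only a sketch under assumptions; the function names median_step and count_outliers are made up, the 0.3 * d tolerance is checked in integer arithmetic, and generating the permutations themselves is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Comparison callback for qsort on int values. */
static int cmp_int(const void *p, const void *q)
{
    int a = *(const int *)p, b = *(const int *)q;
    return (a > b) - (a < b);
}

/* Median of the differences between neighbouring elements of r[0..n-1]. */
int median_step(const int *r, size_t n)
{
    assert(n >= 2);
    size_t m = n - 1;
    int *diffs = malloc(m * sizeof *diffs);
    assert(diffs != NULL);
    for (size_t i = 0; i < m; i++)
        diffs[i] = r[i + 1] - r[i];
    qsort(diffs, m, sizeof *diffs, cmp_int);
    /* Odd count: middle element; even count: mean of the two middle ones. */
    int median = (m % 2) ? diffs[m / 2]
                         : (diffs[m / 2 - 1] + diffs[m / 2]) / 2;
    free(diffs);
    return median;
}

/* Number of positions i where |r[i] - s[i]| is greater than 0.3 * d. */
size_t count_outliers(const int *s, const int *r, size_t n, int d)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++)
        if (10 * abs(r[i] - s[i]) > 3 * d)   /* same test as |diff| > 0.3 * d */
            count++;
    return count;
}
```

Calling count_outliers for every permutation of R and keeping the one with the smallest count gives the search described above; since the number of permutations grows as n!, this is only practical for short sequences.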
