开发者

Smoothing of Sequences

I think there should be an algorithm for this out there - probably in a field like bioinformatics (the problem reminds me a bit of sequence alignment) so I hope someone can help me out here.

The problem is as follows: Assume I have classified some data into two different classes X and Y. The result of this may look something like this: ..XXX Y XXX.. Further assume that we have some domain knowledge about those classes and know that it's extr开发者_开发知识库emely unlikely to have less than a certain number of instances in a row (ie it's unlikely that there are less than 4 Xs or Ys in a sequence - preferably I could use a different threshold per class but that's not a must). So if we use this domain knowledge it's "obvious" that we'd like to replace the single Y in the middle with a X.

So the algorithm should take a sequence of classified instances and the thresholds for the classes (or 1 threshold for all if it simplifies the problem) and try to find a sequence that fulfills the property (no sequences of classes shorter than the given threshold). Obviously there can be an extremely large number of correct solutions (eg in the above example we could also replace all X with a Y) so I think a reasonable optimization criterium would be to minimize the number of replacements.

I don't need an especially efficient algorithm here since the number of instances will be rather small (say < 4k) and we'll only have two classes. Also since this is obviously only a heuristic I'm fine with some inaccuracies if they vastly simplify the algorithm.


A very similar problem to this can be solved as a classic dynamic programming shortest path problem. We wish to find the sequence which minimises some notion of cost. Penalise each character in the sequence that is different from the corresponding character in the original sequence. Penalise each change of character in the sequence, so penalise each change from X to Y and vice versa.

This is not quite what you want because the penalty for YYYXYYY is the same as the penalty for YXXXXXXY - one penalty for YX and one for XY - however it may be a good approximation because e.g. if the base sequence says YYY....YXY....YY then it will be cheaper to change the central X to a Y than to pay the cost of XY and YX - and you can obviously fiddle with the different cost penalties to get something that looks plausible.

Now think of each position in the sequence as being two points, one above the other, one point representing "X goes here" and one representing "Y goes here". You can link points with lines of cost depending on whether the corresponding character is X or Y in the original sequence, and whether the line joins an X with an X or an X with a Y or so on. Then work out the shortest path from left to right using a dynamic program that works out the best paths terminating in X and Y at position i+1, given knowledge of the cost of the best paths terminating in X and Y at position i.

If you really want to penalise short lived changes more harshly than long lived changes you can probably do so by increasing the number of points in the path-finding representation - you would have points that correspond to "X here and the most recent Y was 3 characters ago". But depending on what you want for a penalty you might end up with an incoveniently large number of points at each character.


You can use dynamic programming as in the following pseudocode sketch (for simplicity, this code assumes the threshold is 3 Xs or Ys in a row, rather than 4):

min_switch(s):
  n = len(s)
  optx = array(4, n, infinity) // initialize all values to infinity
  opty = array(4, n, infinity) // initialize all values to infinity
  if s[0] == 'X':
    optx[1][0] = 0
    opty[1][0] = 1
  else:
    optx[1][0] = 1
    opty[1][0] = 0
  for i in {1, n - 1}:
    x = s[i]
    if x == 'X':
      optx[1][i] = opty[3][i - 1]
      optx[2][i] = optx[1][i - 1]
      optx[3][i] = min(optx[2][i - 1], optx[3][i - 1])
      opty[1][i] = 1 + min(optx[1][i - 1], optx[2][i - 1], optx[3][i - 1])
      opty[2][i] = 1 + opty[1][i - 1]
      opty[3][i] = 1 + min(opty[2][i - 1], opty[3][i - 1])
    else:
      optx[1][i] = 1 + min(opty[1][i - 1], opty[2][i - 1], opty[3][i - 1])
      optx[2][i] = 1 + opty[1][i - 1]
      optx[3][i] = 1 + min(opty[2][i - 1], opty[3][i - 1])
      opty[1][i] = optx[3][i - 1]
      opty[2][i] = opty[1][i - 1]
      opty[3][i] = min(opty[2][i - 1], opty[3][i - 1])
  return min(optx[3][n - 1], opty[3][n - 1])

The above code essentially computes the lowest cost of creating a smooth sequence up to the ith character storing the optimal value for all relevant numbers of consecutive Xs or Ys in a row (1, 2, or 3 in a row). More formally

  • opt[i][0][k] stores the smallest cost to convert the string s[0...k] into a smooth sequence then ends in i consecutive Xs. Runs of 3 or more are accounted for in opt[3][0][k].
  • opt[0][j][k] stores the smallest cost to convert the string s[0...k] into a smooth sequence then ends in j consecutive Ys. Runs of 3 or more are accounted for in opt[0][3][k].

It is straightforward to convert this to an algorithm that returns the sequence as well as the optimal cost.

Note that some of the cases in the above code are probably unnecessary, it's just a straightforward recurrence derived from the constraints.

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜