Edit/Levenstein distance with time stamp - different paths with similar (minimal) cost
I am using the Edit/Levenstein distance to measure similarity between words. Unlike the simplest implementation, my letters have time stamps, let's say in sample numbers N=0,1,2,...
The problem I am facing is that I can get different paths along the cost matrix which end in the same (minimal) cost, and these different paths are associated with different target string. For example, if I measure the distance between the source string aa
and target string bab
, and I assume the source string starts on time stamp N=0, then I have 2 paths with the same cost of 2 (one addition and one substitution):
- Add
b
at N=-1, leave the 1sta
as it is, and substitute the 2nda
with ab
. - Substitute the 1st
a
with ab
, leave the 2nda
as it is, and addb
at N=2.
Aligned on the time line, these 2 results are different:
Time: ... -1 0 1 2 3 ...
Source: a a
Target1: b a b
Target2: b a b
I need to know when that happens, so I can choose between the two possible targets based on some criteria. Is there any other way other then tracing the path along the way and keeping track of all possible paths which lead to the minimal cost?
I've considered using Dynamic Time Warp instead, since the time-line is part of the model in the first place, but it seems that since the cost matrix is initialized to infinity (except for the [0,0] entry开发者_Python百科), the first step will always be matching the 1st frame of the target to the 1st frame of the source, resulting in the target starting at the same time stamp as the source. Anyway, using DTW I still have to trace all paths leading to the same minimal cost.
Any help or insights are welcomed.
Thinking some more about your problem, it seems a bit that there is a misunderstanding of DTW or Levensthein. Both algorithms try to squish and expand the sequences to match them with each other. So in the DTW case your example would have the following solutions:
Solution1:
a a
/| |
b a b
Solution2:
a a
| |\
b a b
Solution3:
a a
|\|\
b a b
If you have a look at these solutions, you will notice that all of these have a cost of 2, i.e. in all cases 2 b
s get assigned to as. What these examples mean is, that in the first sequence one timestamp gets squished together compared to the second sequence. For example in the first solution the first two timestamps of b a
get squished to form a single timestep corresponding to the first a
of the second sequence (the second sequence is just reversed, the third solution is more complex). DTW is meant to deal with sequences that are played at different speed at certain parts, hence the "time-warping" analogy.
If your timesteps are really fixed and you only need to align them, without any actual warping considered, you might just try all alignments and calculate the costs.
Something like this (assuming str2 to be the shorter one):
for i = 0 to length( str1 ) - length( str2 ) do
shift str2 by i to the left
calculate number of different position between shifted str2 and str1
done
return i with lowest difference
Assuming you need both shifting as well as warping (something might have been added to the beginning and the timesteps might not match), then considere subsequence DTW. For this you just need to relax the boundary conditions.
Assuming you index your string at one instead of zero you can write DTW like this:
diff( x, y ) = 1 if str1 at x != str2 at x
0 otherwise
cost( 0, 0 ) = 0;
cost( 0, * ) = infinity;
cost( *, 0 ) = infinity;
cost( x, y ) = min( cost( x-1, y-1 ), cost( x-1, y ), cost( y, y-1) ) + diff( x, y )
DTW-Cost then is cost( length( str1 ), length( str2 ) )
and your path can be traced back from there. For subsequence DTW you simply change this:
diff( x, y ) = 1 if str1 at x != str2 at x
0 otherwise
cost( 0, 0 ) = 0;
cost( 0, * ) = 0;
cost( *, 0 ) = infinity; // yes this is correct and needed
cost( x, y ) = min( cost( x-1, y-1 ), cost( x-1, y ), cost( y, y-1) ) + diff( x, y )
Then you pick your DTW-cost as min( cost( x, length( str2 ) )
and trace back from argmin( cost( x, length( str2 ) )
. This assumes you know one string to be the substring of the other. If you do not know this and both might only have a common warped middle you will have to do partial matching, which as far as I know is still a open research topic, since one needs to pick a notion of "optimality" which cannot clearly be defined.
精彩评论