Fast(er) algorithm for the Length of the Longest Common Subsequence (LCS)
Problem: Need the Length of the LCS between two strings. The size of the strings is at most 100 characters. The alphabet is the usual DNA one, 4 characters "ACGT". The dynamic approach is not quick enough.
My problem is that I am dealing with lot's and lot's of pairs (of the rank of hundreds of million as far as I can see). I believe I have decreased the calling of the LCS_length function to the minimum possible so the only other way to make my program run faster is to have a more efficient LCS_Length function.
I have started off by implementing in the usual dynamic programming approach. That gives the correct answer and is hopefully implemented in properly.
#define arrayLengthMacro(a) strlen(a) + 1
#define MAX_STRING 101
static int MaxLength(int lengthA, int lengthB);
/*
* Then the two strings are compared following a dynamic computing
* LCS table algorithm. Since we only require the length of the LCS
* we can get this rather easily.
*/
int LCS_Length(char *a, char *b)
{
int lengthA = arrayLengthMacro(a),lengthB = arrayLengthMacro(b),
LCS = 0, i, j, maxLength, board[MAX_STRING][MAX_STRING];
maxLength = MaxLength(lengthA, lengthB);
//printf("%d %d\n", lengthA, lengthB);
for (i = 0; i < maxLength - 1; i++)
{
board[i][0] = 0;
board[0][i] = 0;
}
for (i = 1; i < lengthA; i++)
{
for (j = 1; j < lengthB; j++)
{
/* If a match is found we allocate the number in (i-1, j-1) incremented
* by 1 to the (i, j) position
*/
if (a[i - 1] == b[j - 1])
{
board[i][j] = board[i-1][j-1] + 1;
if(LCS < board[i][j])
{
LCS++;
}
}
else
{
if (board[i-1][j] > board[i][j-1])
{
board[i][j] = board[i-1][j];
}
else
{
board[i][j] = board[i][j-1];
}
}
}
}
return LCS;
}
That should be O(mn) (hopefully).
Then in search for speed I found this post Longest Common Subsequence Which gave the O(ND) paper by Myers. I tried this which relates the LCS with the shortest edit script (SES). The relation they give is D = M + N - 2L, where D is the length of the SES, M and N are the lengths of the two strings and L is the LCS length. But this isn't always correct in my implementation. I give my implementation (which I think is correct):
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#define arrayLengthMacro(a) strlen(a)
int LCS_Length (char *a, char *b);
int MinLength (int A, int B);
int Max (int A, int B);
int snake(int k, int max, char *a, char *b, int lengthA, int lengthB);
int main(void)
{
int L;
char a[] = "tomato", b[] = "potato"; //must give LCS = 4
L = LCS_Length(a, b);
printf("LCS: %d\n", L );
char c[] = "GTCGTTCGGAATGCCGTTGCTCTGTAAA", d[] = "ACCGGTCGAGTGCGCGGAAGCCGGCCGAA"; // must give LCS = 20
L = LCS_Length(c, d);
printf("LCS: %d\n", L );
char e[] = "human", f[] = "chimpanzee"; // must give LCS = 4
L = LCS_Length(e, f);
printf("LCS: %d\n", L );
char g[] = "howareyou", h[] = "whoareyou"; // LCS =8
L = LCS_Length(g, h);
printf("LCS: %d\n", L );
char i[] = "TTCTTTCGGTAACGCCTACTTTATGAAGAGGTTACATTGCAATCGG开发者_StackOverflow社区GTAAATTAACCAACAAGTAATGGTAGTTCCTAGTAGAGAAACCCTCCCGCTCAC",
j[] = "GCACGCGCCTGTTGCTACGCTCTGTCCATCCTTTGTGTGCCGGGTACTCAGACCGGTAACTCGAGTTGCTATCGCGGTTATCAGGATCATTTACGGACTC"; // 61
L = LCS_Length(i, j);
printf("LCS: %d\n", L );
return 0;
}
int LCS_Length(char *a, char *b)
{
int D, lengthA = arrayLengthMacro(a), lengthB = arrayLengthMacro(b),
max, *V_, *V, i, k, x, y;
max = lengthA + lengthB;
V_ = malloc(sizeof(int) * (max+1));
if(V_ == NULL)
{
fprintf(stderr, "Failed to allocate memory for LCS");
exit(1);
}
V = V_ + lengthA;
V[1] = 0;
for (D = 0; D < max; D++)
{
for (k = -D; k <= D; k = k + 2)
{
if ((k == -D && V[k-1] < V[k+1]) || (k != D && V[k-1] < V[k+1]))
{
x = V[k+1];
}
else
{
x = V[k-1] + 1;
}
y = x - k;
while ((x < lengthB) && (y < lengthA) && (a[x+1] == b[y+1]))
{
x++;
y++;
}
V[k] = x;
if ((x >= lengthB) && (y >= lengthA))
{
return (lengthA + lengthB - D)/2;
}
}
}
return (lengthA + lengthB - D)/2;
}
There are examples in the main. Eg. "tomato" and "potato" (don't comment), have an LCS length of 4. The implementation finds that it is 5 sice the SES (D in the code) comes out as 2 instead of 4 that I'd expect (delete "t", insert "p", delete "m", insert "t"). I am inclined to think that maybe the O(ND) algorithm counts substitutions as well, but I am not sure about this.
Any approach that is implementable (I don't have immense programming skills), is welcome. (If someone would know how to take advantage of the small alphabet for example).
EDIT: I think it would be useful to say on top of everything else, that I use the LCS function between ANY two strings at ANY time. So it is not only string say s1, compared with few million others. It might be s200 with s1000, then s0 with s10000, and then 250 with s100000. Not likely to be able to track most used strings either. I require that the LCS length is NOT an approximation, since I am implementing an approximation algorithm and I don't want to add extra error.
EDIT: Just ran callgrind. For an input of 10000 strings I seem to call the lcs function about 50,000,000 times, for different pairs of strings. (10000 strings is the lowest I want to run and the max is 1 million (if that is feasible)).
I'm not familiar with the fancier-than-dynamic-programming algorithms for computing LCS, but I wanted to point out a few things:
First, the O(ND) approach only makes sense if you're comparing very large, very similar strings. This doesn't seem to be the case for you.
Second, speeding up the asymptotic performance of your LCD_Length function is probably not what you should be focusing on since your strings are pretty short. If you only care about finding similar or dissimilar pairs (and not all pairs' exact LCS), then the BK-tree mentioned by Yannick looks like a promising way to go.
Finally, some things bothered me about your DP implementation. The correct interpretation of "board[i][j]" in your code is "the longest subsequence of strings a[1..i], b[1..j]" (I'm using 1-indexing in this notation). Therefore, your for loops should include i = lengthA and j = lengthB. It looks like you hacked around this bug in your code by introducing arrayLengthMacro(a), but that hack doesn't make sense in the context of the algorithm. Once "board" is filled, you can look up the solution in board[lengthA][lengthB], which means you can get rid of the unnecessary "if (LCS < board[i][j])" block and return board[lengthA][lengthB]. Also, the loop bounds look wrong in the initialization (I'm pretty sure it should be for (i = 0; i <= maxLength; i++) where maxLength = max(lengthA, lengthB)).
I'd recommend getting hold of a copy of Gusfield's Algorithms on Strings, Trees and Sequences which is all about string operations for computational biology.
There several ways to make your computation faster:
- Instead of plain dynamic programming, you can use A* search (by using a heuristic that partial alignment up to (i,j) must necessarily have |i-j| deletions or insertions in it).
- If you're comparing one sequence with a whole bunch of other ones, you can save time by computing the dynamic programming table (or the A* search state) for the part leading up to that prefix and re-use the part of your computation. If you stick with the dynamic programming table, you could sort the library of strings by lexicographic order and then only throw away the part that changes (e.g. if you have 'banana' and want to compare it with 'panama' and 'panamericanism', you can reuse the part of the table that covers 'panam').
- If most of the strings are quite similar, you can save time by looking for a common prefix and excluding the common prefix from the computation (e.g. the LCS of "panama" and "panamericanism" is the common prefix "panam" plus the LCS of "a" and "ericanism")
- if the strings are quite dissimilar, you can use the symbol counts to get a lower bound on the number of edits (e.g., "AAAB" to "ABBB" need at least 2 edits because there are 3 As in one and only 1 A in the other). This can also be used in the heuristic for an A* search.
EDIT: for the comparing-to-the-same-set-of-stings case, one person suggested a BK-Tree data structure in Efficient way of calculating likeness scores of strings when sample size is large?
精彩评论