Does a regular expression exist for enzymatic cleavage?
Does a regular expression exist for (theoretical) tryptic cleavage of protein sequences? The cleavage rule for trypsin is: after R or K, but not before P.
Example:
Cleavage of the sequence VGTKCCTKPESERMPCTEDYLSLILNR
should result in these 3 sequences (peptides):
VGTK
CCTKPESER
MPCTEDYLSLILNR
Note that there is no cleavage after K in the second peptide (because P comes after K).
In Perl (it could just as well have been in C#, Python or Ruby):
my $seq = 'VGTKCCTKPESERMPCTEDYLSLILNR';
my @peptides = split /someRegularExpression/, $seq;
I have used this work-around (where a cut marker, =, is first inserted in the sequence and removed again if P is immediately after the cut marker):
my $seq = 'VGTKCCTKPESERMPCTEDYLSLILNR';
$seq =~ s/([RK])/$1=/g; #Main cut rule.
$seq =~ s/=P/P/g; #The exception.
my @peptides = split( /=/, $seq);
But this requires modifying a string that can potentially be very long, and there can be millions of sequences. Is there a way to do this with a regular expression passed directly to split? If so, what would the regular expression be?
Test platform: Windows XP 64 bit. ActivePerl 64 bit. From perl -v: v5.10.0 built for MSWin32-x64-multi-thread.
You indeed need to use the combination of a positive lookbehind and a negative lookahead. The correct (Perl) syntax is as follows:
my @peptides = split(/(?!P)(?<=[RK])/, $seq);
You could use look-around assertions to exclude those cases. Something like this should work:
split(/(?<=[RK](?!P))/, $seq)
You can use lookaheads and lookbehinds to match the cut site while still getting the correct position, because both assertions are zero-width:
/(?<=[RK])(?!P)/
This should split at each point after an R or K that is not followed by a P.
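Since the question says the language could just as well be Python, here is a minimal sketch of the same pattern used with re.split. It assumes Python 3.7 or later, where re.split accepts zero-width matches; the function name tryptic_split is only for illustration.

```python
import re

def tryptic_split(seq):
    # Split at every position preceded by R or K and not followed by P.
    # Zero-width patterns in re.split need Python 3.7+ (assumption).
    parts = re.split(r"(?<=[RK])(?!P)", seq)
    # A cut site at the very end of the sequence yields a trailing
    # empty string, so empty pieces are dropped.
    return [p for p in parts if p]

print(tryptic_split("VGTKCCTKPESERMPCTEDYLSLILNR"))
# ['VGTK', 'CCTKPESER', 'MPCTEDYLSLILNR']
```

Unlike Perl's split, Python keeps the trailing empty field, which is why the sketch filters it out.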
In Python you can use the finditer method to return non-overlapping pattern matches, including start and span information. You can then store the string offsets instead of rebuilding the string.
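A rough sketch of that idea, assuming the zero-width cut-site pattern (?<=[RK])(?!P); the names CUT_SITE and tryptic_offsets are made up for the example. It records (start, end) offsets per peptide and only slices the string when a peptide is actually needed.

```python
import re

# Zero-width pattern: a position preceded by R or K and not followed by P.
CUT_SITE = re.compile(r"(?<=[RK])(?!P)")

def tryptic_offsets(seq):
    """Return (start, end) offsets of the theoretical tryptic peptides in seq."""
    offsets = []
    start = 0
    for match in CUT_SITE.finditer(seq):
        end = match.start()  # position of the cut; the match itself is empty
        if end > start:
            offsets.append((start, end))
            start = end
    # Whatever follows the last cut site is the final peptide.
    if start < len(seq):
        offsets.append((start, len(seq)))
    return offsets

seq = "VGTKCCTKPESERMPCTEDYLSLILNR"
offsets = tryptic_offsets(seq)
print(offsets)                          # [(0, 4), (4, 13), (13, 27)]
print([seq[s:e] for s, e in offsets])   # ['VGTK', 'CCTKPESER', 'MPCTEDYLSLILNR']
```

Storing offsets avoids creating a modified copy of each sequence, which is the concern raised in the question about very long strings and millions of sequences.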