开发者

Extract a part of string using a pattern, start with a value of variable in Perl

I am very new in Perl and I am confused how to do this task. I have two files:

  1. Seq.txt, which contains many sequences (database)
  2. PID.txt, which contains only IDs (query) for some sequences which I need to extract from Seq.txt file.

Here I am giving a small part of my both files:

Seq.txt contains:

'>' SCO0700,  probable ABC transporter protein, ATP-binding component. 
MASSMEKPLDHRYRGEHPIRTLVYLFRADRRRLAGAVAVFTVKHSPIWLLPLVTAAIVDT
VVQHGPITDLWTSTGLIMFILVVNYPLHLLYVRLLYGSVRRMGTALRSALCTRMQQLSIG  
'>' SCO0755, putative ABC transporter 797720:799942 forward MW:79858
VSTAQETRGRRRAAPPRRSVPKSRARTVRTPTVLQMEAVECGAASLAMVLGHYGRHVPLE
ELRIACGVSRDGSRASNLLKAARSYGFTAKGMQMDLAALAEVTAPAILFWEFNHYVVYDG  
'>' SCO2305,putative ABC transporter ATP-binding subunit 2474063:2474989 forward MW:32345
MRPTEGTTPAVAFTGAAKAYGDVRAVDGVDLRIGCGETVALLGRNGAGKSTTIALLLGLC
PPDAGTVELFGGPAERAVRAGRVGAMLQEARAVPRVTVGELVAFVAGRYPAPMPVGQALE   
'>' SCO1144, putative ABC transporter ATP-binding protein 1202480:1204282 reverse MW:64637
MHPDRESAWTAPADAVEQPRQVRRILKLFRPYRGRLAVVGLLVGAASLVSVATPFLLKEI
LDVAIPEGRTGLLSLLALGMIFGAVLTSVFGVLQTLISTTVGQRVMHDLRT开发者_开发知识库AVYGRLQQM  
'>' SCO1148, putative ABC transporter 1207772:1209553 forward MW:63721
MIGVAPPSYDPAAPTTANTLPVGARPTVRAYVGELLRRHRRAFLFLVTVNTVAVIASMAG
PYLLGGLVERVSDDARELRLGLTATLFVLALVVQAVFVREVRLRGAVLGERMLADLREDF

PID.txt contains:

SCO0755  
SCO1144

Code I have written:

open (PID, 'PID.txt');  
my @PID = '<'PID'>';  
close(PID);  
open (MSD, 'Seq.txt');  
my @MSD = '<'MSD'>';    
close(MSD);  
chomp(@MSD);  
my $MSD=join (' ', @MSD);  
print "$MSD \n";  
for ($i = 0; $i<=2; $i++) {  
  my $a=$PID[$i];  
  if ($MSD =~ m/$a(.*?)>/)  # ">" end of the string  
  {  
    print "$1 \n";  
    $output= ">".$a.$1;  
    print $output;  
    open (MYFILE, '>>data.txt');  
    print MYFILE "$output\n";    
    close (MYFILE);    
  }  
}

Why is it not recognizing $a? If I put [$a], then the binding operator recognize $a but do not return my desired sequence (with ID stored in $a), instead it returns the very first sequence.

The result I expect is:

'>' SCO0755, putative ABC transporter 797720:799942 forward MW:79858
VSTAQETRGRRRAAPPRRSVPKSRARTVRTPTVLQMEAVECGAASLAMVLGHYGRHVPLE
ELRIACGVSRDGSRASNLLKAARSYGFTAKGMQMDLAALAEVTAPAILFWEFNHYVVYDG  
'>' SCO1144, putative ABC transporter ATP-binding protein 1202480:1204282 reverse MW:64637
MHPDRESAWTAPADAVEQPRQVRRILKLFRPYRGRLAVVGLLVGAASLVSVATPFLLKEI
LDVAIPEGRTGLLSLLALGMIFGAVLTSVFGVLQTLISTTVGQRVMHDLRTAVYGRLQQM


First, don't use $a and $b in your code. They're special variables that only make sense within a sort block; avoid them elsewhere, use meaningful variable names instead.

Secondly,

my @PID = '<'PID'>';

Assuming you're trying to read the contents of the filehandle PID into an array, you mean:

my @PID = <PID>;

Thirdly, common best practice these days is to use 3-arg open and lexical filehandles, for instance:

open(my $pidfh, '<', 'PID.txt') or die "...";
my @PID = <$pidfh>;
close $pidfh;

Do you have use strict; at the top of your script?

For what it's worth, I'd read the PIDs you're interested in into a hash for easy lookups, then loop through Seq.txt; remember what entry you're looking at and storing its contents; each time you see a new entry, see if the previous one you've built up is one you want, and if so, print it. This way you needn't hold the contents of the file in memory, which will be useful if it's a very large file.

Something roughly like the following:

#!/usr/bin/perl
use strict;

# Read in a list of PIDs we're interested in
my %want_pid;
open(my $pidfh, '<', 'PID.txt') or die "Failed to open PID.txt - $!";
while (my($pid) = <$pidfh> =~ m{([A-Z0-9]+)}) {
    $want_pid{$pid}++ if $pid;
}

# Now process the file and print entries we want
open(my $seqfh, '<', 'Seq.txt') or die "Failed to open Seq.txt - $!";
my $current_pid;
my $current_text;
while (my $line = <$seqfh>) {
    if (my ($new_pid) = $line =~ m{^ '>' \s+ ([A-Z0-9]+) , }x) {
        # We're at the start of a new entry; if the last one is one we want, 
        # print it.
        if ($want_pid{$current_pid}) {
            print $current_text;
        }

        $current_pid = $new_pid;
        $current_text = $line;
    } else {
        # It's a continuation of an entry
        $current_text .= $line;
    }
}

close $seqfh;

(Room for improvement, but it should get you on the right track.)


I can't tell you why you're getting the output that you're getting because the code you posted is not valid Perl and won't compile or run - my @PID = '<'PID'>'; is syntactically invalid. (It should be my @PID = <PID>;, without any quotes.) Therefore, it is obviously not the code that you're running to produce those results.

The reason you're not getting any matches is because, although the posted code does chomp(@MSD), it doesn't also chomp(@PID), so the PIDs will only match if they're followed by a newline. Which, in the posted data, they aren't. (And, even if they were, the chomp(@MSD) would remove them.)

Fixing this moves you a step closer, but still doesn't produce the results you want because your regex is wrong. Try this one instead (with $a renamed to $target because a: it's a more meaningful name and b: $a and $b are magic, so you shouldn't use them): m/'>' $target([^']*)/

Finally, your for ($i...) loop is incorrect, which is a very easy mistake to make with C-style for. Much better to use for (list) instead.

Fixing all these things, as well as switching to lexical filehandles and the three-argument form of open (as already mentioned by David Precious) and doing some general code cleanup, gives us:

#!/usr/bin/env perl

use strict;
use warnings;

open my $pid_fh, '<', 'PID.txt';
my @PID = <$pid_fh>;
close $pid_fh;
chomp(@PID);

open my $msd_fh, '<', 'Seq.txt';
my @MSD = <$msd_fh>;
close $msd_fh;
chomp(@MSD);

my $msd = join(' ', @MSD);
my $output;
open my $outfile, '>>', 'data.txt';
for my $target (@PID) {
    if ($msd =~ m/'>' $target([^']*)/) {
        $output = ">" . $target . $1;
        print $output, "\n";
        print $outfile "$output\n";
    }
}

...which produces the output:

>SCO0755, putative ABC transporter 797720:799942 forward MW:79858 VSTAQETRGRRRAAPPRRSVPKSRARTVRTPTVLQMEAVECGAASLAMVLGHYGRHVPLE ELRIACGVSRDGSRASNLLKAARSYGFTAKGMQMDLAALAEVTAPAILFWEFNHYVVYDG   
>SCO1144, putative ABC transporter ATP-binding protein 1202480:1204282 reverse MW:64637 MHPDRESAWTAPADAVEQPRQVRRILKLFRPYRGRLAVVGLLVGAASLVSVATPFLLKEI LDVAIPEGRTGLLSLLALGMIFGAVLTSVFGVLQTLISTTVGQRVMHDLRTAVYGRLQQM

The correct sequences are selected; I'll leave getting them formatted exactly as you requested as an exercise for the reader.


Test if this works for you:

use warnings;
use strict;

die "Usage: $0 <pid file> <seq file>\n" unless @ARGV == 2;

open my $pid, "<", $ARGV[0] or die "Error: Cannot open file $ARGV[0]: $!\n";
open my $seq, "<", $ARGV[1] or die "Error: Cannot open file $ARGV[1]: $!\n";

my %pid = ();
while ( <$pid> ) {
    chomp;
    s/^\s*(\S*)\s*$/$1/;
    ++$pid{$_};
}


$/ = "\'>\'";
foreach ( <$seq> ) {
    $_ = substr $_, 0, -3;
    my ($p) = split /\,/;
    $p =~ /(\S+)/;
    print "'>'", $_ if exists $pid{$1};
}

Regards,

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜