Perl - Problem with splitting columns in tab delimited text file and replacing columns with new values
I have a tab-delimited text file made up of a number of rows and columns. I want to change the contents of the first two columns, then write the amended data to a new file.
Before the change, the first two columns of each line look something like this:
COLUMN 1: dip:DIP-41935N|refseq:NP_056092|uniprotkb:Q96PU5
COLUMN 2: dip:DIP-48957N|uniprotkb:P49281
I want them to contain just the ID at the end of each column, so they should look like this:
COLUMN 1: Q96PU5
COLUMN 2: P49281
I split each line at the tabs to get the individual columns, then split the first two columns to get the required IDs ($prot_id1 and $prot_id2). Then I tried substituting each ID for the contents of columns 1 and 2. However, the output in the changed file is not what I expect. Instead it looks something like this:
COLUMN 1: Q96PU5|refseq:NP_056092|uniprotkb:Q96PU5
COLUMN 2: P49281|uniprotkb:P49281
Only the first part of each column has been substituted. I have been playing around with this for hours and cannot figure out what I'm doing wrong. Any help is greatly appreciated. My code is as follows:
#!/usr/bin/perl
use warnings;
use strict;
my $file = 'DIP.txt';
open(INFILE, $file) or die "Can't open file: $!\n";
open(my $outfile, '>', 'DIP_changed.txt');
my @lines = <INFILE>;
foreach $_ (@lines) {
    my @columns = split('\t', $_);
    my $col1 = $columns[0];
    my $col2 = $columns[1];
    my @split_col1 = split ('uniprotkb:', $col1);
    my @split_col2 = split ('uniprotkb:', $col2);
    my $prot_id1 = $split_col1[length(@split_col1)];
    my $prot_id2 = $split_col2[length(@split_col2)];
    print $prot_id1, "\n";
    s/$col1/$prot_id1/;
    s/$col2/$prot_id2/;
    print {$outfile} $_;
}
exit;
There are already some decent answers, but I'd like to show you a simpler solution. You'd use this script like this:
$ script.pl DIP.txt > DIP_changed.txt
And the script itself is really just:
while (<>) {
    s/\S+uniprotkb:(\S+)/$1/;
    s/\S+uniprotkb:(\S+)/$1/;
    print;
}
It doesn't need to be more complicated than that.
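To see what the two substitutions above do to a single line, here is a minimal sketch on a made-up sample line. The third column and the helper name strip_to_uniprot_ids are assumptions for illustration only; they are not part of the original script.

```perl
use strict;
use warnings;

# Hypothetical helper: apply the same substitution twice to one line.
sub strip_to_uniprot_ids {
    my ($line) = @_;
    for ( 1 .. 2 ) {
        # \S+ is greedy but cannot cross a tab, so it swallows the column
        # text up to the last "uniprotkb:"; the capture keeps only the ID.
        $line =~ s/\S+uniprotkb:(\S+)/$1/;
    }
    return $line;
}

# Sample data: the first two columns come from the question, the third
# column is invented.
my $sample = join "\t",
    'dip:DIP-41935N|refseq:NP_056092|uniprotkb:Q96PU5',
    'dip:DIP-48957N|uniprotkb:P49281',
    'remaining columns';

print strip_to_uniprot_ids($sample), "\n";
```

One thing to watch, assuming some rows might lack a uniprotkb entry in column 1 or 2: the substitution is not tied to a particular column, so in that case it would act on a later column instead.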
Try something like this:
This is a neat Perl idiom: match a string against a regular expression like this
$columns[0]=~/:((\w|\d)*)$/;
(note that the parentheses define two capture groups there). In list context the match returns whatever the first, second, and subsequent groups captured, so you can assign the result to an array, or to a list of scalar variables, like this (only the first group, the whole ID, is kept):
($columns[0]) = $columns[0]=~/:((\w|\d)*)$/;
See, you were on the right track but you were making it harder than it needed to be :)
#!/usr/bin/perl
use warnings;
use strict;
my $file = 'DIP.txt';
open(INFILE, $file) or die "Can't open file: $!\n";
open(my $outfile, '>', 'DIP_changed.txt');
foreach my $line (<INFILE>) {
    print "The input line is $line\n";
    my @columns = split('\t', $line);
    ($columns[0]) = $columns[0]=~/:((\w|\d)*)$/;
    ($columns[1]) = $columns[1]=~/:((\w|\d)*)$/;
    printf "The output line is %s\n", join ',', @columns;
    # print rather than printf, so a "%" in the data is not read as a format;
    # join with tabs so the new file stays tab-delimited.
    print $outfile join "\t", @columns;
}
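The list-context match can also be tried on its own. This is a small sketch, not the answer's exact code: it uses \w* in place of (\w|\d)*, which matches the same text because \d is already included in \w, and the variable names are made up.

```perl
use strict;
use warnings;

my $column = 'dip:DIP-41935N|refseq:NP_056092|uniprotkb:Q96PU5';

# In list context a successful match returns its captures, so the
# parentheses around $id matter: they force list assignment.
my ($id) = $column =~ /:(\w*)$/;
print "$id\n";

# A failed match returns an empty list and leaves the variable undefined,
# so a real script may want to check before overwriting a column.
my ($missing) = 'no colon here' =~ /:(\w*)$/;
print defined $missing ? "matched\n" : "no match\n";
```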
ratsbane's answer was pretty good, but after hours of work you probably want to know why you got the result you did. The reason is that $col1 had a pipe in it, and a pipe means "OR" (alternation) in a regex. So when you used $col1 as the pattern in your substitution, the regex you were searching for was
dip:DIP-41935N|refseq:NP_056092|uniprotkb:Q96PU5
Now as a regex, what does it match? It matches only
dip:DIP-41935N
so that is what got replaced!
Hope that helps!
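One way to keep the original substitution approach, sketched here as an assumption about what the asker might prefer, is to escape the interpolated text with \Q...\E (Perl's quotemeta) so the pipes are matched literally. The sample line and variable names below are illustrative.

```perl
use strict;
use warnings;

my $col1    = 'dip:DIP-41935N|refseq:NP_056092|uniprotkb:Q96PU5';
my $prot_id = 'Q96PU5';

# Unescaped: "|" is alternation, so only the first alternative that
# matches ("dip:DIP-41935N") gets replaced.
( my $unescaped = "$col1\trest" ) =~ s/$col1/$prot_id/;

# \Q...\E makes the interpolated text match literally, pipes included.
( my $escaped = "$col1\trest" ) =~ s/\Q$col1\E/$prot_id/;

print "$unescaped\n";
print "$escaped\n";
```

The first line printed reproduces the half-substituted column from the question; the second contains only the ID followed by the rest of the line.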
There's probably no really good reason to slurp the file in at the beginning, rather than processing it line by line. Processing line by line will scale better. With that in mind, I would do it this way:
use warnings;
use strict;
my $file = 'DIP.txt';
open my $in_fh, '<', $file or die $!;
open my $out_fh, '>', 'new' . $file or die $!;
while ( <$in_fh> ) {
    chomp;
    next unless length $_; # Skip blank lines.
    my ( @columns ) = split /\s+/, $_; # Split on whitespace (you may prefer \t).
    foreach my $column ( @columns[ 0, 1 ] ) { # Only the first two columns.
        ( $column ) = $column =~ m{([^:]+)$};
    }
    local $" = "\t";
    print $out_fh "@columns\n";
}
First, this uses the three-argument form of open on both the input file and the output file. This is a good habit to get into. Next, it uses lexical filehandles instead of the old bareword (typeglob) filehandles. Lexicals auto-close when they pass out of scope, and don't become part of the global symbol table.
Next, the script reads the file and processes it line by line, to avoid slurping. This could be advantageous if the file potentially grows large, or if you're in an environment where memory usage is at a premium. Unless you have a good reason to slurp, you may as well get in the habit of not doing so.
Then I split on whitespace. You could split on tabs. Unless there's embedded whitespace in the columns, either way works. Then I iterate over the two columns, matching and capturing from each everything at the end of the column that is not a colon. Or, to put it another way, everything that comes after the last colon. I capture the result right back into the $column variable, which aliases the corresponding element in @columns. That way, when I'm done, @columns only holds my captures.
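The aliasing described above can be seen in isolation in the following sketch. It restricts the loop to the first two elements with an array slice and keeps the original text when a match fails; both choices are assumptions about the desired behaviour, and the third column is made up.

```perl
use strict;
use warnings;

my @columns = (
    'dip:DIP-41935N|refseq:NP_056092|uniprotkb:Q96PU5',
    'dip:DIP-48957N|uniprotkb:P49281',
    'MI:0018(two hybrid)',    # invented third column that should stay as it is
);

# The loop variable aliases each element of the slice, so assigning to
# $column writes straight back into @columns.
foreach my $column ( @columns[ 0, 1 ] ) {
    my ($id) = $column =~ m{([^:]+)$};
    $column = $id if defined $id;    # keep the original text on a failed match
}

print "$_\n" for @columns;
```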
Finally, after processing the two columns, we localize $", assigning to it a tab character. That way when we print the two columns by wrapping @columns in quotes, the interpolation automatically sticks a tab character between the columns again. If you prefer a different character, you now know where to change it.
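A short sketch of how $" affects interpolation, using made-up column values. The bare block is only there to show that local restores the previous separator when the scope ends.

```perl
use strict;
use warnings;

my @columns = ( 'Q96PU5', 'P49281', 'other' );

my $tabbed;
{
    # $" is the separator Perl puts between array elements when the array
    # is interpolated into a double-quoted string.
    local $" = "\t";
    $tabbed = "@columns";
}
my $spaced = "@columns";    # outside the block: back to the default space

print "$tabbed\n$spaced\n";
```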
Then the while loop moves on to the next line. Any blank lines will be skipped.
See perldoc -f open, perlretut, perlvar, and perlop for an explanation of three-argument open and lexical filehandles, an explanation of regexps, Perl's special variables such as $", and how quote-like interpolation works.
Good question!