Perl - Problem with splitting columns in tab delimited text file and replacing columns with new values

2023-03-18 08:45

I have a tab-delimited text file made up of a number of rows and columns. I want to change the contents of the first two columns, then write the amended data to a new file.

Before the change, the first two columns of each line look something like this:

COLUMN1:    dip:DIP-41935N|refseq:NP_056092|uniprotkb:Q96PU5

COLUMN2:    dip:DIP-48957N|uniprotkb:P49281

I want them to just contain the id number at the end of each column, so I want them to be as follows:

COLUMN1:    Q96PU5

COLUMN2:    P49281

I have split the lines at the tab to get the individual columns, then split the first two columns to get the required ID number ($prot_id). Then I tried substituting the ID for the contents of columns 1 and 2. However, the output in the changed file is not what I expect. Instead it looks something like this:

COLUMN1:    Q96PU5|refseq:NP_056092|uniprotkb:Q96PU5

COLUMN2:    P49281|uniprotkb:P49281

Just the first part of the columns has been substituted. I have been playing around with this for hours and cannot figure out what I'm doing wrong. Any help greatly appreciated. My code is as follows:

#!/usr/bin/perl  

use warnings;
use strict;


my $file = 'DIP.txt';

open(INFILE, $file) or die "Can't open file: $!\n";
open(my $outfile, '>', 'DIP_changed.txt'); 
my @lines = <INFILE>;


foreach $_ (@lines) {
    my @columns = split('\t', $_);

            my $col1 = $columns[0];
            my $col2 = $columns[1];


            my @split_col1 = split ('uniprotkb:', $col1);
            my @split_col2 = split ('uniprotkb:', $col2);

            my $prot_id1 = $split_col1[length(@split_col1)];
            my $prot_id2 = $split_col2[length(@split_col2)];

            print $prot_id1, "\n";

             s/$col1/$prot_id1/;
             s/$col2/$prot_id2/;

            print {$outfile} $_; 
}



exit;

There are already some decent answers, but I'd like to show you a simpler solution. You'd use this script like this:

$ script.pl DIP.txt > DIP_changed.txt

And the script itself is really just:

while (<>) {
    s/\S+uniprotkb:(\S+)/$1/;
    s/\S+uniprotkb:(\S+)/$1/;
    print;
}

It doesn't need to be more complicated than that.
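To see why running the same substitution twice is enough, here is a small sketch on a single sample line. The first two columns come from the question; the third column (MI:0018) is invented for illustration. Since \S+ cannot cross a tab, each pass rewrites one whole column.

```perl
use strict;
use warnings;

# Sample line modelled on the question; the third column is an assumption.
my $line = "dip:DIP-41935N|refseq:NP_056092|uniprotkb:Q96PU5\tdip:DIP-48957N|uniprotkb:P49281\tMI:0018\n";

# First pass: the leftmost match covers all of column 1.
(my $pass1 = $line) =~ s/\S+uniprotkb:(\S+)/$1/;

# Second pass: column 1 no longer contains "uniprotkb:", so column 2 matches.
(my $pass2 = $pass1) =~ s/\S+uniprotkb:(\S+)/$1/;

print $pass1;
print $pass2;
```

This relies on the uniprotkb entry being present in both columns; a line that lacks it in column 1 would have some later column rewritten instead.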

Try something like this:

This is a neat Perl idiom - match a string on a regular expression like this

$columns[0]=~/:((\w|\d)*)$/;

(note that the parentheses define two capture groups there) and assign the captures ($1, $2, and so on) to an array, or to a list of scalar variables. Only the first capture is kept here, because the list on the left has a single element:

($columns[0]) = $columns[0]=~/:((\w|\d)*)$/;

See, you were on the right track but you were making it harder than it needed to be :)
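As a minimal sketch of that idiom in isolation: the sample value is column 2 from the question, and the regex is simplified to (\w*), since \d is already covered by \w. The variable names are made up for the example.

```perl
use strict;
use warnings;

my $column = 'dip:DIP-48957N|uniprotkb:P49281';

# In list context a successful match returns its captures ($1, $2, ...).
my ($id) = $column =~ /:(\w*)$/;

# In scalar context the same match only reports success or failure.
my $matched = $column =~ /:(\w*)$/;

# A failed match returns an empty list, so the variable stays undefined.
my ($none) = 'no colon here' =~ /:(\w*)$/;

print "$id\n";
```

The last case is worth checking for in real data: a column without a colon would leave the element undefined.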

#!/usr/bin/perl  

use warnings;
use strict;

my $file = 'DIP.txt';

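open(INFILE, $file) or die "Can't open file: $!\n";
open(my $outfile, '>', 'DIP_changed.txt') or die "Can't open output file: $!\n";


foreach my $line (<INFILE>) {
    print "The input line is $line\n";
    my @columns = split('\t', $line);

    # Keep only the ID after the last colon in each of the first two columns.
    ($columns[0]) = $columns[0]=~/:((\w|\d)*)$/;
    ($columns[1]) = $columns[1]=~/:((\w|\d)*)$/;

    printf  "The output line is  %s\n", join ',', @columns;
    # Rejoin with tabs so the output stays tab-delimited. Use print, not
    # printf, so the data is never treated as a format string.
    print {$outfile} join "\t", @columns;

    }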

ratsbane's answer was pretty good, but after hours of working on it you probably want to know why you got the result you did. The reason is that $col1 had a pipe in it, and in a regex a pipe means "OR" (alternation). So when you used $col1 as the pattern in the substitution, the regex was

dip:DIP-41935N|refseq:NP_056092|uniprotkb:Q96PU5

Now as a regex, what does it match? It means "dip:DIP-41935N or refseq:NP_056092 or uniprotkb:Q96PU5", and the first of those to occur in the line is

dip:DIP-41935N

so that is what got replaced!
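One way to keep the original substitution approach is to quote the variable with \Q...\E (or quotemeta), so that the pipe is matched literally. The sketch below assumes the two sample columns from the question plus an invented third column, and hard-codes the IDs for brevity.

```perl
use strict;
use warnings;

# Sample line modelled on the question; the third column is an assumption.
my $line = "dip:DIP-41935N|refseq:NP_056092|uniprotkb:Q96PU5\tdip:DIP-48957N|uniprotkb:P49281\tother\n";
my ($col1, $col2) = split /\t/, $line;

# Unquoted: "|" is alternation, so only the first alternative is replaced.
(my $broken = $line) =~ s/$col1/Q96PU5/;

# Quoted with \Q...\E: the whole column is matched as a literal string.
(my $fixed = $line) =~ s/\Q$col1\E/Q96PU5/;
$fixed =~ s/\Q$col2\E/P49281/;

print $broken;
print $fixed;
```

The first print reproduces the unexpected output from the question; the second gives the two bare IDs followed by the untouched third column.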

Hope that helps!

There's probably no really good reason to slurp the file in at the beginning, rather than processing it line by line. Processing line by line will scale better. With that in mind, I would do it this way:

use warnings;
use strict;


my $file = 'DIP.txt';

open my $in_fh, '<', $file or die $!;
open my $out_fh, '>', 'new' . $file or die $!;

while ( <$in_fh> ) {
    chomp;
    next unless length $_; # Skip blank lines.
    my ( @columns ) = split /\s+/, $_; # Split on whitespace (you may prefer \t).
    foreach my $column ( @columns[0, 1] ) { # Only the first two columns.
        ( $column ) = $column =~ m{([^:]+)$};
    }
    local $" = "\t";
    print $out_fh "@columns\n";
}

First, this uses the three-argument version of open on both the input file and the output file. This is a good habit to get into. Next, it uses lexical filehandles instead of the old bareword (typeglob) filehandles. Lexical filehandles close automatically when they go out of scope, and they don't become part of the global symbol table.

Next, the script reads the file and processes it line by line, to avoid slurping. This could be advantageous if the file grows large, or if you're in an environment where memory is at a premium. Unless you have a good reason to slurp, you may as well get in the habit of not doing so.

Then I split on whitespace. You could split on tabs instead; unless there's embedded whitespace in the columns, either way works. Then I iterate over the two columns, matching and capturing from each everything at the end of the column that is not a colon. Put another way, everything that comes after the last colon. I capture the result right back into the $column variable, which aliases the corresponding element in @columns. That way, when I'm done, the processed elements of @columns hold only my captures.
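The aliasing behaviour is easy to check on its own. This is a small sketch using the two sample columns from the question, not part of the script above:

```perl
use strict;
use warnings;

my @columns = (
    'dip:DIP-41935N|refseq:NP_056092|uniprotkb:Q96PU5',
    'dip:DIP-48957N|uniprotkb:P49281',
);

foreach my $column (@columns) {
    # $column is an alias for the array element, so assigning to it
    # changes @columns in place.
    ($column) = $column =~ m{([^:]+)$};
}

print join("\t", @columns), "\n";
```

After the loop, @columns holds the two bare IDs, even though the loop body never mentions @columns by name.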

Finally, after processing the two columns, we localize $" and assign it a tab character. That way, when we print @columns by interpolating it in a double-quoted string, Perl automatically puts a tab between the columns again. If you prefer a different separator, you now know where to change it.
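For illustration, here is a minimal sketch of how $" affects interpolation and how local limits the change to one block. The variable names are invented for the example.

```perl
use strict;
use warnings;

my @columns = ('Q96PU5', 'P49281');

# By default, array elements are joined with a single space.
my $default = "@columns";

my $tabbed;
{
    # Change the list separator for this block only.
    local $" = "\t";
    $tabbed = "@columns";
}

# Outside the block the default separator is back.
my $restored = "@columns";

print "$default\n$tabbed\n$restored\n";
```

An explicit join("\t", @columns) gives the same result without touching a global variable, which some people find easier to read.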

Then the while loop moves on to the next line. Any blank lines will be skipped.

See perldoc -f open, perlretut, perlvar, and perlop for an explanation of three-argument open and lexical filehandles, an introduction to regexps, Perl's special variables such as $", and how quote-like interpolation works.

Good question!
