How can I retrieve the N-th line from a text file using Perl?

How can I print the 1st, 10th, 20th, ... lines (by line number, not array index) of a long list of text? Of course, the following doesn't work:

for(my $i=0; $i<=$arr_size; $i+=10){
    print $arr[$i],"\n";
}


If you are reading from a filehandle:

while (my $line = <$fh>) {
    if ($. == 1 or not $. % 10) {
        print $line;
    }
}

If you have a scalar that holds a bunch of lines like:

my $s = join "", map { "$_\n" } "a" .. "z";

Then you can treat the scalar like a file by passing a reference to it during an open:

open my $fh, "<", \$s
    or die "could not open in-memory file: $!";

and then use the solution above.

Putting it all together, you get

#!/usr/bin/perl

use strict;
use warnings;

my $s = join "", map { "$_\n" } "a" .. "z";

open my $fh, "<", \$s
    or die "could not open in-memory file: $!";

while (my $line = <$fh>) {
    if ($. == 1 or not $. % 10) {
        print "$. $line";
    }
}

Note that this trick only works if perl was built with PerlIO, but that has been the default since Perl 5.8. If your version of perl wasn't compiled with PerlIO, you will need to grab IO::Scalar from CPAN.

For truly insane levels of weirdness, you could use Tie::File on the in-memory file:

#!/usr/bin/perl

use strict;
use warnings;

use Tie::File;

my $s = join "", map { "$_\n" } "a" .. "z";

open my $fh, "<", \$s
    or die "could not open in-memory file: $!";

tie my @lines, "Tie::File", $fh
    or die "could not tie in-memory file: $!";

my $i = 0;
while (defined $lines[$i]) {
    print "$lines[$i]\n";
} continue {
    $i += 10;
}


Here's how you'd do it with a regex taking advantage of the /g modifier.

my $count = 0;
my @found;
while($text =~ /\G(.*)\n/g) {
    next if $count++ % 10 != 0;

    push @found, $1;
}

I benchmarked it at about 50% faster than Chas.'s scalar-ref filehandle solution for small strings of fewer than 100 lines, but at 1000 lines and up it levels off to just 20% faster.

Chas.'s filehandle solution is safer (if you write the regex wrong you can end up with an infinite loop), simpler, and neither significantly slower nor noticeably heavier on memory. Use that.
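
For reference, here is a self-contained version of the regex loop above. The fragment leaves $text undefined, so the sample input below (the letters "a" through "z", one per line) is an assumption borrowed from the earlier answer. Note that a final line without a trailing newline would not match this pattern.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sample input (an assumption for illustration): 26 lines, "a" through "z".
my $text = join "", map { "$_\n" } "a" .. "z";

my $count = 0;
my @found;
# With /g in scalar context, each match resumes at pos($text).
# Every successful match consumes at least the newline, so the
# position advances on each iteration.
while ($text =~ /\G(.*)\n/g) {
    next if $count++ % 10 != 0;   # keep lines 1, 11, 21, ...
    push @found, $1;
}

print "$_\n" for @found;
```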


Here's a benchmark using my solution of a simple filehandle read versus Schwern's regex and Chas.'s tie-ing.

This is Perl 5.12.2 running on my Mac Pro:

                 Rate Chas. Chas. modified drewk Schwern Chas. sane drewk2 brian
Chas.          70.0/s    --           -33%  -94%    -94%       -95%   -95%  -96%
Chas. modified  104/s   48%             --  -91%    -91%       -92%   -93%  -94%
drewk          1163/s 1560%          1019%    --     -5%       -15%   -23%  -35%
Schwern        1220/s 1641%          1073%    5%      --       -11%   -20%  -32%
Chas. sane     1370/s 1856%          1218%   18%     12%         --   -10%  -23%
drewk2         1515/s 2064%          1358%   30%     24%        11%     --  -15%
brian          1786/s 2450%          1618%   54%     46%        30%    18%    --

This is Perl 5.10.1 on the same machine:

                 Rate Chas. Chas. modified drewk Schwern Chas. sane drewk2 brian
Chas.          66.9/s    --           -35%  -94%    -95%       -95%   -96%  -96%
Chas. modified  103/s   54%             --  -91%    -92%       -93%   -93%  -94%
drewk          1111/s 1560%           981%    --    -17%       -22%   -27%  -40%
Schwern        1333/s 1892%          1197%   20%      --        -7%   -12%  -28%
Chas. sane     1429/s 2034%          1290%   29%      7%         --    -6%  -23%
drewk2         1515/s 2164%          1374%   36%     14%         6%     --  -18%
brian          1852/s 2667%          1702%   67%     39%        30%    22%    --

These results don't surprise me that much. Tie::File seems slower than it should be, but I expected it to be slow. It's nifty, but I find Tie::File is often a poor trade-off in performance for a nice interface to something that wasn't that hard to start with. It's nice if you need random and repeated access, but for single-pass sequential access it's the wrong tool. Chas. does a bit more work than I think he really needs to in that example. We know the indices of the lines that we want, so we can just take a slice of the tied array. The slice ("Chas. modified") is about 50% faster than the while loop, that is, roughly 1.5 times the rate.
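
As a rough sketch of the slice idea, the wanted indices can be computed up front and fetched in one go. A plain array stands in for the tied one here (an assumption, so that the example runs without a file), and the array is assumed to be non-empty. The count is taken from the number of lines rather than from the last index, so an input of exactly 30 lines still yields line 30.

```perl
use strict;
use warnings;

# Stand-in for the tied array (an assumption): "line 1" .. "line 30".
my @lines = map { "line $_" } 1 .. 30;

# Wanted line numbers are 1, 10, 20, 30, ... so the indices are 0, 9, 19, 29, ...
my $tens    = int( scalar(@lines) / 10 );   # how many multiples of 10 fit
my @indices = ( 0, map { $_ * 10 - 1 } 1 .. $tens );
my @found   = @lines[@indices];             # one slice instead of a loop

print "$_\n" for @found;
```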

To see an extreme result, I replicated the lines 1,000 times (so, about 1,300,000 lines in the file):

 $scalar = slurp( $file ) x 1000;

These are the results for the big file on Perl 5.12.2:

                  Rate Chas. Chas. modified drewk drewk2 Schwern Chas. sane brian
Chas.          0.695/s    --           -32%  -91%   -94%    -94%       -95%  -96%
Chas. modified  1.02/s   46%             --  -86%   -91%    -92%       -93%  -94%
drewk           7.38/s  962%           626%    --   -34%    -39%       -47%  -59%
drewk2          11.2/s 1512%          1002%   52%     --     -7%       -19%  -38%
Schwern         12.1/s 1635%          1086%   63%     8%      --       -13%  -33%
Chas. sane      13.9/s 1896%          1264%   88%    24%     15%         --  -23%
brian           18.0/s 2495%          1674%  144%    61%     50%        30%    --

drewk's solutions, which create new arrays, show their scaling problem now. Since they aren't any simpler than the other solutions and they have this big drawback, there's no reason to do it that way.

Here's my benchmark program. There's a very slight difference in the programs. My solution (and Chas.'s first solution) gets the 1st, 10th, 20th, and so on lines as noted in the question text. The other solutions get the 1st, 11th, 21st and so on lines as noted in the broken code. That doesn't really matter for the benchmark though.

#!perl
use strict;
use warnings;

use File::Slurp qw(slurp);
use Tie::File;
use Benchmark qw(cmpthese);
use vars qw($scalar);

chomp( my $file = `perldoc -l perlfaq5` );
#$file = '/Users/brian/Desktop/lines';
print "file is $file\n";
$scalar = slurp( $file );

cmpthese( 1000, {
    'Chas.'          => \&chas,
    'Schwern'        => \&schwern,
    'brian'          => \&brian,
    'Chas. modified' => \&chas_modified,
    'Chas. sane'     => \&chas_sane,
    'drewk'          => \&drewk,
    'drewk2'         => \&drewk2,
    });

sub drewk {
   my @arr = split(/\n/, $scalar);
   my @found;
   for(my $i=0; $i<=$#arr; $i+=10){
    #  print "drewk[$i] $arr[$i]\n";
      push @found, $arr[$i];
    }
}
sub drewk2 {
   my $i=0;
   my @found;
   foreach(split(/\n/, $scalar)) {
      next if $i++ % 10;
#      print "drewk2[$i] $_\n";
      push @found, $_;
   }
}
sub schwern {
    my $count = 0;
    my @found;
    while($scalar =~ /\G(.*)\n/g) {
        next if $count++ % 10 != 0;
#       print "schwern[$count] $1\n";
        push @found, $1;
        }
    }

sub chas {
    open my $fh, "<", \$scalar;

    tie my @lines, "Tie::File", $fh
        or die "could not tie in-memory file: $!";

    my $i = 0;
    my @found = ();
    while (defined $lines[$i]) {
        # print "chas[$i]: $lines[$i]\n";
        push @found, $lines[$i];
        } continue {
            $i += 10;
        }   
    }

sub chas_modified {
    open my $fh, "<", \$scalar;

    tie my @lines, "Tie::File", $fh
        or die "could not tie in-memory file: $!";

    my $highest_multiple = int( $#lines / 10 ) ;
    my @found = @lines[ map { $_ * 10  - ($_?1:0) } 0 .. $highest_multiple ]; 
    #print join "\n", @found;
    }

sub chas_sane {
    open my $fh, "<", \$scalar;

    my @found;
    while (my $line = <$fh>) {
        if ($. == 1 or not $. % 10) {
            #print "chas_sane[$.] $line";
            push @found, $line;  # the loop reads into $line, not $_
            }
        }
    }

sub brian {
    open my $fh, '<', \$scalar;
    my @found = scalar <$fh>;
    while( <$fh> ) {
        next if $. % 10;
        #print "brian[$.] $_";
        push @found, $_;
        }
    }
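
The off-by-one difference mentioned before the benchmark program can be seen on a small input. This is only an illustration with an assumed line count of 30. It compares the rule based on 1-based line numbers (as with $.) against the rule based on a 0-based counter.

```perl
use strict;
use warnings;

my $line_count = 30;   # assumed input size, for illustration only

# Rule used with $. (1-based line numbers): line 1, then every multiple of 10.
my @by_line_number = grep { $_ == 1 or $_ % 10 == 0 } 1 .. $line_count;

# Rule used with a 0-based counter: every index that is a multiple of 10,
# shown here as the corresponding 1-based line number.
my @by_index = map { $_ + 1 } grep { $_ % 10 == 0 } 0 .. $line_count - 1;

print "by line number: @by_line_number\n";
print "by index:       @by_index\n";
```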


If Schwern's comment is correct that your "list of text" means it's in a $scalar, one way to fix that is with Perl's split. You can then use the code that you have written, thus:

sub drewk {
   my @arr = split(/\n/, $scalar);
   for(my $i=0; $i<=$#arr; $i+=10){
       #print $arr[$i],"\n";
    }
}

Rather than use a C-style loop, you can write a very readable Perl idiom that does the same thing and is also faster:

sub drewk2 {
   my $i=0;
   my @found;
   foreach(split(/\n/, $scalar)) {
      next if $i++ % 10;
      #print "$_\n";
      push @found, $_;
   }
}

Plugging those into brian's benchmark, you get very competitive results:

                 Rate    Chas. Chas. modified Schwern    drewk    brian   drewk2
Chas.          86.1/s       --           -37%    -95%     -95%     -96%     -96%
Chas. modified  136/s      59%             --    -92%     -92%     -93%     -94%
Schwern        1695/s    1869%          1142%      --      -3%     -14%     -22%
drewk          1754/s    1939%          1186%      4%       --     -11%     -19%
brian          1961/s    2178%          1337%     16%      12%       --     -10%
drewk2         2174/s    2426%          1493%     28%      24%      11%       --

(This is on an iMac with a 2.93 GHz Intel Core i7 and Perl 5.10.)

You didn't post the context of code leading up to your posted loop. Perhaps you did something like this:

   $scalar="line 1\nline 2\n ... line n";

   push @arr, $scalar;
   #or
   $arr[0]=$scalar;

thinking that the \n would cause the lines to end up in different array elements? Post context next time...
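
A small sketch of that guess, using an assumed three-line string: pushing the scalar gives a single array element no matter how many newlines it contains, while split gives one element per line.

```perl
use strict;
use warnings;

my $scalar = "line 1\nline 2\nline 3\n";   # sample data, an assumption

my @arr;
push @arr, $scalar;                  # one element holding all three lines
my @split = split /\n/, $scalar;     # three elements, newlines removed

printf "push:  %d element(s)\n", scalar @arr;
printf "split: %d element(s)\n", scalar @split;
```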

----Edit:

The original post asks, "How can I print 1st, 10th, 20th... lines (not array index) number in a long list of text." If by "long list of text" you mean megabytes or gigabytes, use brian's or Chas.'s filehandle approach. It is slick and fast, and the data will not be duplicated in memory. If the "long list of text" is small enough that RAM is plentiful, you can use split, /\n/g, etc., or whatever makes sense to you and the data.
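
For the single-line case in the title, the same filehandle approach can stop reading as soon as the wanted line is reached. The following is a minimal sketch. The helper name nth_line is hypothetical and not from any of the answers, and the demo reuses the in-memory filehandle trick from the earlier answer so that it runs without a file on disk.

```perl
use strict;
use warnings;

# Hypothetical helper: return line $n (1-based) from $source, which may be
# a file name or a reference to a scalar holding the text. Returns undef
# (in scalar context) if there are fewer than $n lines.
sub nth_line {
    my ($source, $n) = @_;
    open my $fh, "<", $source
        or die "could not open $source: $!";
    while (my $line = <$fh>) {
        return $line if $. == $n;   # $. is the current line number of $fh
    }
    return;                         # ran out of lines
}

# Demo on an in-memory file: 26 lines, "a" through "z".
my $s = join "", map { "$_\n" } "a" .. "z";
my $tenth = nth_line(\$s, 10);
print $tenth if defined $tenth;
```

Only one line is held in memory at a time, so this works the same way on a large file passed by name.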


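my $lineno = 10;
open my $fh, "<", "filename.txt"
    or die "could not open filename.txt: $!";
my @arr = <$fh>;
# Array indices start at 0, so line $lineno is at index $lineno - 1.
print $arr[$lineno - 1];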