
Is there any better way to "grep" from a large file than using `grep` in Perl?

$rvsfile is the path of a file of about 200 MB. I want to count the number of lines that contain $userid. But shelling out to grep inside a while loop seems very slow, so is there a more efficient way to do this? Because $rvsfile is so large, I can't read it into memory with @tmp = <FILEHANDLE>.

while(defined($line = <SRCFILE>))
{
    $line =~ /^([^\t]*)\t/;
    $userid = $1;
    $linenum = `grep '^$userid\$' $rvsfile | wc -l`;
    chomp($linenum);
    print "$userid $linenum\n";
    if($linenum == 0)
    {
        print TARGETFILE "$line";
    }
}

And how can I get the part before the \t in a line without a regex? For example, a line may look like this:

2013123\tsomething

How can I get 2013123 without regex?


Yes, you are forking a shell on each loop iteration, which is slow. You also re-read the entire $rvsfile once for every user, which is far too much work.

  1. Read SRCFILE once and build a list of @userids.
  2. Read $rvsfile once keeping a running count of each user id as you go.

Sketch:

my @userids;

while(<SRCFILE>)
{
    push @userids, $1 if /^([^\t]*)\t/;
}

my $regex = join '|', map quotemeta, @userids;  # quotemeta in case an id contains regex metacharacters
my %count;

while (<RSVFILE>)
{
    ++$count{$1} if /^($regex)$/o;
}

# %count has everything you need...
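
If you then want per-user output like the original loop produced, a possible reporting step (my addition, not part of the sketch above) is:

for my $id (@userids) {
    my $n = $count{$id} // 0;   # ids never seen in RSVFILE count as 0
    print "$id $n\n";
}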


Use hashes:

my %count;
while (<LARGEFILE>) {
    chomp;
    $count{$_}++;
};
# now $count{userid} is the number of occurrences
# of $userid in LARGEFILE

Or if you fear using too much memory for the hash (e.g. you're interested in 6 users, and there are 100K more in the large file), do it another way:

my %count;
while (<SMALLFILE>) {
    /^(.*?)\t/ and $count{$1} = 0;   # key on the captured id, not the whole line
};

while (<LARGEFILE>) {
    chomp;
    $count{$_}++ if defined $count{$_};
};
# now $count{userid} is the number of occurrences
# of $userid in LARGEFILE, *if* userid is in SMALLFILE


You can search for the location of the first \t using index, which will be faster than a regex. You can then use substr to extract the part before it.
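
For example, a minimal sketch of that idea (the sample line is just for illustration):

my $line = "2013123\tsomething";
my $pos  = index $line, "\t";            # position of the first tab, or -1 if there is none
my $userid = $pos >= 0 ? substr($line, 0, $pos) : $line;
print "$userid\n";                       # prints 2013123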

Suggest you benchmark various approaches.
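
A minimal sketch of such a comparison, using the core Benchmark module and assuming you want to weigh the regex capture against index/substr on a sample line:

use Benchmark qw(cmpthese);

my $line = "2013123\tsomething";
cmpthese(-1, {
    regex => sub { my ($id) = $line =~ /^([^\t]*)\t/ },
    index => sub { my $id = substr($line, 0, index($line, "\t")) },
});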


If I read you correctly, you want something like this:

#!/usr/bin/perl

use strict;
use warnings;

my $userid = 1246;
my $count = 0;

my $rsvfile = 'sample';

open my $fh, '<', $rsvfile or die "Can't open $rsvfile: $!";

while(<$fh>) {
  $count++ if /$userid/;
}

print "$count\n";

or even, (and someone correct me if I am wrong, but since grep imposes list context on <$fh>, I think this one does read the whole file in):

#!/usr/bin/perl

use strict;
use warnings;

my $userid = 1246;

my $rsvfile = 'sample';

open my $fh, '<', $rsvfile or die "Can't open $rsvfile: $!";

my $count = grep {/$userid/} <$fh>;

print "$count\n";


If <SRCFILE> is relatively small, you could do it the other way round: read the larger file one line at a time, and check each line against every userid, keeping a count per userid in a hash structure. Something like:

my %userids = map {($_, 0)}                # use as hash key with init value of 0
              grep {$_}                    # only keep successful matches
              map {/^([^\t]+)/} <SRCFILE>; # extract ID

while (defined($line = <LARGEFILE>)) {
    for (keys %userids) {
        ++$userids{$_} if $line =~ /\Q$_\E/; # \Q...\E escapes special chars in $_
    }
}

This way, only the smaller data is read repeatedly and the large file is scanned once. You end up with a hash of every userid, and the value is the number of lines it occurred in.


If you have a choice, try it with awk:

awk 'FNR==NR{a[$1];next} { for(i in a) { if ($0 ~ i) { print $0} } } ' $SRCFILE $rsvfile
