Help with Perl code to parse a file

2023-03-17 06:24

I am new to Perl and have a question about the syntax. I received this code for parsing a file containing specific information. I was wondering what the if (/DID/) part of the subroutine get_number is doing? Is this leveraging regular expressions? I'm not quite sure because regular-expression matches look like $_ =~ /some expression/. Finally, is the while loop in the get_number subroutine necessary?

#!/usr/bin/env perl

use Scalar::Util qw/ looks_like_number /;
use WWW::Mechanize;

# store the names of all the OCR files in an array
my @file_list=qw{
   blah.txt
};

# set the scalar index to zero
my $file_index=0;

# open the file titled 'outputfile.txt' and write to it
# (or indicate that the file can't be opened)
open(OUT_FILE, '>', 'outputfile.txt')
    or die "Can't open output file\n";

while($file_index < 1){
    # open the OCR file and store it in the filehandle IN_FILE
    open(IN_FILE, '<', "$file_list[$file_index]")
        or die "Can't read source file!\n";

    print "Processing file $file_list[$file_index]\n";
    while(<IN_FILE>){
            my $citing_pat=get_number();
            get_country($citing_pat);
    }
    $file_index=$file_index+1;
}
close IN_FILE;
close OUT_FILE;

The definition of get_number is below.

sub get_number {
    while(<IN_FILE>){
        if(/DID/){
            my @fields=split / /;
            chomp($fields[3]);
            if($fields[3] !~ /\D/){
                return $fields[3];
            }
        }
    }
}

Perl has a special variable, $_, that serves as the default argument for many operators and functions.

In get_number, while(<IN_FILE>){ reads a line into $_, and the next line checks whether $_ matches the regular expression DID.

It's also common to see a bare chomp;, which likewise operates on $_ when no argument is given.
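To make the equivalence concrete, here is a minimal sketch comparing the implicit form with the explicit =~ form. The sample lines and the helper names implicit_match and explicit_match are made up for illustration; the real input file was not shown.

```perl
use strict;
use warnings;

# Implicit form: the match operator looks at $_ when no =~ is given.
sub implicit_match {
    local $_ = shift;
    return /DID/ ? 1 : 0;
}

# Explicit form: the same test bound to a named variable with =~.
sub explicit_match {
    my ($line) = @_;
    return $line =~ /DID/ ? 1 : 0;
}

# Hypothetical sample lines; the real input file was not shown.
for my $line ("A DID 7 12345\n", "nothing here\n") {
    chomp(my $text = $line);    # chomp a copy, leave $line untouched
    printf "%-15s implicit=%d explicit=%d\n",
        $text, implicit_match($line), explicit_match($line);
}
```

Both helpers should agree on every line, which is all that if (/DID/) relies on.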

In that case, if (/DID/) searches the $_ variable by default, so it is correct. However, it is a rather loose regex, IMO: it matches any line that contains the letters DID anywhere, even inside a longer word.
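As a rough illustration of what "loose" means here, the sketch below compares /DID/ with a word-bounded variant. It assumes DID appears as a standalone token in the real data, which the question does not confirm; the sample lines and helper names are hypothetical.

```perl
use strict;
use warnings;

# The pattern from the question: matches DID anywhere in the line.
sub loose_match   { my ($line) = @_; return $line =~ /DID/     ? 1 : 0; }

# A stricter variant (an assumption, not the asker's format):
# DID must stand alone as a word.
sub bounded_match { my ($line) = @_; return $line =~ /\bDID\b/ ? 1 : 0; }

# Hypothetical lines; the actual file format is not shown in the question.
my @samples = (
    "xx DID 12 345",           # DID as a standalone token
    "THE CANDIDATE 12 345",    # DID only inside another word
);

for my $line (@samples) {
    printf "%-22s loose=%d word-bounded=%d\n",
        $line, loose_match($line), bounded_match($line);
}
```

Whether the tighter pattern is appropriate depends entirely on what the input lines look like.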

The while loop in the sub may be necessary; it depends on what your input looks like. Be aware, though, that the two nested while loops cause some lines to be skipped completely.

The while loop in the main program reads one line and does nothing with it. This means that the first line in the file, and every line directly following a matching line (i.e. a line that contains "DID" and whose 4th field is a number), is discarded.
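The skipping can be reproduced with a small sketch that mimics the two nested loops on an in-memory file. The four sample lines and the helper run_nested are invented for the demonstration; get_number follows the logic from the question but takes a lexical filehandle.

```perl
use strict;
use warnings;

# Hypothetical input; the 4th field holds the number, as in the question.
my $data = join '',
    "a DID b 111\n",
    "a DID b 222\n",
    "a DID b 333\n",
    "a DID b 444\n";

# Same logic as get_number in the question, on a lexical filehandle.
sub get_number {
    my ($fh) = @_;
    while (my $line = <$fh>) {
        next unless $line =~ /DID/;
        my @fields = split / /, $line;
        chomp $fields[3];
        return $fields[3] if $fields[3] !~ /\D/;
    }
    return;
}

# Mimic the main program: the outer loop reads a line, then calls get_number.
sub run_nested {
    my ($text) = @_;
    open(my $fh, '<', \$text) or die "Can't open in-memory file: $!";
    my (@skipped, @found);
    while (my $line = <$fh>) {          # outer loop consumes a line...
        chomp $line;
        push @skipped, $line;           # ...that get_number never sees
        my $number = get_number($fh);   # inner loop reads the lines after it
        push @found, $number if defined $number;
    }
    return (\@skipped, \@found);
}

my ($skipped, $found) = run_nested($data);
print "found:   @$found\n";
print "skipped: @$skipped\n";
```

With this sample, only 222 and 444 are returned; the lines holding 111 and 333 are consumed by the outer loop and never examined.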

In order to answer that question properly, we'd need to see the input file.

There are a number of issues with this code, and if it works as intended, it's probably due to a healthy amount of luck.

Below is a cleaned-up version of the code. I kept the modules, since I do not know whether they are used elsewhere. I also kept the output file, since it might be used in code you have not shown. This version never passes an undefined value to get_country; if it finds no suitable number, it simply does nothing.

use warnings;
use strict;
use Scalar::Util qw/ looks_like_number /;
use WWW::Mechanize;

my @file_list=qw{ blah.txt };

open(my $out_file, '>', 'outputfile.txt') or die "Can't open output file: $!";

for my $file (@file_list) {
    open(my $in_file, '<', $file) or die "Can't read source file: $!";
    print "Processing file $file\n";
    while (defined(my $citing_pat = get_number($in_file))) {
        get_country($citing_pat);
    }
}
close $out_file;

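sub get_number {
    my $fh = shift;
    while (<$fh>) {
        if (/DID/) {
            # bare split breaks $_ on whitespace; take the 4th field
            my $field = (split)[3];
            # accept the field only if it consists entirely of digits
            if (defined $field && $field =~ /^\d+$/) {
                return $field;
            }
        }
    }
    return;    # no suitable number found
}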
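One detail worth noting in the cleaned-up sub: it uses a bare split, which splits on runs of whitespace, where the original used split / /, which splits on each single space. The sketch below shows the difference on a made-up line with a doubled space; the helper names are hypothetical.

```perl
use strict;
use warnings;

# 4th field when splitting on every single space, as in the original code.
sub field_single_space {
    my ($line) = @_;
    my @fields = split / /, $line;
    return $fields[3];
}

# 4th field when splitting on runs of whitespace, which is what a bare
# split does (it is equivalent to split ' ', $_).
sub field_whitespace {
    my ($line) = @_;
    my @fields = split ' ', $line;
    return $fields[3];
}

# Hypothetical line with a doubled space before the number.
my $line = "a DID b  12345\n";

printf "split / / gives: '%s'\n", field_single_space($line);
printf "split ' ' gives: '%s'\n", field_whitespace($line);
```

On this line the single-space split yields an empty 4th field, which the original !~ /\D/ test would accept because an empty string contains no non-digit. The /^\d+$/ test in the cleaned-up version rejects it.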
