help with perl code to parse a file
I am new to Perl and have a question about the syntax. I received this code for parsing a file containing specific information. I was wondering what the if (/DID/)
part of the subroutine get_number
is doing? Is this leveraging regular expressions? I'm not quite sure because regular-expression matches look like $_ =~ /some expression/
. Finally, is the while loop in the get_number
subroutine necessary?
#!/usr/bin/env perl
use Scalar::Util qw/ looks_like_number /;
use WWW::Mechanize;
# store the name of all the OCR fi开发者_开发百科le names in an array
my @file_list=qw{
blah.txt
};
# set the scalar index to zero
my $file_index=0;
# open the file titled 'outputfile.txt' and write to it
# (or indicate that the file can't be opened)
open(OUT_FILE, '>', 'outputfile.txt')
or die "Can't open output file\n";
while($file_index < 1){
# open the OCR file and store it in the filehandle IN_FILE
open(IN_FILE, '<', "$file_list[$file_index]")
or die "Can't read source file!\n";
print "Processing file $file_list[$file_index]\n";
while(<IN_FILE>){
my $citing_pat=get_number();
get_country($citing_pat);
}
$file_index=$file_index+1;
}
close IN_FILE;
close OUT_FILE;
The definition of get_number
is below.
sub get_number {
while(<IN_FILE>){
if(/DID/){
my @fields=split / /;
chomp($fields[3]);
if($fields[3] !~ /\D/){
return $fields[3];
}
}
}
}
Perl has a variable $_
that is sort of the default dumping ground for a lot of things.
In get_number
, while(<IN_FILE>){
is reading a line into $_
, and the next line is checking if $_
matches the regular expression DID
.
It's also common to see chomp;
which also operates on $_
when no argument is given.
In that case, if (/DID/)
by default searches the $_
variable, so it is correct. However, it is a rather loose regex, IMO.
The while loop in the sub may be necessary, it depends on what your input looks like. You should be aware that the two while loops will cause some lines to get completely skipped.
The while loop in the main program will take one line, and do nothing with it. Basically, this means that the first line in the file, and every line directly following a matching line (e.g. a line that contains "DID" and the 4th field is a number), will also be discarded.
In order to answer that question properly, we'd need to see the input file.
There are a number of issues with this code, and if it works as intended, it's probably due to a healthy amount of luck.
Below is a cleaned up version of the code. I kept the modules in, since I do not know if they are used elsewhere. I also kept the output file, since it might be used somewhere you have not shown. This code will not attempt to use undefined values for get_country
, and will simply do nothing if it does not find a suitable number.
use warnings;
use strict;
use Scalar::Util qw/ looks_like_number /;
use WWW::Mechanize;
my @file_list=qw{ blah.txt };
open(my $outfile, '>', 'outputfile.txt') or die "Can't open output file: $!";
for my $file (@file_list) {
open(my $in_file, '<', $file) or die "Can't read source file: $!";
print "Processing file $file\n";
while (my $citing_pat = get_number($in_file)) {
get_country($citing_pat);
}
}
close $out_file;
sub get_number {
my $fh = shift;
while(<$fh>) {
if (/DID/) {
my $field = (split)[3];
if($field =~ /^\d+$/){
return $field;
}
}
}
return undef;
}
精彩评论