
Help writing flexible splits, perl

A couple weeks ago I posted a question about trouble I was having parsing an irregularly-formatted data file. Here's a sample of the data:

01-021412 15/02/2007  207,000.00 14,839.00  18       -6     2     6     6     5    16     6     4     4     3   -28   -59   -88  -119
                                                     -149  -191  -215  -246             
     Atraso Promedio --->        2.88

I need a program that extracts 01-021412 and 18, counts and sums all the numbers in the subsequent series, stores the atraso promedio, and repeats this operation for over 40,000 entries. I received a very helpful response, and from that was able to write the code:

use strict;
use warnings;

#Create an output file
open(OUT, ">outFull.csv");
print OUT "loanID,nPayments,atrasoPromedio,atrasoAlt,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72\n";

open(MYINPUTFILE, "<DATOS HISTORICO ASPIRE2.txt");

my @payments;
my $numberOfPayments;
my $loanNumber;

while(<MYINPUTFILE>)
{
    if(/\b\d{2}-\d{6}\b/)
    {
        ($loanNumber, undef, undef, undef, $numberOfPayments, @payments) = split;
    }
    elsif(m/---> *(\d*.\d*)/)
    {
        my (undef, undef, undef, $atrasoPromedio) = split;
        my $N = scalar @payments;
        print "$numberOfPayments,$N,$loanNumber\n";

        if($N==$numberOfPayments){

        my $total = 0; 
        ($total+=$_) for @payments; 

        my $atrasoAlt = $total/$N; 

        print OUT "$loanNumber,$numberOfPayments,$atrasoPromedio,$atrasoAlt,",join( ',', @payments),"\n";
       }
    }
    else
    {
        push(@payments, split);
    }
}

This would work fine, except for the fact that about 50 percent of entries include an '*' as follows:

* 01-051948 06/03/2009  424,350.00 17,315.00  48        0     6    -2     0    21    10     9    13    10     9     7    13     3     4
                                                        12    -3    14     8     6
       Atraso Promedio --->        3.02

The asterisk causes the program to fail because it interrupts the split pattern, causing incorrect variable assignments. Until now I've dealt with this by removing the asterisks from the input data file, but I just realized that by doing this the program actually omits these loans altogether. Is there an economical way to modify my script so that it handles entries with and without asterisks?

As an aside, if an entry does include an asterisk I would like to record this fact in the output data.

Many thanks in advance, Aaron


Use an intermediate array:

my $has_asterisk;

# ...

if(/\b\d{2}-\d{6}\b/)
{
    my @fields = split;
    $has_asterisk = $fields[0] eq '*';
    shift @fields if $has_asterisk;
    ($loanNumber, undef, undef, undef, $numberOfPayments, @payments) = @fields;
}
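To show the intermediate-array idea end to end, here is a minimal sketch that wraps it in a helper and runs it on shortened versions of the two sample header lines. The sub name parse_header_line and the hash keys are made up for illustration; they are not part of the original script.

```perl
use strict;
use warnings;

# Hypothetical helper: parse one loan header line, with or without a leading '*'.
sub parse_header_line {
    my ($line) = @_;
    my @fields = split ' ', $line;    # whitespace split, leading blanks ignored
    my $has_asterisk = (@fields && $fields[0] eq '*') ? 1 : 0;
    shift @fields if $has_asterisk;   # drop the marker so the positions line up
    my ($loan_number, undef, undef, undef, $n_payments, @payments) = @fields;
    return {
        loan_id      => $loan_number,
        n_payments   => $n_payments,
        payments     => \@payments,
        has_asterisk => $has_asterisk,
    };
}

# Shortened sample lines (only the first three payments are kept).
my $plain   = parse_header_line('01-021412 15/02/2007  207,000.00 14,839.00  18  -6  2  6');
my $starred = parse_header_line('* 01-051948 06/03/2009  424,350.00 17,315.00  48  0  6  -2');

print "$plain->{loan_id},$plain->{n_payments},$plain->{has_asterisk}\n";
print "$starred->{loan_id},$starred->{n_payments},$starred->{has_asterisk}\n";
```

The has_asterisk value can then be written as one more column in the CSV row.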


You could discard the asterisk before doing the split:

while(<MYINPUTFILE>) {
    s/^\s*\*\s*//;

    if(/\b\d{2}-\d{6}\b/) {
        ($loanNumber, undef, undef, undef, $numberOfPayments, @payments) = split;
    ...    

Apart from this, you should use the three-argument form of open, use lexical filehandles, and check open for failure:

my $file = 'DATOS HISTORICO ASPIRE2.txt';
open my $MYINPUTFILE, '<', $file or die "unable to open '$file' for reading : $!";
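The same pattern applies to the output file. The sketch below is only an illustration: it uses a temporary file from File::Temp in place of outFull.csv so that it runs anywhere, and the shortened header row is an assumption.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# A temporary file stands in for outFull.csv in this sketch.
my ($tmp_fh, $out_file) = tempfile(SUFFIX => '.csv', UNLINK => 1);
close $tmp_fh;

# Three-argument open, lexical filehandle, and a check for failure.
open my $out, '>', $out_file
    or die "unable to open '$out_file' for writing: $!";
print {$out} "loanID,nPayments,atrasoPromedio\n";
close $out
    or die "unable to close '$out_file': $!";

# Read the line back the same way.
open my $in, '<', $out_file
    or die "unable to open '$out_file' for reading: $!";
my $header = <$in>;
close $in;

print $header;
```

With lexical handles, reads become <$in> and writes become print {$out} ..., replacing the bareword forms <MYINPUTFILE> and print OUT.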


It looks like the regex in your first if statement does not account for the '*', so how about we modify it. My Perl regex skills are a little rusty, so note that this is untested.

if(/(?:\* )?\b\d{2}-\d{6}\b/)

* is a quantifier meaning "zero or more times", so we need to escape it as \* to match a literal asterisk.

(?: ) means "group this together but don't save it", I just use that so I can apply the ? to both the space and * at the same time
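One caveat: the original pattern is unanchored, so it already matches lines that begin with an asterisk, and the optional group by itself changes nothing about the split. A variation, sketched below, is to anchor the pattern and capture the asterisk so the match also reports whether it was present. The names $header_re and match_header are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical pattern: optional '*' marker, then the loan ID, anchored at line start.
my $header_re = qr/^\s*(\*)?\s*(\d{2}-\d{6})\b/;

# Returns (flag, loan ID) for a header line, or an empty list otherwise.
sub match_header {
    my ($line) = @_;
    return unless $line =~ $header_re;
    return (defined $1 ? 1 : 0, $2);
}

my ($star_flag,  $star_id)  = match_header('* 01-051948 06/03/2009  424,350.00');
my ($plain_flag, $plain_id) = match_header('01-021412 15/02/2007  207,000.00');
my @no_match                = match_header('     Atraso Promedio --->        2.88');

print "$star_id,$star_flag\n";     # 01-051948,1
print "$plain_id,$plain_flag\n";   # 01-021412,0
print scalar(@no_match), "\n";     # 0
```

Here a capturing group is used for the asterisk instead of (?: ), because the point is to remember whether it matched.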


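Declare a flag before the while loop and set it at the top of the loop body:

...
my $asterisk_exists = 0;    # declared outside the loop so it survives across a record's lines
while(<MYINPUTFILE>)
{
    if (/\b\d{2}-\d{6}\b/) {
        # New record: note the asterisk and strip it before the split
        $asterisk_exists = s/^\s*\*\s*// ? 1 : 0;
    }
...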

Besides removing the asterisk with the s/// operator, this also records whether the asterisk was there in the first place. With the asterisk removed, the rest of your script should work as before.
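Since each record spans several lines, the flag has to survive from the header line until the "Atraso Promedio" line, where the row is written. Here is a rough, self-contained sketch of the whole loop under that assumption. It reads the two sample records from an in-memory filehandle, and the sub name parse_records and the column order are invented for the example.

```perl
use strict;
use warnings;

# Hypothetical helper: returns CSV rows of the form
# loanID,hasAsterisk,nPayments,paymentsFound,atrasoPromedio
sub parse_records {
    my ($fh) = @_;
    my ($loan_number, $n_payments, $has_asterisk);
    my (@payments, @rows);

    while (my $line = <$fh>) {
        if ($line =~ /^\s*(\*)?\s*\d{2}-\d{6}\b/) {
            # Header line: set the flag once per record, then strip the marker.
            $has_asterisk = defined $1 ? 1 : 0;
            $line =~ s/^\s*\*\s*//;
            ($loan_number, undef, undef, undef, $n_payments, @payments) = split ' ', $line;
        }
        elsif ($line =~ /--->\s*(\d+(?:\.\d+)?)/) {
            # Summary line: the flag from the header line is still set here.
            my $atraso = $1;
            next unless defined $loan_number;
            push @rows, join ',', $loan_number, $has_asterisk, $n_payments,
                                  scalar(@payments), $atraso;
        }
        else {
            push @payments, split ' ', $line;    # continuation line of payments
        }
    }
    return @rows;
}

my $data = <<'END';
01-021412 15/02/2007  207,000.00 14,839.00  18       -6     2     6     6     5    16     6     4     4     3   -28   -59   -88  -119
                                                     -149  -191  -215  -246
     Atraso Promedio --->        2.88
* 01-051948 06/03/2009  424,350.00 17,315.00  48        0     6    -2     0    21    10     9    13    10     9     7    13     3     4
                                                        12    -3    14     8     6
       Atraso Promedio --->        3.02
END

open my $fh, '<', \$data or die "unable to open in-memory data: $!";
my @rows = parse_records($fh);
close $fh;

print "$_\n" for @rows;
```

On the sample data this prints 01-021412,0,18,18,2.88 and 01-051948,1,48,19,3.02. The second sample record as posted lists only 19 of its 48 payments, which is why the two counts differ there.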
