开发者

Looping through a dataset and handling missing values

I am looping through a big data file and like to detect the type of variable in each column, eg if it is an Intenger or a Float etc. It works perfectly fine, however, at the moment it is still very basic and I like to add another idea. So far the declaration of the variable is based on the second row of the data set. (The first one is used as the header.) Here is the beginning of the code:

#!/usr/bin/perl

use warnings;
use diagnostics;
use Getopt::Std;

getopts("i:s:t:") or die "bad options: $!";

if($opt_i) {
open INFILE, "< $opt_i";
chomp($headerline = <INFILE>);
$second = <INFILE>;
} else {
die "the input file has to be given\n";
}

if($opt_t) {
$tablename = $opt_t;
} else {
$tablename = $opt_i;
$tablename =~ s/\.\w+//;
}

if($opt_s) {
$sep = $opt_s;
} else {
$sep = ",";
}

$headerline =~ s/\"//g;
$headerline =~ s/\./\_/g;
@header = split/$sep/, $headerline;

$second =~ s/\"//g ;
@second = split/$sep/, $second;
@terms = split/$sep/, $second;
@types = split/$sep/, $second;

And now I have implemented a small loop. The problem is that I don't know how to handle the missing values which are declared with NULL. At the moment the loop simply assigns "" i.e. nothing to the variable $vartype[$j].

$j = 0;
while($j <= $#second) {
if ($types[$j] =~ /NULL/) {
$vartype[$j] = "";
} elsif($types[$j] =~ /[A-Za-z]/) {
$vartype[$j] = "varchar";
} elsif ($types[$j] =~ /\./) {
$vartype[$j] = "double";
} else {
$vartype[$j] = "int";
}
$j++;
}

So how can I implement another loop structure into the existing loop so that whenever I have a NULL value in one column the loop reads the next value in that same column and does so until he finds a number or a word.

A sample of my data would be eg:

Country.Name        Time.Name  AG.LND.AGRI.ZS   NY.GDP.MKTP.CD   NE.IMP.GNFS.ZS
Brunei Darussalam   1960       NULL             1139121335.16    3.46
Brunei Darussalam   1960       NULL             1677595756.64    0.9
Brunei Darussalam   1960       NULL             1488339328.59    4.19
Brunei Darussalam   1961       3.98             1869828587.8     3.14
Brunei Darussalam   1961       3.98             2346769422.22    3.38
Brunei Darussalam   1961       3.98             2363109706.3     3.17

As already mentioned the 开发者_开发技巧for loop only uses the second row to decide on the type of the variable.

Now I would like to implement another loop so that eg in the third column (AG.LND.AGRI.ZS) he goes through the column until he detects the first real value, in this case 3.98. At the moment the loop recognizes the missing value marked with NULL and just assigns an empty value.


Stop programming like C.

for my $variable (@types) {
  if ($variable =~ /NULL/) {
    push(@vartype, undef);
  }
  elsif ($variable =~ /[A-Za-z]/) {
    push(@vartype, "varchar");
  }
  elsif ($variable =~ /\./) {
    push(@vartype, "double";
  }
  else {
    push(@vartype, "int");
  }
}

Although, for perl, you should really be storing related data in a datastructure of hashes. Something like:

my $data = [ { value => 'NULL', type => undef },
             { value => 'a string', type => 'varchar' },
             { value => 9.5, type => 'double'},
             { value => 30, type => 'int'},
           ];


I am having a hard time figuring out what you are trying to do. Assuming you are trying to guess column types based on column contents, here is a way to do it. The important thing to do is not to set anything when the field is NULL, skip a field if you have already decided its type, and get out of the loop once all field types have been determined.

#!/usr/bin/perl

use strict; use warnings;
use Scalar::Util qw(looks_like_number);

my @names = split ' ', scalar <DATA>;
my @types;

while ( <DATA> ) {
    chomp;
    my @values = split / {2,}/;

    for my $i ( 0 .. $#values ) {
        next if defined $types[$i];
        my $val = $values[$i];
        next if $val eq 'NULL';
        if ( $val =~ /^[0-9]+\z/ ) {
            $types[$i] = 'int';
        }
        elsif ( $val =~ /^[0-9.]+\z/
                and looks_like_number($val) ) {
            $types[$i] = 'double';
        }
        else {
            $types[$i] = 'varchar';
        }
    }
    last unless grep { not defined } @types;
}

print "$_\n" for @types;


__DATA__
Country.Name        Time.Name  AG.LND.AGRI.ZS   NY.GDP.MKTP.CD   NE.IMP.GNFS.ZS
Brunei Darussalam   1960       NULL             1139121335.16    3.46
Brunei Darussalam   1960       NULL             1677595756.64    0.9
Brunei Darussalam   1960       NULL             1488339328.59    4.19
Brunei Darussalam   1961       3.98             1869828587.8     3.14
Brunei Darussalam   1961       3.98             2346769422.22    3.38
Brunei Darussalam   1961       3.98             2363109706.3     3.17

Output:

varchar
int
double
double
double
0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜