开发者

perl text::csv - filtering specific columns in a csv document and discarding others

I would like to filter out particular columns with a regex and discard others. For example, if I had the following column names:

date mem_total cpu.usagemhz.average_0 cpu.usagemhz.average_1 cpu.usagemhz.average_2

I would like to capture only columns that begin with "cpu.usage.mhz.average"

Is their a particular function of text::csv that will help me do a quick check of the column names?

Thanks! JD

* Update **

I tried jimtut answer and it is extremely close to what I am looking for. Thanks Again Everyone!

Here is the code from jimtut with one small edit on the print statement at the bottom. I added the print $colCount just to see what was going on with the data;

use Text::CSV;

my $file = "foo.csv";
my $pattern = ".*In";
open(F, $file) or warn "Warning! Unable to open $file\n";

my $lineCount = 0;
my %desiredColumns;
while(<F>) {
  $lineCount++;
  my $csv = Text::CSV->new();
  my $status = $csv->parse($_); # should really check this!
  my @fields = $csv->fields();
  my $colCount = 0;

  if ($lineCount == 1) {
    # Let's look at the column headings.
    foreach my $field (@fields) {
      $colCount++;
      if ($field =~ m/$pattern/) {
        # This heading matches, save the column #.
        $desiredColumns{$colCount} = 1;
      }
    }
  }
  else {
    # Not the header row.  Parse the body of the file.
    foreach my $field (@fields) {
      $colCount++;
      if (exists $desiredColumns{$colCount}) {
        # This is one of the desired columns.
        # Do whatever you want to do with this column!
        print "$colCount\t$field\n";
      }
    }
  }
}
close(F);

Here is the results

co开发者_JAVA技巧lCount |  $field

12      565
13      73
14      36
15      32
16      127
17      40
18      32
19      42
20      171
12      464
13      62
14      32
15      24
16      109
17      21
18      19
19      39
20      150
12      515
13      76
14      28
15      30
16      119
17      15
18      25
19      46
20      169
12      500
13      71
14      30
15      28
16      111
17      20
18      18
19      40
20      167

I would like to add this data to individual arrays or hashes. what do you think? something like...

foreach column { check to see if a hash already exists with that column number. If not then create hash. }

Then go through each field and add the field data to the appropriate hash.

Do you think this is the right way to go about solving this?


No, not a specific function in Text::CSV. I would do something like this:

use Text::CSV;

my $file = "foo.csv";
my $pattern = "cpu.usage.mhz.average.*";
open(F, $file) or die "Unable to open $file: $!\n";

my $lineCount = 0;
my %desiredColumns;
my %columnContents;

while(<F>) {
  $lineCount++;
  my $csv = Text::CSV->new();
  my $status = $csv->parse($_); # should really check this!
  my @fields = $csv->fields();
  my $colCount = 0;

  if ($lineCount == 1) {
    # Let's look at the column headings.
    foreach my $field (@fields) {
      $colCount++;
      if ($field =~ m/$pattern/) {
        # This heading matches, save the column #.
        $desiredColumns{$colCount} = 1;
      }
    }
  }
  else {
    # Not the header row.  Parse the body of the file.
    foreach my $field (@fields) {
      $colCount++;
      if (exists $desiredColumns{$colCount}) {
        # This is one of the desired columns.
        # Do whatever you want to do with this column!
        push(@{$columnContents{$colCount}}, $field);
      }
    }
  }
}
close(F);

foreach my $key (sort keys %columnContents) {
  print "Column $key: " . join(",", @{$columnContents{$key}}) . "\n\n";
}

Hope that helps! I'm sure someone can write that in a Perl one-liner, but that's easier (for me) to read...


Since your fields of interest are at index 2-4, we'll just pluck those out of the field array returned by getline(). This sample code prints them but you can do whatever you like to them.

use Text::CSV;                                     # load the module
my $csv = Text::CSV->new ();                       # instantiate
open $fh, "<somefile";                             # open the input
while ( my $fields = $csv->getline($fh) ) {        # read a line, and parse it into fields
    print "I got @{$fields}[2..4]\n";              # print the fields of interest
}
close ($fh)                                        # close when done


WHY are you trying to do this? Is it to minimize storage? Eliminate processing costs for parsing many un-needed columns?

If the latter, you can't avoid that processing cost. Any solution you come up with would STILL read and parse 100% of the file.

If the former, there are many methods, some are more efficient than the others.

Also, what exactly do you mean "help me do a quick check of the column names?"? If you want to get the column names, there's column_names() method provided you previously set the column names using column_names(getline($fh)).


If you want to only return specific column names in a hash to avid wasting memory on un-needed columns, there's no clear-cut API for that. You can roll your own, or abuse a "bug/feature" of getline_hr() method:

  • For the former (roll your own), you can do something like:

    my $headers = $csv->getline( $fh ); # First line is headers.
    my @headers_keep = map { /^cpu.usage.mhz.average/ ? 1 : 0 } @$headers;
    while ( my $row = $csv->getline( $fh ) ) {
        my $i = 0;
        my @row_new = grep { $headers_keep[$i++] } $@row;
        push @rows, \@row_new;
    }
    

    BUT you can either roll your own OR .

  • You can also use a "feature" of "getline_hr()" which doesn't assign values into a hash if the column name is a duplicate (only the LAST version gets assigned) \

    In your case, for column names: date,mem_total,cpu.usagemhz.average_0,cpu.usagemhz.average_1,cpu.usagemhz.average_2, merely set the column_names array to contain "cpu.usagemhz.average_0" value in the first 2 eements of the array - they will NOT be then saved by getline_hr().

    You can go over the list of columns, find the consecutive range of "not needed" columns, and replace their names with the name of the first needed column follwing that range. The only stiking point is if the "un-needed" range is at the very end of the columns - replace with "JUNK" or something.

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜