Efficiency problem while building CSV in Perl

I've written a script in Perl which takes an XML file, parses it, and builds a CSV. Taking the XML, parsing it, and sorting it all goes smoothly, but once I get into larger datasets (i.e. building a CSV with 10000 rows and 260 columns) the script takes a huge amount of time (~1 hr) building the CSV string. I understand that Perl probably isn't the best at string concatenation, but I would have thought it would be more efficient than this.

Basically, for the sake of sorting, I have two hashes of arrays. One hash contains the arrays I use for sorting. The other hash contains arrays for all of the other columns (columns I want to write into the CSV but which have no relevance to how I sort). So my problem code (and I have verified this is the code block taking forever) is like this:

  my $csv = "Header1, Header2, Header3, Header4,...,HeaderN-1,HeaderN\n";
  foreach my $index (@orderedIndecies) {
    my @records = @{$primaryFields{"Important Field 1"}};
    $csv .= $records[$index] ? "$records[$index]," : ",";
    $csv .= $primaryIndex[$index] >= 0 ? "$primaryIndex[$index]," : ",";
    @records = @{$primaryFields{"Important Field 2"}};
    $csv .= $records[$index] ? "$records[$index]," : ",";
    foreach my $key (@keys) {
      @records = @{$csvContent{$key}};
      if($key eq $last) {
        $csv .= $records[$index] ? "$records[$index]\n" : "\n";
      } else {
        $csv .= $records[$index] ? "$records[$index]," : ",";
      }
    }
  }

I have also tried the same thing using join instead of ".=". I've also tried forgoing the string aggregation altogether and writing directly to a file. Neither of these seemed to help much. I'll be the first to admit that my knowledge of memory management in Perl probably isn't the greatest, so please feel free to school me (constructively). Also, if you think this is something I should consider rewriting outside of Perl, please let me know.

EDIT: Some sample xml (please keep in mind I'm not in the position to edit the structure of the xml):

<fields>
  <field>
    <Name>IndicesToBeSorted</Name>
    <Records>idx12;idx14;idx18;...idxN-1;idxN</Records>
  </field>
  <field>
    <Name>Important Field1</Name>
    <Records>val1;val2;;val4;...;valn-1;valn</Records>
  </field>
  <field>
    <Name>Important Field2</Name>
    <Records>val1;val2;;val4;...;valn-1;valn</Records>
  </field>
  <field>
    <Name>Records...</Name>
    <Records>val1;val2;;val4;...;valn-1;valn</Records>
  </field>
  <field>
    <Name>More Records...</Name>
    <Records>val1;val2;;val4;...;valn-1;valn</Records>
  </field>
</fields>

The position of a record in one field corresponds to the position in another field. For example, the first item from each "Records" element is associated with the others and together they make up one row of my CSV. So basically, my script parses all of this and creates an array of ordered indices (which is what is in @orderedIndecies in my example). @orderedIndecies contains data like...

print "$orderedInecies[0]\n"  #prints index of location of idx0 
print "$orderedInecies[1]\n"  #prints index of location of idx1
print "$orderedInecies[2]\n"  #prints index of location of idx2
print "$orderedInecies[3]\n"  #prints index of location of idx3

I do things this way because the indices come out of the XML out of order, and I didn't want to move all of the data around.
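A minimal sketch of how an index array like this can be built (the names here are illustrative, not from my actual script):

  # Illustrative only: sort the positions 0..N-1 by their key values
  # instead of moving the record arrays themselves around.
  my $indexRecords = "idx12;idx14;idx18";   # stand-in for the IndicesToBeSorted value
  my @sortKeys     = split /;/, $indexRecords;
  my @orderedIndecies =
      sort { $sortKeys[$a] cmp $sortKeys[$b] } 0 .. $#sortKeys;
  # (cmp sorts lexically; keys like "idx12" may need a numeric sort
  # on their suffix instead.)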

EDIT: FINAL ANSWER

  open my $csv_fh, ">", $$fileNameRef or die "$$fileNameRef: $!";
  print $csv_fh "Important Field 1,Index Field,Important Field 2";

  # Defining $comma, $endl, $empty allows me to do something like:
  #
  #                    print $csv_fh $val ? $val : $empty;
  #                    print $csv_fh $comma;
  # 
  # As opposed to....
  #
  #                    print $csv_fh $val ? "$val," : ",";
  #
  # Note, the first method avoids the string aggregation of "$val,"
  my $comma = ",";
  my $endl = "\n";
  my $empty = "";

  my @keys = sort(keys %csvContent);
  my $last = $keys[-1];
  foreach (@keys) {
    print $csv_fh $_;
    print $csv_fh $_ eq $last ? $endl : $comma;
  }

  # Even though the hash lookup is probably very efficient, I still
  # saw no need to redo it constantly, so I defined it here as 
  # opposed to inline within the for loops
  my @ImportantFields1 =  @{$primaryFields{"Important Field 1"}};
  my @ImportantFields2 =  @{$primaryFields{"Important Field 2"}};

  print "\n\n--------- BUILD CSV START ---------------\n\n";
  foreach my $index (@orderedIndecies) {
    print $csv_fh exists $ImportantFields1[$index] ? $ImportantFields1[$index] : $empty;
    print $csv_fh $comma;
    print $csv_fh $originalIndexField[$index] >= 0 ? $originalIndexField[$index] : $empty;
    print $csv_fh $comma;
    print $csv_fh exists $ImportantFields2[$index] ? $ImportantFields2[$index] : $empty;

    #If needed, this is where you would make sure to escape commas
    foreach my $key (@keys) {
      print $csv_fh $comma;
      print $csv_fh exists $csvContent{$key}[$index]
                         ? $csvContent{$key}[$index]
                         : $empty;
    }
    print $csv_fh $endl;
  }  

  print "\n\n------- CSV Contents wrtten to file -----------\n\n"
  close($csv_fh);

Thanks for the help guys :D


I put together a small program that creates an array of 100,000 random strings and another of 260 random strings, then loops over them, concatenating and printing out how long it has taken to reach that part of the loop. For the first 70,000 values concatenated, this is the time my program takes:

On Key 0, has been 00:00:00
On value 0, has been 00:00:00
On value 10000, has been 00:00:05.4373956
On value 20000, has been 00:00:22.3901951
On value 30000, has been 00:00:51.1552678
On value 40000, has been 00:01:31.0138775
On value 50000, has been 00:02:26.4659378
On value 60000, has been 00:03:32.6834164
On value 70000, has been 00:04:48.4788361

As you can see, repeated concatenation is generally not a great way to handle this kind of operation. Even though the amount of data is fixed, the time per append keeps growing, because the ever-longer string has to be copied in memory over and over.
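The original benchmark program isn't shown above (and may not have been Perl); a rough Perl stand-in would look like the sketch below. Note that Perl over-allocates string buffers, so .= may not degrade as sharply as the numbers above suggest:

  use strict;
  use warnings;
  use Time::HiRes qw(time);

  # Illustrative stand-in for the benchmark described above: append
  # 100,000 short random strings with .= and report elapsed time
  # every 10,000 values.
  my @values = map { join '', map { chr(65 + int rand 26) } 1 .. 10 } 1 .. 100_000;

  my $start = time;
  my $out   = '';
  for my $i (0 .. $#values) {
      $out .= "$values[$i],";
      printf "On value %d, has been %.2fs\n", $i, time() - $start
          if $i % 10_000 == 0;
  }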

The alternative is to do what is usually recommended in Perl: write it to disk! Don't do any concatenation at all; write each part, as you get it, to disk. My sample program, rewritten this way, finished its last line at 23.2183042 seconds. It'd look something like this:

# a couple utilities

use File::Temp qw/tempdir/;
use File::Spec::Functions; # catfile

# ... and later

my $dir = tempdir();  # give me a temporary, empty place to put it
my $file = catfile($dir, 'temp.csv');

open my $fh, '>', $file
    or die "Can't open '$file' for write: $!";

print {$fh} "Header1, Header2, Header3, Header4,...,HeaderN-1,HeaderN\n";
foreach my $index (@orderedIndecies) {
    my @records = @{$primaryFields{"Important Field 1"}};
    print {$fh} $records[$index] ? "$records[$index]," : ",";
    print {$fh} $primaryIndex[$index] >= 0 ? "$primaryIndex[$index]," : ",";
    @records = @{$primaryFields{"Important Field 2"}};
    print {$fh} $records[$index] ? "$records[$index]," : ",";
    foreach my $key (@keys) {
        @records = @{$csvContent{$key}};
        if($key eq $last) {
            print {$fh} $records[$index] ? "$records[$index]\n" : "\n";
        } else {
            print {$fh} $records[$index] ? "$records[$index]," : ",";
        }
    }
}

close $fh
    or warn "Can't close '$file' for some reason: $!";

Now there is no concatenation, no memory copying. It should go really, really fast.


If you use the following modules instead of hand-rolled code, you will see at least some speed increase, and I'd expect it to be pretty dramatic: Text::CSV_XS for the CSV writing (and reading, if you need it) and XML::LibXML for the XML parsing. They each use C under the hood.

String concatenation can be a sign of trouble for robustness as well as speed.
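A minimal sketch of that approach, assuming the fields/field/Name/Records layout from the question; the file names and the sorted column order here are placeholders, not requirements of either module:

  use strict;
  use warnings;
  use XML::LibXML;
  use Text::CSV_XS;

  # Parse the question's XML into field-name => array-of-values.
  my $doc = XML::LibXML->load_xml(location => 'input.xml');  # placeholder path
  my %columns;
  for my $field ($doc->findnodes('/fields/field')) {
      my $name = $field->findvalue('Name');
      # The -1 keeps trailing empty values, matching the ";;"-style gaps.
      $columns{$name} = [ split /;/, $field->findvalue('Records'), -1 ];
  }

  # Text::CSV_XS quotes embedded commas and newlines for us.
  my $csv = Text::CSV_XS->new({ binary => 1, eol => "\n" });
  open my $fh, '>', 'out.csv' or die "out.csv: $!";

  my @names = sort keys %columns;     # placeholder column order
  $csv->print($fh, \@names);
  my $rows = @names ? scalar @{ $columns{$names[0]} } : 0;
  for my $i (0 .. $rows - 1) {
      $csv->print($fh, [ map { $columns{$_}[$i] // '' } @names ]);
  }
  close $fh or die "out.csv: $!";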


Is it necessary to assemble the whole output into the one variable $csv? A saner approach would be to use an array containing, say, one element for each record. Then you can just print the array to your output stream or, if you insist, use join to concatenate the array elements into a single scalar.

Within each record, you can also assemble the individual fields into an array and join them at the end:

  my @csv = ("Header1, Header2, Header3, Header4,...,HeaderN-1,HeaderN\n");
  foreach my $index (@orderedIndecies) {
    my @records = @{$primaryFields{"Important Field 1"}};
    my @newRow = ();
    push @newRow, $records[$index] ? $records[$index] : "";
    # alternatively:  push @newRow, $records[$index] || "";
    push @newRow, $primaryIndex[$index]>=0 ? $primaryIndex[$index] : "";
    @records = @{$primaryFields{"Important Field 2"}};
    push @newRow, $records[$index] ? $records[$index] : "";
    foreach my $key (@keys) {
      @records = @{$csvContent{$key}};
      push @newRow, $records[$index] ? $records[$index] : "";
    }
    push @csv, join(",", @newRow) . "\n";
  }

  ...
  $csv = join '', @csv;      # if necessary
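Or, to avoid building the single scalar at all, print the rows straight to an output filehandle ($fh here is assumed to be opened elsewhere; it is not part of the code above):

  print {$fh} @csv;   # each element already ends in "\n"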
