Efficiency problem while building CSV in Perl

I've written a script in Perl which takes an XML file, parses it, and builds a CSV. Taking the XML, parsing it, and sorting it all goes smoothly, but once I get into larger datasets (i.e. building a CSV with 10000 rows and 260 columns) the script takes a huge amount of time (~1 hr) building the CSV string. I understand that Perl probably isn't the best at string concatenation, but I would have thought it would be more efficient than this.

Basically, for the sake of sorting, I have two hashes of arrays. One hash contains the arrays I use for sorting. The other hash contains arrays for all of the other columns (columns I want to write into the CSV but which have no relevance to how I sort). So my problem code (and I have verified this is the code block taking forever) is like this:

  my $csv = "Header1, Header2, Header3, Header4,...,HeaderN-1,HeaderN\n";
  foreach my $index (@orderedIndecies) {
    my @records = @{$primaryFields{"Important Field 1"}};
    $csv .= $records[$index] ? "$records[$index]," : ",";
    $csv .= $primaryIndex[$index] >= 0 ? "$primaryIndex[$index]," : ",";
    @records = @{$primaryFields{"Important Field 2"}};
    $csv .= $records[$index] ? "$records[$index]," : ",";
    foreach my $key (@keys) {
      @records = @{$csvContent{$key}};
      if($key eq $last) {
        $csv .= $records[$index] ? "$records[$index]\n" : "\n";
      } else {
        $csv .= $records[$index] ? "$records[$index]," : ",";
      }
    }
  }

I have also tried the same thing using join instead of ".=". I've also tried forgoing the string aggregation altogether and writing directly to a file. Neither of these seemed to help much. I'll be the first to admit that my knowledge of memory management in Perl probably isn't the greatest, so please feel free to school me (constructively). Also, if you think this is something I should consider rewriting outside of Perl, please let me know.

EDIT: Some sample xml (please keep in mind I'm not in the position to edit the structure of the xml):

<fields>
  <field>
    <Name>IndicesToBeSorted</Name>
    <Records>idx12;idx14;idx18;...idxN-1;idxN</Records>
  </field>
  <field>
    <Name>Important Field1</Name>
    <Records>val1;val2;;val4;...;valn-1;valn</Records>
  </field>
  <field>
    <Name>Important Field2</Name>
    <Records>val1;val2;;val4;...;valn-1;valn</Records>
  </field>
  <field>
    <Name>Records...</Name>
    <Records>val1;val2;;val4;...;valn-1;valn</Records>
  </field>
  <field>
    <Name>More Records...</Name>
    <Records>val1;val2;;val4;...;valn-1;valn</Records>
  </field>
</fields>

The position of a record in one field corresponds to the position in another field. For example, the first item from each "Records" element is associated with the others and together they make up one row of my CSV. So basically, my script parses all of this and creates an array of ordered indices (which is what is in @orderedIndecies in my example). @orderedIndecies contains data like...

print "$orderedInecies[0]\n"  #prints index of location of idx0 
print "$orderedInecies[1]\n"  #prints index of location of idx1
print "$orderedInecies[2]\n"  #prints index of location of idx2
print "$orderedInecies[3]\n"  #prints index of location of idx3

I do things this way because the indices come out of the XML out of order, and I didn't want to move all of the data around.
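A minimal sketch of how an index array like this can be built (the names here are illustrative, not from my actual script):

  # Illustrative only: sort the positions 0..N-1 by their key values
  # instead of moving the record arrays themselves around.
  my $indexRecords = "idx12;idx14;idx18";   # stand-in for the IndicesToBeSorted value
  my @sortKeys     = split /;/, $indexRecords;
  my @orderedIndecies =
      sort { $sortKeys[$a] cmp $sortKeys[$b] } 0 .. $#sortKeys;
  # (cmp sorts lexically; keys like "idx12" may need a numeric sort
  # on their suffix instead.)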

EDIT: FINAL ANSWER

  open my $csv_fh, ">", $$fileNameRef or die "$$fileNameRef: $!";
  print $csv_fh "Important Field 1,Index Field,Important Field 2";

  # Defining $comma, $endl, $empty allows me to do something like:
  #
  #                    print $csv_fh $val ? $val : $empty;
  #                    print $csv_fh $comma;
  # 
  # As opposed to....
  #
  #                    print $csv_fh $val ? "$val," : ",";
  #
  # Note, the first method avoids the string aggregation of "$val,"
  my $comma = ",";
  my $endl = "\n";
  my $empty = "";

  my @keys = sort(keys %csvContent);
  my $last = $keys[-1];
  foreach (@keys) {
    print $csv_fh $_;
    print $csv_fh $_ eq $last ? $endl : $comma;
  }

  # Even though the hash lookup is probably very efficient, I still
  # saw no need to redo it constantly, so I defined it here as 
  # opposed to inline within the for loops
  my @ImportantFields1 =  @{$primaryFields{"Important Field 1"}};
  my @ImportantFields2 =  @{$primaryFields{"Important Field 2"}};

  print "\n\n--------- BUILD CSV START ---------------\n\n";
  foreach my $index (@orderedIndecies) {
    print $csv_fh exists $ImportantFields1[$index] ? $ImportantFields1[$index] : $empty;
    print $csv_fh $comma;
    print $csv_fh $originalIndexField[$index] >= 0 ? $originalIndexField[$index] : $empty;
    print $csv_fh $comma;
    print $csv_fh exists $ImportantFields2[$index] ? $ImportantFields2[$index] : $empty;

    #If needed, this is where you would make sure to escape commas
    foreach my $key (@keys) {
      print $csv_fh $comma;
      print $csv_fh exists $csvContent{$key}[$index]
                         ? $csvContent{$key}[$index]
                         : $empty;
    }
    print $csv_fh $endl;
  }  

  print "\n\n------- CSV Contents wrtten to file -----------\n\n"
  close($csv_fh);

Thanks for the help guys :D


I put together a small program that creates an array of 100,000 random strings and another of 260 random strings, then loops over them, concatenating and printing out how long it has taken to reach that part of the loop. For the first 70,000 values concatenated, this is the time my program takes:

On Key 0, has been 00:00:00
On value 0, has been 00:00:00
On value 10000, has been 00:00:05.4373956
On value 20000, has been 00:00:22.3901951
On value 30000, has been 00:00:51.1552678
On value 40000, has been 00:01:31.0138775
On value 50000, has been 00:02:26.4659378
On value 60000, has been 00:03:32.6834164
On value 70000, has been 00:04:48.4788361

As you can see, repeated concatenation is generally not a great way to handle this kind of operation. Even though the amount of data is fixed, the time per append keeps growing, because the ever-longer string has to be copied in memory over and over.
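The original benchmark program isn't shown above (and may not have been Perl); a rough Perl stand-in would look like the sketch below. Note that Perl over-allocates string buffers, so .= may not degrade as sharply as the numbers above suggest:

  use strict;
  use warnings;
  use Time::HiRes qw(time);

  # Illustrative stand-in for the benchmark described above: append
  # 100,000 short random strings with .= and report elapsed time
  # every 10,000 values.
  my @values = map { join '', map { chr(65 + int rand 26) } 1 .. 10 } 1 .. 100_000;

  my $start = time;
  my $out   = '';
  for my $i (0 .. $#values) {
      $out .= "$values[$i],";
      printf "On value %d, has been %.2fs\n", $i, time() - $start
          if $i % 10_000 == 0;
  }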

The alternative is to do what is usually recommended in Perl: write it to disk! Don't do any concatenation at all; write each part, as you get it, to disk. My sample program, rewritten this way, finished its last line at 23.2183042 seconds. It'd look something like this:

# a couple utilities

use File::Temp qw/tempdir/;
use File::Spec::Functions; # catfile

# ... and later

my $dir = tempdir();  # give me a temporary, empty place to put it
my $file = catfile($dir, 'temp.csv');

open my $fh, '>', $file
    or die "Can't open '$file' for write: $!";

print {$fh} "Header1, Header2, Header3, Header4,...,HeaderN-1,HeaderN\n";
foreach my $index (@orderedIndecies) {
    my @records = @{$primaryFields{"Important Field 1"}};
    print {$fh} $records[$index] ? "$records[$index]," : ",";
    print {$fh} $primaryIndex[$index] >= 0 ? "$primaryIndex[$index]," : ",";
    @records = @{$primaryFields{"Important Field 2"}};
    print {$fh} $records[$index] ? "$records[$index]," : ",";
    foreach my $key (@keys) {
        @records = @{$csvContent{$key}};
        if($key eq $last) {
            print {$fh} $records[$index] ? "$records[$index]\n" : "\n";
        } else {
            print {$fh} $records[$index] ? "$records[$index]," : ",";
        }
    }
}

close $fh
    or warn "Can't close '$file' for some reason: $!";

Now there is no concatenation, no memory copying. It should go really, really fast.


If you use the following modules instead of hand-rolled code, you will see at least some speed increase, and I'd expect it to be pretty dramatic: Text::CSV_XS for the CSV writing (and reading, if you need it) and XML::LibXML for the XML parsing. They each use C under the hood.

String concatenation can be a sign of trouble for robustness as well as speed.
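A minimal sketch of that approach, assuming the fields/field/Name/Records layout from the question; the file names and the sorted column order here are placeholders, not requirements of either module:

  use strict;
  use warnings;
  use XML::LibXML;
  use Text::CSV_XS;

  # Parse the question's XML into field-name => array-of-values.
  my $doc = XML::LibXML->load_xml(location => 'input.xml');  # placeholder path
  my %columns;
  for my $field ($doc->findnodes('/fields/field')) {
      my $name = $field->findvalue('Name');
      # The -1 keeps trailing empty values, matching the ";;"-style gaps.
      $columns{$name} = [ split /;/, $field->findvalue('Records'), -1 ];
  }

  # Text::CSV_XS quotes embedded commas and newlines for us.
  my $csv = Text::CSV_XS->new({ binary => 1, eol => "\n" });
  open my $fh, '>', 'out.csv' or die "out.csv: $!";

  my @names = sort keys %columns;     # placeholder column order
  $csv->print($fh, \@names);
  my $rows = @names ? scalar @{ $columns{$names[0]} } : 0;
  for my $i (0 .. $rows - 1) {
      $csv->print($fh, [ map { $columns{$_}[$i] // '' } @names ]);
  }
  close $fh or die "out.csv: $!";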


Is it necessary to assemble the whole output into the one variable $csv? A saner approach would be to use an array containing, say, one element for each record. Then you can just print the array to your output stream or, if you insist, use join to concatenate the array elements into a single scalar.

Within each record, you can also assemble the individual fields into an array and join them at the end:

  my @csv = ("Header1, Header2, Header3, Header4,...,HeaderN-1,HeaderN\n");
  foreach my $index (@orderedIndecies) {
    my @records = @{$primaryFields{"Important Field 1"}};
    my @newRow = ();
    push @newRow, $records[$index] ? $records[$index] : "";
    # alternatively:  push @newRow, $records[$index] || "";
    push @newRow, $primaryIndex[$index]>=0 ? $primaryIndex[$index] : "";
    @records = @{$primaryFields{"Important Field 2"}};
    push @newRow, $records[$index] ? $records[$index] : "";
    foreach my $key (@keys) {
      @records = @{$csvContent{$key}};
      push @newRow, $records[$index] ? $records[$index] : "";
    }
    push @csv, join(",", @newRow) . "\n";
  }

  ...
  $csv = join '', @csv;      # if necessary
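Or, to avoid building the single scalar at all, print the rows straight to an output filehandle ($fh here is assumed to be opened elsewhere; it is not part of the code above):

  print {$fh} @csv;   # each element already ends in "\n"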
