开发者

How do I sort a Perl hash that has tons of data?

I am sorting a hash in Perl. I encountered an Out of memory error when running my Perl Sc开发者_JAVA技巧ript:

foreach $key (sort (keys(%hash))) {
   ....
}

How do I sort a hash that has tons of data?


sort keys %hash is inefficient for a large %hash in that, memory wise, its roughly equivalent to:

my @keys = keys %hash;
@keys = sort @keys;

In that it has to keep three copies of the keys in memory while doing the sorting (one in the hash, one in the list of keys, one in the sorted list being created). foreach's memory optimizations for iterators do not apply.

Since the hash is so large, the best option is to get it entirely out of memory. Stick it in a BerkeleyDB file. And if you want to keep the keys in order a hash isn't the best option, a tree is. I'd suggest using a Berkeley BTree file. Trees will efficiently keep your data sorted like an array while providing fast lookup like a hash.

Here's an example using BerkeleyDB. DB_File is simpler and better documented but does not take advantage of modern features of BerkeleyDB. YMMV.

use BerkeleyDB;

my $db  = tie my %hash, 'BerkeleyDB::Btree',
              -Filename => "your.db",
              -Compare  => sub { $_[1] cmp $_[0] },
              -Flags    => DB_CREATE;

-Compare illustrates how to supply your own sorting function. The tied interface will be sluggish. Unless you need it to act like a hash, use the object interface.


Perl FAQ has some examples to sort a hash. Look at How do I sort a hash? and here is A Fresh Look at Efficient Perl Sorting.


If your keys are integers, numbers or strings of a small maximum size, you can use Sort::Packed:

use Sort::Packed qw(sort_packed);

my $hash_size = keys %hash;
my $max_key_len = 4;  
my $packed_keys = '\0' x ($max_key_len * $hash_size);
my $ix = 0;
while (my ($key, $value) = each %hash) {
  my $key_len = length $k;
  $key_len <= $max_key_len or die "key $key is too big";
  substr($packed_keys, $ix, $key_len, $key);
  $ix += $max_key_len;
}

sort_packed("C$max_key_len", $packed_keys);

$ix = 0;
while ($ix < length $packed_keys) {
  my $key = substr($packed_keys, $ix, $max_key_len);
  $key =~ s/\0+$//;
  print "$key\n";
  $ix += $max_key_len;
}

Admittedly, this code is quite ugly, but it will keep memory usage to the minimum.

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜