In Perl, how does a hash store data in memory?

2023-01-10 06:29

I have a big XML file, and parsing it consumes a lot of memory.

I believed most of that was due to the many user names in the file, so I shortened each user name from ~28 bytes to 10 bytes and ran the parser again, but it still takes almost the same amount of memory.

So far the XML file is parsed with SAX, and during handling the result is stored in a hash structure like this:

$this->{'date'}->{'school 1'}->{$class}->{$student}...

Why is the memory usage still so high after I reduced the length of the student names? Is it possible that when data is stored in a hash there is a lot of overhead no matter how long the strings are?

Perl hashes use a technique known as bucket-chaining. All keys that have the same hash (see the macro PERL_HASH_INTERNAL in hv.h) go in the same “bucket,” a linear list.

According to the perldata documentation:

If you evaluate a hash in scalar context, it returns false if the hash is empty. If there are any key/value pairs, it returns true; more precisely, the value returned is a string consisting of the number of used buckets and the number of allocated buckets, separated by a slash. This is pretty much useful only to find out whether Perl's internal hashing algorithm is performing poorly on your data set. For example, you stick 10,000 things in a hash, but evaluating %HASH in scalar context reveals "1/16" , which means only one out of sixteen buckets has been touched, and presumably contains all 10,000 of your items. This isn't supposed to happen. If a tied hash is evaluated in scalar context, a fatal error will result, since this bucket usage information is currently not available for tied hashes.

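To see whether your dataset has a pathological distribution, you could inspect the various levels in scalar context (on Perl 5.26 and later, scalar(%hash) returns the number of keys instead, and Hash::Util::bucket_ratio reports the bucket usage), e.g.,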

print scalar(%$this), "\n",
      scalar(%{ $this->{date} }), "\n",
      scalar(%{ $this->{date}{"school 1"} }), "\n",
      ...

For a somewhat dated overview, see How Hashes Really Work at perl.com.
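
The bucket check above can be wrapped in a small helper that works on both older and newer perls. This is only a sketch: the helper name bucket_usage and the sample data are made up for illustration, and it assumes that Hash::Util::bucket_ratio is available on Perl 5.26+ and that older perls still return the "used/allocated" string from scalar(%hash).

```perl
use strict;
use warnings;
use Hash::Util ();

# Hypothetical helper: report bucket usage ("used/allocated") for a hash ref.
# Newer perls return only the key count from scalar(%h), so use
# Hash::Util::bucket_ratio when the installed Hash::Util provides it.
sub bucket_usage {
    my ($href) = @_;
    my $ratio = Hash::Util->can('bucket_ratio');
    return $ratio ? $ratio->($href) : scalar(%$href);
}

# Made-up sample data shaped like the structure in the question.
my $this = {
    date => {
        'school 1' => {
            'class A' => { map { ("student$_" => 1) } 1 .. 1000 },
        },
    },
};

print "top level: ", bucket_usage($this), "\n";
print "date:      ", bucket_usage($this->{date}), "\n";
print "class A:   ", bucket_usage($this->{date}{'school 1'}{'class A'}), "\n";
```

If the used count is close to the allocated count at each level, the keys are spreading well and the memory is going to per-entry overhead rather than to a bad hash distribution.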

The modest reduction in the lengths of students' names, keys that are four levels down, won't make a significant difference. In general, the perl implementation has a strong bias toward throwing memory at problems. It ain't your father's FORTRAN.

Yes - there is a LOT of overhead. If possible, don't store the data as a full tree, especially since you're using a SAX parser, which frees you from the need to do so that a DOM parser would impose.

If you MUST store the entire tree, one possible workaround is storing arrays of arrays: e.g., you store all student names in an array (with, say, "mary123456" being stored in $students[11]), and then store a hash value that would have been ...->{"mary123456"} as ->[11] instead.

It WILL increase processing time due to the extra layers of indirection, but might decrease it overall due to less memory usage and thus less swapping/thrashing.
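
A minimal sketch of that workaround follows. The names student_id, record, %index_of and %tree are hypothetical, and the record layout (one value per student per class) is an assumption, since the question elides what sits below the student level.

```perl
use strict;
use warnings;

my @students;    # index -> name; each distinct name is stored once
my %index_of;    # name -> index; only needed while parsing
my %tree;        # school -> class -> [ value per student index ]

# Return the numeric id for a name, assigning the next free id if it is new.
sub student_id {
    my ($name) = @_;
    $index_of{$name} //= do { push @students, $name; $#students };
    return $index_of{$name};
}

# Store a value under the student's array index instead of a hash key.
sub record {
    my ($school, $class, $name, $value) = @_;
    $tree{$school}{$class}[ student_id($name) ] = $value;
}

record('school 1', 'class A', 'mary123456', 90);
record('school 1', 'class A', 'john654321', 85);
record('school 1', 'class B', 'mary123456', 70);

my $id = student_id('mary123456');
print "$students[$id] in class B: $tree{'school 1'}{'class B'}[$id]\n";
```

One caveat with this layout: a class array indexed by a global student id becomes sparse when each class holds only a few of many students, so it may be worth assigning ids per class or measuring both variants before committing to it.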

Another option is using hashes tied to files, though it would be REALLY slow due to disk IO bottleneck, of course.
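
As a rough illustration of the tied-hash option, the sketch below uses SDBM_File, which ships with Perl. The file name, the flat_key helper and the "|" separator are assumptions made for the example. DBM files store flat strings, so the nested levels are joined into a single key, and SDBM limits the size of each key/value pair, so a different DBM module may suit larger records.

```perl
use strict;
use warnings;
use Fcntl;                      # O_RDWR, O_CREAT, O_RDONLY
use SDBM_File;
use File::Temp qw(tempdir);

my $dir  = tempdir(CLEANUP => 1);
my $path = "$dir/students";     # hypothetical name; SDBM adds .pag and .dir

tie my %on_disk, 'SDBM_File', $path, O_RDWR | O_CREAT, 0666
    or die "Cannot tie $path: $!";

# Join the nested levels into one flat key.
# Assumes that none of the parts contains the "|" separator.
sub flat_key { return join '|', @_ }

my $key = flat_key('date', 'school 1', 'class A', 'mary123456');
$on_disk{$key} = 90;            # written to disk, not kept in a Perl hash
print "$key => $on_disk{$key}\n";

untie %on_disk;
```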

It may be useful to use the Devel::Size module that can report back how big various data structures are:

use Devel::Size qw(total_size);
print "Total Size is: ".total_size($hashref)."\n";
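
To check how much the key length really matters, one could measure two otherwise identical hashes. This is a sketch with made-up keys and a hypothetical helper name (hash_size). Devel::Size comes from CPAN rather than core Perl, so the code guards against it being absent.

```perl
use strict;
use warnings;

# Devel::Size is a CPAN module; load it only if it is installed.
my $have_devel_size = eval { require Devel::Size; 1 };

# Build a hash of $count zero-padded numeric keys, each $len bytes long,
# and return its total size in bytes (undef without Devel::Size).
sub hash_size {
    my ($count, $len) = @_;
    my %h = map { (sprintf("%0${len}d", $_) => 1) } 1 .. $count;
    return $have_devel_size ? Devel::Size::total_size(\%h) : undef;
}

if ($have_devel_size) {
    for my $len (10, 28) {
        printf "%2d-byte keys: %d bytes total\n", $len, hash_size(10_000, $len);
    }
}
else {
    print "Devel::Size is not installed\n";
}
```

Dividing each total by the number of keys gives a per-entry cost, which can then be compared with the 18 bytes saved per name.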
