What is 4/16 in hashes?
if (%hash){ print "That 开发者_开发知识库was a true value!\n"; }
That will be true if (and only if) the hash has at least one key-value pair.
The actual result is an internal debugging string useful to the people who maintain Perl. It looks something like "4/16," but the value is guaranteed to be true when the hash is nonempty, and false when it's empty. --Llama book
What is this 4/16? Can anyone show me a small program from where I can see that the result is 4/16?
From perldoc perldata:
If you evaluate a hash in scalar context, it returns false if the hash is empty. If there are any key/value pairs, it returns true; more precisely, the value returned is a string consisting of the number of used buckets and the number of allocated buckets, separated by a slash. This is pretty much useful only to find out whether Perl's internal hashing algorithm is performing poorly on your data set. For example, you stick 10,000 things in a hash, but evaluating %HASH in scalar context reveals "1/16" , which means only one out of sixteen buckets has been touched, and presumably contains all 10,000 of your items.
so, 4/16
would be the buckets used/allocated count, and something like the following will display this value:
%hash = (1, 2);
print scalar(%hash); #prints 1/8 here
A hash is an array of linked lists. A hashing function converts the key into a number which is used as the index of the array element ("bucket") into which to store the value. More than one key can hash to the same index ("collision"), a situation handled by the linked lists.
The denominator of the fraction is the total number of buckets.
The numerator of the fraction is the number of buckets which has one or more elements.
For hashes with the same number of elements, the higher the number, the better. The one that returns 6/8 has fewer collisions than the one that returns 4/8.
This is a slightly modified version of an email I sent to the Perl Beginners mailing list answering this same question.
Saying
my $hash_info = %hash;
Will get you either 0
(if the hash is empty) or the ratio of used to
total buckets. This information is almost, but not completely,
useless to you. To understand what that means you must first
understand how hashing works.
Lets implement a hash using Perl 5. The first thing we need is a hashing function. Hashing functions turn strings into, hopefully, unique numbers. Examples of real strong hashing functions are MD5 or SHA1, but they tend to be too slow for common use, so people tend to use weaker (i.e. ones that produce less unique output) functions for hash tables. Perl 5 uses Bob Jenkins [one-at-a-time] algorithm, which has a nice tradeoff of uniqueness to speed. For our example, I will use a very weak hashing function:
#!/usr/bin/perl
use strict;
use warnings;
sub weak_hash {
my $key = shift;
my $hash = 1;
#multiply every character in the string's ASCII/Unicode value together
for my $character (split //, $key) {
$hash *= ord $character;
}
return $hash;
}
for my $string (qw/cat dog hat/) {
print "$string hashes to ", weak_hash($string), "\n";
}
Because hashing functions tend to give back numbers that are from a range larger than we want, you usually use modulo to reduce the range of numbers it gives back:
#!/usr/bin/perl
use strict;
use warnings;
sub weak_hash {
my $key = shift;
my $hash = 1;
#multiply every character in the string's ASCII/Unicode value together
for my $character (split //, $key) {
$hash *= ord $character;
}
return $hash;
}
for my $string (qw/cat dog hat/) {
# the % operator is constraining the number
# weak_hash returns to 0 - 10
print "$string hashes to ", weak_hash($string) % 11, "\n";
}
Now that we have a hashing function, we need somewhere to save the key and value. This is called the hash table. The hash table is often an array whose elements are called buckets (these are the buckets that the ratio is talking about). A bucket will hold all of the key/value pairs that hash to the same number:
#!/usr/bin/perl
use strict;
use warnings;
sub weak_hash {
my $key = shift;
my $hash = 1;
for my $character (split //, $key) {
$hash *= ord $character;
}
return $hash;
}
sub create {
my ($size) = @_;
my @hash_table;
#set the size of the array
$#hash_table = $size - 1;
return \@hash_table;
}
sub store {
my ($hash_table, $key, $value) = @_;
#create an index into $hash_table
#constrain it to the size of the hash_table
my $hash_table_size = @$hash_table;
my $index = weak_hash($key) % $hash_table_size;
#push the key/value pair onto the bucket at the index
push @{$hash_table->[$index]}, {
key => $key,
value => $value
};
return $value;
}
sub retrieve {
my ($hash_table, $key) = @_;
#create an index into $hash_table
#constrain it to the size of the hash_table
my $hash_table_size = @$hash_table;
my $index = weak_hash($key) % $hash_table_size;
#get the bucket for this key/value pair
my $bucket = $hash_table->[$index];
#find the key/value pair in the bucket
for my $pair (@$bucket) {
return $pair->{value} if $pair->{key} eq $key;
}
#if key isn't in the bucket:
return undef;
}
sub list_keys {
my ($hash_table) = @_;
my @keys;
for my $bucket (@$hash_table) {
for my $pair (@$bucket) {
push @keys, $pair->{key};
}
}
return @keys;
}
sub print_hash_table {
my ($hash_table) = @_;
for my $i (0 .. $#$hash_table) {
print "in bucket $i:\n";
for my $pair (@{$hash_table->[$i]}) {
print "$pair->{key} => $pair->{value}\n";
}
}
}
my $hash_table = create(3);
my $i = 0;
for my $key (qw/a b c d g j/) {
store($hash_table, $key, $i++);
}
print_hash_table($hash_table);
print "the a key holds: ", retrieve($hash_table, "a"), "\n";
As we can see from this example, it is possible for one bucket have more key/value pairs than the others. This is a bad situation to be in. It cause the hash to be slow for that bucket. This is one of the uses of the ratio of used to total buckets that hashes return in scalar context. If the hash says that only a few buckets are being used, but they are lots of keys in the hash, then you know you have a problem.
To learn more about hashes, ask questions here about what I have said, or read about them.
Adding another answer because the first one is already too long.
Another approach to seeing what "4/16"
means is to use the Hash::Esoteric
module (warning alpha quality code). I wrote it to give me a better picture of what was going on inside of a hash so I could try to understand a performance problem that large hashes seem to have. The keys_by_bucket
function from Hash::Esoteric
will return all of the keys from a hash, but instead of returning them as a list like keys
does, it returns them as an AoA
where the top level represents buckets and the arrayref inside of it holds the keys in that bucket.
#!/user/bin/env perl
use strict;
use warnings;
use Hash::Esoteric qw/keys_by_bucket/;
my %hash = map { $_ => undef } "a" .. "g";
my $buckets = keys_by_bucket \%hash;
my $used;
for my $i (0 .. $#$buckets) {
if (@{$buckets->[$i]}) {
$used++;
}
print "bucket $i\n";
for my $key (@{$buckets->[$i]}) {
print "\t$key\n";
}
}
print "scalar %hash: ", scalar %hash, "\n",
"used/total buckets: $used/", scalar @$buckets, "\n";
The code above prints out something like (the actual data depends on the version of Perl) this:
bucket 0
e
bucket 1
c
bucket 2
a
bucket 3
g
b
bucket 4
bucket 5
d
bucket 6
f
bucket 7
scalar %hash: 6/8
used/total buckets: 6/8
The fraction is the fill rate of the hash: used buckets vs allocated buckets. Also sometimes called load factor.
To actually get "4/16" you'll need some tricks. 4 keys will lead to 8 buckets. Thus you need at least 9 keys, and then delete 5.
$ perl -le'%h=(0..16); print scalar %h; delete $h{$_} for 0..8; print scalar %h'
9/16
4/16
Note that your numbers will vary, as the seed is randomized, and you'll cannot predict the exact collisions
The fill rate is critical hash information when to rehash. Perl 5 rehashes at a fill rate of 100%, see the DO_HSPLIT
macro in hv.c
. Thus it trades memory for read-only speed. A normal fill rate would be between 80%-95%. You always leave holes to save some collisions. Lower fill rates lead to faster accesses (less collisions), but a higher number of rehashes.
You don't immediately see the number of collisions with the fraction. You need keys %hash
also, to compare to the numerator of the fraction, the used buckets number.
Thus one part of the collision quality is keys / used buckets:
my ($used, $max) = split '/',scalar(%hash);
keys %hash / $used;
But in reality you need to know the sum of the lengths of all linked lists in the buckets. You can access this quality with Hash::Util::bucket_info
($keys, $buckets, $used, @length_count)= Hash::Util::bucket_info(\%hash)
While hash access is normally O(1), with long lengths it is only O(n/2), esp. for the overlong buckets. At https://github.com/rurban/perl-hash-stats I provide statistical info of collision qualities for various hash functions for the perl5 core test suite data. I haven't tested tradeoffs for different fill rates yet, as I am rewriting the current hash tables completely.
Update: For perl5 a better fill rate than 100% would be 90%, as tested recently. But this depends on the used hash function. I used a bad and fast one: FNV1A. With better, slower hash functions you can use higher fill rates. The current default OOAT_HARD is bad AND slow, so should be avoided.
That (%hash)
evaluates the hash in a scalar context.
Here's an empty hash:
command_line_prompt> perl -le '%hash=(); print scalar %hash;'
The result is 0.
Here's a nonempty hash:
command_line_prompt> perl -le '%hash=(foo=>'bar'); print scalar %hash;'
The result is the string "1/8".
精彩评论