开发者

Faster way to check for element in array?

This function does the same as exists does with hashes.

I plan on use it a lot.

Can it be optimized in some way?

my @a 开发者_运维知识库= qw/a b c d/;

my $ret = array_exists("b", @a);

sub array_exists {
    my ($var, @a) = @_;

    foreach my $e (@a) {
        if ($var eq $e) {
            return 1;
        }
    }
    return 0;
}


If you have to do this a lot on a fixed array, use a hash instead:

 my %hash = map { $_, 1 } @array;

 if( exists $hash{$key} ) { ... }

Some people reach for the smart match operator, but that's one of the features that we need to remove from Perl. You need to decide if this should match, where the array hold an array reference that has a hash reference with the key b:

use 5.010;

my @a = (
    qw(x y z),
    [ { 'b' => 1 } ],
    );

say 'Matches' if "b" ~~ @a; # This matches

Since the smart match is recursive, if keeps going down into data structures. I write about some of this in Rethinking smart matching.


You can use smart matching, available in Perl 5.10 and later:

if ("b" ~~ @a) {
    # "b" exists in @a
}

This should be much faster than a function call.


I'd use List::MoreUtils::any.

my $ret = any { $_ eq 'b' } @a;


Since there are lots of similar questions on StackOverflow where different "correct answers" return different results, I tried to compare them. This question seems to be a good place to share my little benchmark.

For my tests I used a test set (@test_set) of 1,000 elements (strings) of length 10 where only one element ($search_value) matches a given string.

I took the following statements to validate the existence of this element in a loop of 100,000 turns.

_grep

grep( $_ eq $search_value, @test_set )

_hash

{ map { $_ => 1 } @test_set }->{ $search_value }

_hash_premapped

$mapping->{ $search_value }
  • uses a $mapping that is precalculated as $mapping = { map { $_ => 1 } @test_set } (which is included in the final measuring)

_regex

sub{ my $rx = join "|", map quotemeta, @test_set; $search_value =~ /^(?:$rx)$/ }

_regex_prejoined

$search_value =~ /^(?:$rx)$/
  • uses a regular expression $rx that is precalculated as $rx = join "|", map quotemeta, @test_set; (which is included in the final measuring)

_manual_first

sub{ foreach ( @test_set ) { return 1 if( $_ eq $search_value ); } return 0; }

_first

first { $_ eq $search_value } @test_set
  • from List::Util (version 1.38)

_smart

$search_value ~~ @test_set

_any

any { $_ eq $search_value } @test_set
  • from List::MoreUtils (version 0.33)

On my machine ( Ubuntu, 3.2.0-60-generic, x86_64, Perl v5.14.2 ) I got the following results. The shown values are seconds and returned by gettimeofday and tv_interval of Time::HiRes (version 1.9726).

Element $search_value is located at position 0 in array @test_set

_hash_premapped:    0.056211
_smart:             0.060267
_manual_first:      0.064195
_first:             0.258953
_any:               0.292959
_regex_prejoined:   0.350076
_grep:              5.748364
_regex:             29.27262
_hash:              45.638838

Element $search_value is located at position 500 in array @test_set

_hash_premapped:    0.056316
_regex_prejoined:   0.357595
_first:             2.337911
_smart:             2.80226
_manual_first:      3.34348
_any:               3.408409
_grep:              5.772233
_regex:             28.668455
_hash:              45.076083

Element $search_value is located at position 999 in array @test_set

_hash_premapped:    0.054434
_regex_prejoined:   0.362615
_first:             4.383842
_smart:             5.536873
_grep:              5.962746
_any:               6.31152
_manual_first:      6.59063
_regex:             28.695459
_hash:              45.804386

Conclusion

The fastest method to check the existence of an element in an array is using prepared hashes. You of course buy that by an proportional amount of memory consumption and it only makes sense if you search for elements in the set multiple times. If your task includes small amounts of data and only a single or a few searches, hashes can even be the worst solution. Not the same way fast, but a similar idea would be to use prepared regular expressions, which seem to have a smaller preparation time.

In many cases, a prepared environment is no option.

Surprisingly List::Util::first has very good results, when it comes to the comparison of statements, that don't have a prepared environment. While having the search value at the beginning (which could be perhaps interpreted as the result in smaller sets, too) it is very close to the favourites ~~ and any (and could even be in the range of measurement inaccuracy). For items in the middle or at the end of my larger test set, first is definitely the fastest.


brian d foy suggested using a hash, which gives O(1) lookups, at the cost of slightly more expensive hash creation. There is a technique that Marc Jason Dominus describes in his book Higher Order Perl where by a hash is used to memoize (or cache) results of a sub for a given parameter. So for example, if findit(1000) always returns the same thing for the given parameter, there's no need to recalculate the result every time. The technique is implemented in the Memoize module (part of the Perl core).

Memoizing is not always a win. Sometimes the overhead of the memoized wrapper is greater than the cost of calculating a result. Sometimes a given parameter is unlikely to ever be checked more than once or a relatively few times. And sometimes it cannot be guaranteed that the result of a function for a given parameter will always be the same (ie, the cache can become stale). But if you have an expensive function with stable return values per parameter, memoization can be a big win.

Just as brian d foy's answer uses a hash, Memoize uses a hash internally. There is additional overhead in the Memoize implementation, but the benefit to using Memoize is that it doesn't require refactoring the original subroutine. You just use Memoize; and then memoize( 'expensive_function' );, provided it meets the criteria for benefitting from memoization.

I took your original subroutine and converted it to work with integers (just for simplicity in testing). Then I added a second version that passed a reference to the original array rather than copying the array. With those two versions, I created two more subs that I memoized. I then benchmarked the four subs.

In benchmarking, I had to make some decisions. First, how many iterations to test. The more iterations we test, the more likely we are to have good cache hits for the memoized versions. Then I also had to decide how many items to put into the sample array. The more items, the less likely to have cache hits, but the more significant the savings when a cache hit occurs. I ultimately decided on an array to be searched containing 8000 elements, and chose to search through 24000 iterations. That means that on average there should be two cache hits per memoized call. (The first call with a given param will write to the cache, while the second and third calls will read from the cache, so two good hits on average).

Here is the test code:

use warnings;
use strict;
use Memoize;
use Benchmark qw/cmpthese/;

my $n = 8000; # Elements in target array
my $count = 24000; # Test iterations.

my @a = ( 1 .. $n );
my @find = map { int(rand($n)) } 0 .. $count;
my ( $orx, $ormx, $opx, $opmx ) = ( 0, 0, 0, 0 );

memoize( 'orig_memo' );
memoize( 'opt_memo'  );

cmpthese( $count, {
    original  => sub{ my $ret =  original( $find[ $orx++  ],  @a ); },
    orig_memo => sub{ my $ret = orig_memo( $find[ $ormx++ ],  @a ); },
    optimized => sub{ my $ret = optimized( $find[ $opx++  ], \@a ); },
    opt_memo  => sub{ my $ret =  opt_memo( $find[ $opmx++ ], \@a ); }
} );

sub original {
    my ( $var, @a) = @_;
    foreach my $e ( @a ) {
        return 1 if $var == $e;
    }
    return 0;
}

sub orig_memo {
    my ( $var, @a ) = @_;
    foreach my $e ( @a ) {
        return 1 if $var == $e;
    }
    return 0;
}

sub optimized {
    my( $var, $aref ) = @_;
    foreach my $e ( @{$aref} ) {
        return 1 if $var == $e;
    }
    return 0;
}

sub opt_memo {
    my( $var, $aref ) = @_;
    foreach my $e ( @{$aref} ) {
        return 1 if $var == $e;
    }
    return 0;
}

And here are the results:

             Rate orig_memo  original optimized  opt_memo
orig_memo   876/s        --      -10%      -83%      -94%
original    972/s       11%        --      -82%      -94%
optimized  5298/s      505%      445%        --      -66%
opt_memo  15385/s     1657%     1483%      190%        --

As you can see, the memoized version of your original function was actually slower. That's because so much of the cost of your original subroutine was spent in making copies of the 8000 element array, combined with the fact that there is additional call-stack and bookkeeping overhead with the memoized version.

But once we pass an array reference instead of a copy, we remove the expense of passing the entire array around. Your speed jumps considerably. But the clear winner is the optimized (ie, passing array refs) version that we memoized (cached), at 1483% faster than your original function. With memoization the O(n) lookup only happens the first time a given parameter is checked. Subsequent lookups occur in O(1) time.

Now you would have to decide (by Benchmarking) whether memoization helps you. Certainly passing an array ref does. And if memoization doesn't help you, maybe brian's hash method is best. But in terms of not having to rewrite much code, memoization combined with passing an array ref may be an excellent alternative.


Your current solution iterates through the array before it finds the element it is looking for. As such, it is a linear algorithm.

If you sort the array first with a relational operator (>for numeric elements, gt for strings) you can use binary search to find the elements. It is a logarithmic algorithm, much faster than linear.

Of course, one must consider the penalty of sorting the array in the first place, which is a rather slow operation (n log n). If the contents of the array you are matching against change often, you must sort after every change, and it gets really slow. If the contents remain the same after you've initially sorted them, binary search ends up being practically faster.


You can use grep:

sub array_exists {
  my $val = shift;
  return grep { $val eq $_ } @_;
}

Surprisingly, it's not off too far in speed from List::MoreUtils' any(). It's faster if your item is at the end of the list by about 25% and slower by about 50% if your item is at the start of the list.

You can also inline it if needed -- no need to shove it off into a subroutine. i.e.

if ( grep { $needle eq $_ } @haystack ) {
  ### Do something
  ...
}
0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜