What's the most efficient way to check for duplicates in an array of data using Perl?

2023-01-03 10:00

I need to see if there are duplicates in an array of strings. What's the most time-efficient way of doing it?

One of the things I love about Perl is its ability to almost read like English. It just sort of makes sense.

use strict;
use warnings;

my @array = qw/yes no maybe true false false perhaps no/;

my %seen;

foreach my $string (@array) {

    next unless $seen{$string}++;
    print "'$string' is duplicated.\n";
}

Output

'false' is duplicated.

'no' is duplicated.

Turning the array into a hash is the fastest way [O(n)], though it's memory-inefficient. Using a for loop is a bit faster than grep, but I'm not sure why.

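#!/usr/bin/perl

use strict;
use warnings;

# Example input; under "use strict" the array must be declared.
my @array = qw/yes no maybe true false false perhaps no/;

my %count;   # how many times each string has been seen so far
my %dups;    # strings that have been seen more than once
for (@array) {
    $dups{$_}++ if $count{$_}++;
}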

A memory efficient way is to sort the array in place and iterate through it looking for equal and adjacent entries.

# not exactly sort in place, but Perl does a decent job optimizing it
@array = sort @array;

my $last;
my %dups;
for my $entry (@array) {
    $dups{$entry}++ if defined $last and $entry eq $last;
    $last = $entry;
}

This is O(n log n) time because of the sort, but it only needs to store the duplicates rather than a second copy of the data in %count. Worst-case memory usage is still O(n) (when everything is duplicated), but if your array is large and there are not a lot of duplicates, you'll win.

Theory aside, benchmarking shows the latter starts to lose on large arrays (like over a million) with a high percentage of duplicates.
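To check this on your own data, a rough sketch with the core Benchmark module might look like the following. The sub names dups_by_hash and dups_by_sort and the generated test data are made up for illustration; the original benchmark setup is not shown in this thread, so this is not a reproduction of it.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Hypothetical test data: 10,000 strings, of which "item1" .. "item1000"
# each appear twice.
my @data = map { "item" . ($_ % 9_000) } 1 .. 10_000;

# Hash approach: O(n) time, extra memory for a count of every distinct string.
sub dups_by_hash {
    my (%count, %dups);
    for (@_) {
        $dups{$_}++ if $count{$_}++;
    }
    return sort keys %dups;
}

# Sort approach: O(n log n) time, only the duplicates are stored in a hash.
sub dups_by_sort {
    my @sorted = sort @_;
    my ($last, %dups);
    for my $entry (@sorted) {
        $dups{$entry}++ if defined $last and $entry eq $last;
        $last = $entry;
    }
    return sort keys %dups;
}

# Run each approach for at least one CPU second and print a comparison table.
cmpthese(-1, {
    hash => sub { my @d = dups_by_hash(@data) },
    sort => sub { my @d = dups_by_sort(@data) },
});
```

The numbers depend heavily on the array size, the share of duplicates, and the Perl version, so it is worth swapping in data that resembles your real input.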

If you need the uniquified array anyway, it is fastest to use the heavily-optimized library List::MoreUtils, and then compare the result to the original:

use strict;
use warnings;
use List::MoreUtils 'uniq';

my @array = qw(1 1 2 3 fibonacci!);
my @array_uniq = uniq @array;
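# Parenthesize the whole argument; otherwise "print (...)" is parsed as a
# complete function call and " found!\n" is never printed.
print((@array == @array_uniq ? "no dupes" : "dupes") . " found!\n");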

Or if the list is large and you want to bail as soon as a duplicate entry is found, use a hash:

my %uniq_elements;
foreach my $element (@array)
{
    die "dupe found!" if $uniq_elements{$element}++;
}
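If dying is too drastic, the same early exit can be wrapped in a sub that returns a boolean. This is a minimal sketch; has_duplicates is a name made up for illustration, and it assumes the list holds only defined strings.

```perl
use strict;
use warnings;

# Returns 1 as soon as any value is seen a second time, 0 otherwise.
# The loop stops at the first duplicate, so the rest of the list is not scanned.
sub has_duplicates {
    my %seen;
    for my $element (@_) {
        return 1 if $seen{$element}++;
    }
    return 0;
}

my @array = qw(1 1 2 3 fibonacci!);
print has_duplicates(@array) ? "dupes found!\n" : "no dupes found!\n";
```

With the sample array above, the sub returns after looking at only the second element.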

Create a hash or a set, or (in Python) use a collections.Counter().

As you encounter each string/input, check to see if there's an instance of it in the hash. If so, it's a duplicate (do whatever you want about those). Otherwise, add a value (such as, oh, say, the numeral one) to the hash, using the string as the key.

Example (using Python collections.Counter):

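import collections

mylist = ["yes", "no", "maybe", "no"]  # example input
counts = collections.Counter(mylist)
# items() works in Python 3; iteritems() exists only in Python 2
uniq = [i for i, c in counts.items() if c == 1]
dupes = [i for i, c in counts.items() if c > 1]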

These Counters are built around dictionaries (Python's name for hashed mapping collections).

This is time-efficient because hash lookups are indexed by key. In most cases, looking up or inserting a key takes near-constant time. (In fact, Perl "hashes" are so called because they are implemented using an algorithmic technique called "hashing": each key is run through a function, a sort of checksum, chosen so that arbitrary inputs rarely collide.)

If you initialize values to integers, starting with 1, then you can increment each value as you find its key already in the hash. This is just about the most efficient general purpose means of counting strings.
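Since the question is about Perl, the same counting idea might be sketched there as follows. The variable names and sample data are illustrative; Perl creates a missing hash entry on the first increment, so no explicit initialization to 1 is needed.

```perl
use strict;
use warnings;

my @strings = qw(yes no maybe true false false perhaps no);

# Count every string; a missing key starts at undef and becomes 1 on first ++.
my %counts;
$counts{$_}++ for @strings;

# Split the distinct strings by how often they occurred.
my @uniq  = grep { $counts{$_} == 1 } sort keys %counts;
my @dupes = grep { $counts{$_} > 1 }  sort keys %counts;

print "unique: @uniq\n";
print "dupes: @dupes\n";
```

The keys are sorted only to make the output order predictable; drop the sort if order does not matter.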

Not a direct answer, but this will return an array without duplicates:

#!/usr/bin/perl

use strict;
use warnings;

my @arr = ('a','a','a','b','b','c');
my %count;
my @arr_no_dups = grep { !$count{$_}++ } @arr;

print @arr_no_dups, "\n";

Please don't ask about the most time efficient way to do something unless you have some specific requirements, such as "I have to dedupe a list of 100,000 integers in under a second." Otherwise, you're worrying about how long something takes for no reason.

Similar to @Schwern's second solution, but this checks for duplicates a little earlier, from within the comparison function of sort:

use strict;
use warnings;

@_ = sort { print "dup = $a$/" if $a eq $b; $a cmp $b } @ARGV;

It won't be as fast as the hashing solutions, but it requires less memory and is pretty darn cute.
