Sort uppercase just before lowercase keys from a hash
I have a hash and I want to sort it by its keys, with uppercase words appearing just before the lowercase words.
Example:
JANE
jane
JIM
jim
To get the keys in order, apply sort with a custom sort function on the keys of the hash.
my %hash = ( JANE => 1, jane => 2, JIM => 3, jim => 4 );
my @sorted_keys = sort {
    lc $a cmp lc $b
        || $a cmp $b
} keys %hash;
This custom sort function compares strings first as if they were of the same case, and if equal, takes case into account.
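As a minimal sketch of how the sorted keys might then be used, the same comparison can be put in a named sub and used to walk the hash in order. The sub name by_fold_then_case is made up for this illustration and is not part of the original answer.

```perl
use strict;
use warnings;

# Hypothetical helper: compare case-insensitively first, then fall back
# to a plain string comparison, in which uppercase ASCII sorts first.
sub by_fold_then_case {
    return lc $a cmp lc $b || $a cmp $b;
}

my %hash = ( JANE => 1, jane => 2, JIM => 3, jim => 4 );

my @sorted_keys = sort by_fold_then_case keys %hash;

# Prints JANE => 1, jane => 2, JIM => 3, jim => 4, one pair per line.
print "$_ => $hash{$_}\n" for @sorted_keys;
```

Keeping the comparison in a named sub makes it easy to reuse wherever the same ordering is needed.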
Unicode Collation
Although it may seem like overkill for this operation, the standard Unicode::Collate and Unicode::Collate::Locale modules are made for this sort of thing. They also sort non-ASCII data alphabetically, which the normal sort will not do.
use utf8;
@names = qw[ jim JANE jane JIM josé josie Mary María mark ];
@sorts = sort @names;
That gives you the sort order of
JANE JIM Mary María jane jim josie josé mark
which nobody wants. This is much better:
use utf8;
use Unicode::Collate;
@names = qw[ jim JANE jane JIM josé josie Mary María mark ];
$coll = Unicode::Collate->new;
@sorts = $coll->sort(@names);
That gives you
jane JANE jim JIM josé josie María mark Mary
If you want uppercase before lowercase, specify that this way:
use utf8;
use Unicode::Collate;
@names = qw[ jim JANE jane JIM josé josie Mary María mark ];
$coll = Unicode::Collate->new(upper_before_lower => 1);
@sorts = $coll->sort(@names);
print "@sorts\n";
which yields:
JANE jane JIM jim josé josie María mark Mary
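Applied to the hash from the question, the same collator could be used roughly like this. This is only a sketch: the hash contents are taken from the first answer, and the variable names are arbitrary.

```perl
use strict;
use warnings;
use Unicode::Collate;

my %hash = ( JANE => 1, jane => 2, JIM => 3, jim => 4 );

# upper_before_lower => 1 asks the collator to put "JANE" ahead of "jane".
my $coll = Unicode::Collate->new( upper_before_lower => 1 );

# The sort method takes a list and returns it in collation order.
my @sorted_keys = $coll->sort( keys %hash );

print "$_ => $hash{$_}\n" for @sorted_keys;
```

For pure ASCII keys like these, the result should match the hand-written comparison, but the collator keeps working when accented names show up later.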
Simple Compares
You can use a collation object’s cmp method on a pair of strings in the customary fashion, like this:
#!/usr/bin/env perl
use 5.10.1;
use strict;
use autodie;
use warnings qw[ FATAL all ];
use utf8;
use open qw[ :std IO :utf8 ];
use Unicode::Collate;
my @names = qw[ fum fee fie foe ];
my $coll = Unicode::Collate->new;
my @sorts = $coll->sort(@names);
say "@names => @sorts";
# Walk the list pairwise: (fum, fee), (fee, fie), (fie, foe).
# The lexicals are named $x and $y so they do not shadow sort's $a and $b.
for (
    my ($x, $y) = splice @names, 0, 2;
    2 == grep { defined } $x, $y;
    ($x, $y) = ($y, shift @names)
)
{
    # cmp returns -1, 0, or +1, like the built-in cmp operator.
    # A plain if/elsif chain avoids given/when, which newer perls deprecate.
    my $order = $coll->cmp($x, $y);
    if    ($order < 0)  { say "$x < $y" }
    elsif ($order == 0) { say "$x = $y" }
    else                { say "$x > $y" }
}
which produces:
fum fee fie foe => fee fie foe fum
fum > fee
fee < fie
fie < foe
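Because cmp behaves like the built-in operator, it can also go inside an ordinary sort block, which helps when the strings to compare are buried in a larger structure. The following is a sketch with made-up sample records; only Unicode::Collate's new and cmp methods are assumed.

```perl
use strict;
use warnings;
use Unicode::Collate;

my $coll = Unicode::Collate->new( upper_before_lower => 1 );

# Made-up sample records, for illustration only.
my @people = (
    { name => 'jim',  age => 41 },
    { name => 'JANE', age => 35 },
    { name => 'jane', age => 29 },
    { name => 'JIM',  age => 52 },
);

# Compare the name fields with the collator instead of the cmp operator.
my @by_name = sort { $coll->cmp( $a->{name}, $b->{name} ) } @people;

print "$_->{name} ($_->{age})\n" for @by_name;
```

This should list JANE, jane, JIM, jim in that order, each with its age.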
Fancier Alphabetic Sorts of Unicode
Now consider a list of words like this:
sát sot sät sét sæt ssét sat tot ßet SET set seat ſAT ſet saet SSET
If you run the default sort on that, you get the virtually useless:
SET SSET saet sat seat set sot ssét sát sät sæt sét tot ßet ſAT ſet
And a case-insensitive sort with a case-sensitive tie-break is really no better:
use utf8;
@names = qw[ sát sot sät sét sæt ssét sat tot ßet SET set seat ſAT ſet saet SSET ];
@sorts = sort {
    lc $a cmp lc $b
        ||
    $a cmp $b
} @names;
print "@sorts\n";
producing the still stupid-and-wrong:
saet sat seat SET set sot SSET ssét sát sät sæt sét tot ßet ſAT ſet
But here it is with a standard Unicode sort:
use utf8;
use Unicode::Collate;
@names = qw[ sát sot sät sét sæt ssét sat tot ßet SET set seat ſAT ſet saet SSET ];
$coll = Unicode::Collate->new(upper_before_lower => 1);
@sorts = $coll->sort(@names);
print "@sorts\n";
producing the ‘correcter’ (read: infinitely preferable) version of:
saet sæt sät sat sát ſAT seat SET set sét ſet sot SSET ssét ßet tot
Locale Sorts
The Unicode::Collate module is pretty fast, so you should not hesitate to use it for your routine character sorting needs. But sometimes that just isn’t enough. That’s because different languages have different ideas of alphabets.
- Latin (archaic): a b c d e f z h i k l m n o p q r s t v x
- Latin (classic): a b c d e f g h i k l m n o p q r s t v x y z
- Spanish (traditional): a b c ch d e f g h i j k l ll m n ñ o p q r rr s t u v w x y z
- Spanish (recent): a b c d e f g h i j k l m n ñ o p q r s t u v w x y z
- Catalan: a b c ç d e f g h i j k l m n o p q r s t u v w x y z
- Welsh: a b c ch d dd e f ff g ng h i l ll m n o p ph r rh s t th u w y
- Danish: a b c d e f g h i j k l m n o p q r s t u v w x y z æ ø å
- Icelandic: a á b d ð e é f g h i í j k l m n o ó p r s t u ú v x y ý þ æ ö
- Old English: a b c d e f ȝ/g h i k l m n o p q r s t v x y z & ⁊ ƿ þ ð æ
- Middle English: a b c d e f g h i k l m n o p q r ſ/s t v x y z ȝ ƿ þ ð æ
- Futhorc (transliterated): f u þ o r c ȝ w h n i j eo p x s t b e m l ŋ d œ a æ y ea io cw k st g
- Greek: α β γ δ ε ζ η θ ι κ λ μ ν ξ ο π ρ σ/ς τ υ φ χ ψ ω
- Cyrillic: а б в г д е ё ж з и й к л м н о п р с т у ф х ц ч ш щ ъ ы ь э ю я
- Cherokee: Ꭰ Ꭱ Ꭲ Ꭳ Ꭴ Ꭵ Ꭶ Ꭷ Ꭸ Ꭹ Ꭺ Ꭻ Ꭼ Ꭽ Ꭾ Ꭿ Ꮀ Ꮁ Ꮂ Ꮃ Ꮄ Ꮅ Ꮆ Ꮇ Ꮈ Ꮉ Ꮊ Ꮋ Ꮌ Ꮍ Ꮎ Ꮏ Ꮐ Ꮑ Ꮒ Ꮓ Ꮔ Ꮕ Ꮖ Ꮗ Ꮘ Ꮙ Ꮚ Ꮛ Ꮜ Ꮝ Ꮞ Ꮟ Ꮠ Ꮡ Ꮢ Ꮣ Ꮤ Ꮥ Ꮦ Ꮧ Ꮨ Ꮩ Ꮪ Ꮫ Ꮬ Ꮭ Ꮮ Ꮯ Ꮰ Ꮱ Ꮲ Ꮳ Ꮴ Ꮵ Ꮶ Ꮷ Ꮸ Ꮹ Ꮺ Ꮻ Ꮼ Ꮽ Ꮾ Ꮿ Ᏸ Ᏹ Ᏺ Ᏻ Ᏼ
BTW, those are also good examples why “ever hardcoding [a-z] into your program is always wrong, sometimes.” It’s full of idiotic and even insulting assumptions. Note that all but the last three of these are actually considered Latin alphabets! That’s the same script as we use in English. In representing English text, I’ve variously had to deal with learnèd, Æneid, poſt, Laȝamon, résumé, 1ˢᵗ, MᶜKinley, Van Dijke, Cañon City Colorado, œnology, Dzur, rôle, ⅷ, première, Bjørn, naïve, coöperate, façade, café, Merððyn, archæology, and even tschüß. Repeat the mantra: “Hardcoding [a-z] into your program is always wrong, sometimes.” Just Say No!
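To make the [a-z] point concrete, here is a small sketch comparing a hardcoded ASCII range with a Unicode property in a regex. The word list is borrowed from the examples above; \p{Alphabetic} is only one of several properties that might suit, depending on what you mean by a letter.

```perl
use strict;
use warnings;
use utf8;                             # this source contains UTF-8 literals
binmode STDOUT, ':encoding(UTF-8)';

my @words = qw( resume résumé naïve façade Bjørn );

# One flag per word: does the whole word match each pattern?
my @ascii_match   = map { /\A[a-zA-Z]+\z/       ? 1 : 0 } @words;
my @unicode_match = map { /\A\p{Alphabetic}+\z/ ? 1 : 0 } @words;

for my $i ( 0 .. $#words ) {
    printf "%-8s [a-zA-Z]: %-3s \\p{Alphabetic}: %s\n",
        $words[$i],
        $ascii_match[$i]   ? 'yes' : 'no',
        $unicode_match[$i] ? 'yes' : 'no';
}
```

Only the unaccented "resume" passes the hardcoded range, while all five words pass the property-based test.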
The Unicode::Collate::Locale module handles local sorting conventions. Just as English phonebooks and bookshelves have special ways of sorting names so that it doesn’t matter whether you’ve spelt something McBride or MacBride, the German-speaking world sorts their names such that Händel and Haendel are the same. That’s why without diacritics, one must obligatorily write über‑ as ueber‑ and Übermensch as Uebermensch. A locale sort knows to do this:
use utf8;
use Unicode::Collate::Locale;
@names = qw[ sát sot sät sét sæt ssét sat tot ßet SET set seat ſAT ſet saet SSET ];
$coll = Unicode::Collate::Locale->new(
    locale             => "de__phonebook",
    upper_before_lower => 1,
);
@sorts = $coll->sort(@names);
print "@sorts\n";
now produces
saet sæt sät sat sát ſAT seat SET set sét ſet sot SSET ssét ßet tot
Se habla castellano
It’s remarkable how different from one’s own other countries’ locale conventions can be. In the Spanish locale ("es"), ñ is a letter that comes after n and before o. That means that the correct sort of
raña rastrillo radio rana rápido ráfaga ranúnculo
is
radio ráfaga rana ranúnculo raña rápido rastrillo
Say those all really fast with a fully-rolled rr to loosen your tongue. :)
The "es__traditional" locale is a little different; historically, chocolate came after color in the Spanish dictionary, unlike the way it works in English. That’s because ch came after c and before d, while ll came after l and before m. That means that this sequence:
lástima laña llama ligante
cidra caliente color chocolate con churros
pero pera Perú perro periglo peste
sorts to
caliente cidra color con chocolate churros
laña lástima ligante llama
pera periglo pero perro Perú peste
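A short sketch contrasting the two Spanish locales on a handful of words follows. The word list is chosen for this illustration; the locale names "es" and "es__traditional" are the ones used above.

```perl
use strict;
use warnings;
use Unicode::Collate::Locale;

my @words = qw( color chocolate luz llama );

my $modern      = Unicode::Collate::Locale->new( locale => 'es' );
my $traditional = Unicode::Collate::Locale->new( locale => 'es__traditional' );

my @modern_order      = $modern->sort(@words);
my @traditional_order = $traditional->sort(@words);

print "es:              @modern_order\n";
print "es__traditional: @traditional_order\n";
```

With the modern locale, chocolate and llama sort letter by letter, so they precede color and luz. With the traditional locale, ch and ll count as letters of their own and move after every other c and l word.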
Use a custom sort which first compares the items based on their lowercased representations (so that all variations of "jane" appear before variations of "jim"), then resolves ties by doing a default ASCII comparison (where uppercase comes before lowercase):
perl -e 'print join "\n", sort { lc $a cmp lc $b || $a cmp $b } qw( jim JANE jane JIM )'
Output:
JANE
jane
JIM
jim
Try:
@list = ("jane","JIM","JANE","jim");
print join("\n", sort { uc $a cmp uc $b or $a cmp $b } @list), "\n";