
Sort hash keys so that uppercase comes just before lowercase

I have a hash and I want to sort it by its keys, with uppercase words appearing just before their lowercase counterparts.

Example:

JANE

jane

JIM

jim


To get the keys in order, apply sort with a custom sort function on the keys of the hash.

my %hash = ( JANE => 1, jane => 2, JIM => 3, jim => 4 );
my @sorted_keys = sort {
    lc $a cmp lc $b
        || $a cmp $b
} keys %hash;

This custom sort function compares strings first as if they were of the same case, and if equal, takes case into account.
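
If the same ordering is needed in several places, the comparison can live in a named sub that is handed to sort. This is only a sketch; the name by_name_upper_first is made up for the example:

```perl
use strict;
use warnings;

# Hypothetical helper (the name is an assumption, not from the question):
# compare case-insensitively first, then let plain string comparison
# break ties, which puts "JANE" ahead of "jane".
sub by_name_upper_first {
    return lc($a) cmp lc($b) || $a cmp $b;
}

my %hash = ( JANE => 1, jane => 2, JIM => 3, jim => 4 );

# Walk the hash in the requested key order.
for my $key ( sort by_name_upper_first keys %hash ) {
    print "$key => $hash{$key}\n";
}
```

The tie-break relies on uppercase ASCII letters having lower code points than lowercase ones, so it is only reliable for ASCII keys.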


Unicode Collation

Although it may seem like overkill for this operation, the standard Unicode::Collate and Unicode::Collate::Locale modules are made for this sort of thing. They also sort non-ASCII data alphabetically, which the normal sort will not do.

use utf8;
@names = qw[ jim JANE jane JIM josé josie Mary María mark ];
@sorts = sort @names;

That gives you the sort order of

JANE JIM Mary María jane jim josie josé mark

which nobody wants. This is much better:

use utf8;
use Unicode::Collate;
@names = qw[ jim JANE jane JIM josé josie Mary María mark ];
$coll = Unicode::Collate->new;
@sorts = $coll->sort(@names);

That gives you

jane JANE jim JIM josé josie María mark Mary

If you want uppercase before lowercase, specify that this way:

use utf8;
use Unicode::Collate;
@names = qw[ jim JANE jane JIM josé josie Mary María mark ];
$coll = Unicode::Collate->new(upper_before_lower => 1);
@sorts = $coll->sort(@names);
print "@sorts\n";

which yields:

JANE jane JIM jim josé josie María mark Mary
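
Applied to the hash from the question, that might look like the sketch below. The helper name collated_keys is invented for illustration; only Unicode::Collate->new, its upper_before_lower option, and its sort method come from the module.

```perl
use strict;
use warnings;
use Unicode::Collate;

# Hypothetical helper (name is an assumption): return a hash's keys in
# collation order, with uppercase sorting just before lowercase.
sub collated_keys {
    my ($hash_ref) = @_;
    my $coll = Unicode::Collate->new( upper_before_lower => 1 );
    return $coll->sort( keys %$hash_ref );
}

my %hash = ( JANE => 1, jane => 2, JIM => 3, jim => 4 );
print "$_ => $hash{$_}\n" for collated_keys( \%hash );
```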

Simple Compares

You can use collation objects’ cmp method on a pair of strings in the customary fashion, like

#!/usr/bin/env perl

use 5.10.1;
use strict;
use autodie; 
use warnings qw[ FATAL all ];
use utf8;
use open qw[ :std IO :utf8 ];
use Unicode::Collate;

my @names = qw[ fum fee fie foe ];
my $coll = Unicode::Collate->new;
my @sorts = $coll->sort(@names);
say "@names => @sorts\n";

for (
      my($a, $b) = splice @names, 0, 2;
      2 == grep {defined} $a, $b;
      ($a, $b) = ($b, shift @names)
    )
{
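    # Plain if/elsif here: given/when is experimental on newer perls,
    # and FATAL warnings would turn that warning into an error.
    my $order = $coll->cmp($a, $b);
    if    ($order < 0) { say "$a < $b" }
    elsif ($order > 0) { say "$a > $b" }
    else               { say "$a = $b" }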
}

which produces:

fum fee fie foe => fee fie foe fum

fum > fee
fee < fie
fie < foe
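
Because cmp returns -1, 0, or +1 just like the built-in operator, it also slots into an ordinary sort block, which is handy when the things being sorted are records rather than bare strings. A small sketch, with made-up records and a made-up helper name:

```perl
use strict;
use warnings;
use Unicode::Collate;

# Hypothetical helper (name is an assumption): sort hash-ref records by
# their "name" field using the collator's cmp method.
sub sort_records_by_name {
    my ($coll, @records) = @_;
    return sort { $coll->cmp( $a->{name}, $b->{name} ) } @records;
}

# Invented sample data, reusing the words from the example above.
my @records = (
    { name => 'fum', id => 1 },
    { name => 'fee', id => 2 },
    { name => 'fie', id => 3 },
    { name => 'foe', id => 4 },
);

my $coll = Unicode::Collate->new;
print "$_->{name} (id $_->{id})\n"
    for sort_records_by_name( $coll, @records );
```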

Fancier Alphabetic Sorts of Unicode

Now consider a list of words like this:

sát sot sät sét sæt ssét sat tot ßet SET set seat ſAT ſet saet SSET

If you run the default sort on that, you get the virtually useless:

SET SSET saet sat seat set sot ssét sát sät sæt sét tot ßet ſAT ſet

And a case-insensitive sort with a case-sensitive tie-break is really no better:

use utf8;
@names = qw[ sát sot sät sét sæt ssét sat tot ßet SET set seat ſAT ſet saet SSET ];
@sorts = sort {
    lc $a  cmp  lc $b
           ||
       $a  cmp  $b
} @names;
print "@sorts\n";

producing the still stupid-and-wrong:

saet sat seat SET set sot SSET ssét sát sät sæt sét tot ßet ſAT ſet

But here it is with a standard Unicode sort:

use utf8;
use Unicode::Collate;
@names = qw[ sát sot sät sét sæt ssét sat tot ßet SET set seat ſAT ſet saet SSET ];
$coll = Unicode::Collate->new(upper_before_lower => 1);
@sorts = $coll->sort(@names);
print "@sorts\n";

producing the ‘correcter’ (read: infinitely preferable) version of:

saet sæt sät sat sát ſAT seat SET set sét ſet sot SSET ssét ßet tot

Locale Sorts

The Unicode::Collate module is pretty fast, so you should not hesitate to use it for your routine character sorting needs. But sometimes that just isn’t enough. That’s because different languages have different ideas of alphabets.

  • Latin (archaic): a b c d e f z h i k l m n o p q r s t v x
  • Latin (classic): a b c d e f g h i k l m n o p q r s t v x y z
  • Spanish (traditional): a b c ch d e f g h i j k l ll m n ñ o p q r rr s t u v w x y z
  • Spanish (recent): a b c d e f g h i j k l m n ñ o p q r s t u v w x y z
  • Catalan: a b c ç d e f g h i j k l m n o p q r s t u v w x y z
  • Welsh: a b c ch d dd e f ff g ng h i l ll m n o p ph r rh s t th u w y
  • Danish: a b c d e f g h i j k l m n o p q r s t u v w x y z æ ø å
  • Icelandic: a á b d ð e é f g h i í j k l m n o ó p r s t u ú v x y ý þ æ ö
  • Old English: a b c d e f ȝ/g h i k l m n o p q r s t v x y z & ⁊ ƿ þ ð æ
  • Middle English: a b c d e f g h i k l m n o p q r ſ/s t v x y z ȝ ƿ þ ð æ
  • Futhorc (transliterated): f u þ o r c ȝ w h n i j eo p x s t b e m l ŋ d œ a æ y ea io cw k st g
  • Greek: α β γ δ ε ζ η θ ι κ λ μ ν ξ ο π ρ σ/ς τ υ φ χ ψ ω
  • Cyrillic: а б в г д е ё ж з и й к л м н о п р с т у ф х ц ч ш щ ъ ы ь э ю я
  • Cherokee: Ꭰ Ꭱ Ꭲ Ꭳ Ꭴ Ꭵ Ꭶ Ꭷ Ꭸ Ꭹ Ꭺ Ꭻ Ꭼ Ꭽ Ꭾ Ꭿ Ꮀ Ꮁ Ꮂ Ꮃ Ꮄ Ꮅ Ꮆ Ꮇ Ꮈ Ꮉ Ꮊ Ꮋ Ꮌ Ꮍ Ꮎ Ꮏ Ꮐ Ꮑ Ꮒ Ꮓ Ꮔ Ꮕ Ꮖ Ꮗ Ꮘ Ꮙ Ꮚ Ꮛ Ꮜ Ꮝ Ꮞ Ꮟ Ꮠ Ꮡ Ꮢ Ꮣ Ꮤ Ꮥ Ꮦ Ꮧ Ꮨ Ꮩ Ꮪ Ꮫ Ꮬ Ꮭ Ꮮ Ꮯ Ꮰ Ꮱ Ꮲ Ꮳ Ꮴ Ꮵ Ꮶ Ꮷ Ꮸ Ꮹ Ꮺ Ꮻ Ꮼ Ꮽ Ꮾ Ꮿ Ᏸ Ᏹ Ᏺ Ᏻ Ᏼ

BTW, those are also good examples why “ever hardcoding [a-z] into your program is always wrong, sometimes.” It’s full of idiotic and even insulting assumptions. Note that all but the last three of these are actually considered Latin alphabets! That’s the same script as we use in English. In representing English text, I’ve variously had to deal with learnèd, Æneid, poſt, Laȝamon, résumé, 1ˢᵗ, MᶜKinley, Van Dijke, Cañon City Colorado, œnology, Dzur, rôle, ⅷ, première, Bjørn, naïve, coöperate, façade, café, Merððyn, archæology, and even tschüß. Repeat the mantra: “Hardcoding [a-z] into your program is always wrong, sometimes.” Just Say No!
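
To make the [a-z] point concrete, here is a small sketch comparing a hardcoded ASCII class with the Unicode \p{Lowercase} property on a few of the words above. The two helper names are invented for the example:

```perl
use strict;
use warnings;
use utf8;                               # this source contains UTF-8 literals
binmode STDOUT, ':encoding(UTF-8)';     # so the words print correctly

# Hypothetical helpers (names are assumptions, for illustration only).
sub ascii_lowercase_word   { return $_[0] =~ /\A[a-z]+\z/          ? 1 : 0 }
sub unicode_lowercase_word { return $_[0] =~ /\A\p{Lowercase}+\z/  ? 1 : 0 }

# Each of these is an all-lowercase word, yet [a-z] rejects every one.
for my $word (qw[ rôle naïve façade café archæology ]) {
    printf "%-12s [a-z]: %-3s  \\p{Lowercase}: %s\n", $word,
        ascii_lowercase_word($word)   ? 'yes' : 'no',
        unicode_lowercase_word($word) ? 'yes' : 'no';
}
```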

The Unicode::Collate::Locale module handles local sorting conventions. Just as English phonebooks and bookshelves have special ways of sorting names so that it doesn’t matter whether you’ve spelt something McBride or MacBride, the German-speaking world sorts their names such that Händel and Haendel are the same. That’s why without diacritics, one must obligatorily write über‑ as ueber‑ and Übermensch as Uebermensch. A locale sort knows to do this:

use utf8;
use Unicode::Collate::Locale;
@names = qw[ sát sot sät sét sæt ssét sat tot ßet SET set seat ſAT ſet saet SSET ];

$coll = Unicode::Collate::Locale->new(
            locale             => 'de__phonebook',
            upper_before_lower => 1,
        );

@sorts = $coll->sort(@names);
print "@sorts\n";

now produces

saet sæt sät sat sát ſAT seat SET set sét ſet sot SSET ssét ßet tot

Se habla castellano

It’s remarkable how different other countries’ locale conventions can be from one’s own. In the Spanish locale ("es"), ñ is a letter that comes after n and before o. That means that the correct sort of

raña rastrillo radio rana rápido ráfaga ranúnculo

is

radio ráfaga rana ranúnculo raña rápido rastrillo

Say those all really fast with a fully-rolled rr to loosen your tongue. :)

The "es__traditional" locale is a little different; historically, chocolate came after color in the Spanish dictionary, unlike the way it works in English. That’s because ch came after c and before d, while ll came after l and before m. That means that this sequence:

lástima laña llama ligante
cidra caliente color chocolate con churros
pero pera Perú perro periglo peste

sorts to

caliente cidra color con chocolate churros 
laña lástima ligante llama 
pera periglo pero perro Perú peste
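
A sketch of how the two Spanish locales could be compared side by side follows. The helper name sort_for_locale is made up; the locale names "es" and "es__traditional" are the ones discussed above, and this assumes the installed Unicode::Collate::Locale ships tailorings for both.

```perl
use strict;
use warnings;
use utf8;                               # this source contains UTF-8 literals
use Unicode::Collate::Locale;
binmode STDOUT, ':encoding(UTF-8)';

# Hypothetical helper (name is an assumption): sort a word list under
# the collation rules of the given locale.
sub sort_for_locale {
    my ($locale, @words) = @_;
    my $coll = Unicode::Collate::Locale->new( locale => $locale );
    return $coll->sort(@words);
}

my @words = qw[ raña rana ranúnculo color chocolate llama ligante ];

# Print the same list under the modern and the traditional rules.
for my $locale (qw[ es es__traditional ]) {
    print "$locale: ", join( ' ', sort_for_locale( $locale, @words ) ), "\n";
}
```

Under "es" only ñ is treated as its own letter, so chocolate still sorts before color there; the ch and ll digraphs only move under "es__traditional".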


Use a custom sort which first compares the items based on their lowercased representations (so that all variations of "jane" appear before variations of "jim"), then resolves ties by doing a default ASCII comparison (where uppercase comes before lowercase):

perl -e 'print join "\n", sort { lc $a cmp lc $b || $a cmp $b } qw( jim JANE jane JIM )'

Output:

JANE
jane
JIM
jim


Try:

my @list = ("jane", "JIM", "JANE", "jim");
print join("\n", sort { uc $a cmp uc $b or $a cmp $b } @list), "\n";