sort upper case just before lowercase key values from a hash

2023-02-10 04:27 Q&A author:

I have a hash and I want to sort it based on the keys, with uppercase words appearing just before the lowercase words.

Example:

JANE

jane

JIM

jim

To get the keys in order, apply sort with a custom sort function on the keys of the hash.

my %hash = ( JANE => 1, jane => 2, JIM => 3, jim => 4 );
my @sorted_keys = sort {
    lc $a cmp lc $b
        || $a cmp $b
} keys %hash;

This custom sort function compares strings first as if they were of the same case, and if equal, takes case into account.
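As a minimal sketch of the same idea, the comparison can be pulled out into a named sort sub and used to walk the hash in order. The sub name by_upper_before_lower is just an illustrative choice, not something from the question.

```perl
use strict;
use warnings;

my %hash = ( JANE => 1, jane => 2, JIM => 3, jim => 4 );

# Compare case-insensitively first; if that ties, fall back to a plain
# string comparison, where ASCII uppercase sorts before lowercase.
sub by_upper_before_lower {
    return lc($a) cmp lc($b) || $a cmp $b;
}

my @sorted_keys = sort by_upper_before_lower keys %hash;

# Walk the hash in the sorted key order.
print "$_ => $hash{$_}\n" for @sorted_keys;
```

This only gives sensible results for ASCII keys; the collation modules discussed next are the better fit for anything else.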

Unicode Collation

Although it may seem like overkill for this operation, the standard Unicode::Collate and Unicode::Collate::Locale modules are made for this sort of thing. They also sort non-ASCII data alphabetically, which the normal sort will not do.

use utf8;
@names = qw[ jim JANE jane JIM josé josie Mary María mark ];
@sorts = sort @names;

That gives you the sort order of

JANE JIM Mary María jane jim josie josé mark

which nobody wants. This is much better:

use utf8;
use Unicode::Collate;
@names = qw[ jim JANE jane JIM josé josie Mary María mark ];
$coll = Unicode::Collate->new;
@sorts = $coll->sort(@names);

That gives you

jane JANE jim JIM josé josie María mark Mary

If you want uppercase before lowercase, specify that this way:

use utf8;
use Unicode::Collate;
@names = qw[ jim JANE jane JIM josé josie Mary María mark ];
$coll = Unicode::Collate->new(upper_before_lower => 1);
@sorts = $coll->sort(@names);
print "@sorts\n";

which yields:

JANE jane JIM jim josé josie María mark Mary
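Applied back to the original question, the same collator can sort the hash keys directly. This is a small sketch that reuses the hash from the first answer as sample data.

```perl
use strict;
use warnings;
use Unicode::Collate;

my %hash = ( JANE => 1, jane => 2, JIM => 3, jim => 4 );

# upper_before_lower swaps the default tie-break so that "JANE"
# is placed just before "jane" instead of just after it.
my $coll = Unicode::Collate->new( upper_before_lower => 1 );
my @sorted_keys = $coll->sort( keys %hash );

print "$_ => $hash{$_}\n" for @sorted_keys;
```

Unlike the lc/cmp version, this keeps working when the keys contain accented or other non-ASCII letters.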

Simple Compares

You can use a collation object’s cmp method on a pair of strings in the customary fashion, like this:

#!/usr/bin/env perl

use 5.10.1;
use strict;
use autodie; 
use warnings qw[ FATAL all ];
use utf8;
use open qw[ :std IO :utf8 ];
use Unicode::Collate;

my @names = qw[ fum fee fie foe ];
my $coll = Unicode::Collate->new;
my @sorts = $coll->sort(@names);
say "@names => @sorts\n";

# Compare each adjacent pair of names with the collator's cmp method,
# which returns -1, 0, or +1 just like the built-in cmp operator.
for my $i (0 .. $#names - 1) {
    my ($x, $y) = @names[$i, $i + 1];
    my $order = $coll->cmp($x, $y);
    if    ($order < 0) { say "$x < $y" }
    elsif ($order > 0) { say "$x > $y" }
    else               { say "$x = $y" }
}

which produces:

fum fee fie foe => fee fie foe fum

fum > fee
fee < fie
fie < foe
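Because cmp behaves like the built-in operator, it also slots into an ordinary sort block, which is handy when you are sorting records rather than bare strings. A rough sketch, where the @people records and their name and id fields are made up for illustration:

```perl
use strict;
use warnings;
use Unicode::Collate;

my $coll = Unicode::Collate->new( upper_before_lower => 1 );

# Hypothetical records; only the "name" field takes part in the ordering.
my @people = (
    { name => 'jim',  id => 4 },
    { name => 'JANE', id => 1 },
    { name => 'jane', id => 2 },
    { name => 'JIM',  id => 3 },
);

# The collator's cmp replaces the built-in cmp inside the sort block.
my @sorted = sort { $coll->cmp( $a->{name}, $b->{name} ) } @people;

my @ids = map { $_->{id} } @sorted;
print "@ids\n";
```

For large lists it may be worth precomputing sort keys rather than calling cmp for every comparison, but that is an optimization to measure before adopting.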

Fancier Alphabetic Sorts of Unicode

Now consider a list of words like this:

sát sot sät sét sæt ssét sat tot ßet SET set seat ſAT ſet saet SSET

If you run the default sort on that, you get the virtually useless:

SET SSET saet sat seat set sot ssét sát sät sæt sét tot ßet ſAT ſet

And a case-insensitive sort with a case-sensitive tie-break is really no better:

use utf8;
@names = qw[ sát sot sät sét sæt ssét sat tot ßet SET set seat ſAT ſet saet SSET ];
@sorts = sort {
    lc $a  cmp  lc $b
           ||
       $a  cmp  $b
} @names;
print "@sorts\n";

producing the still stupid-and-wrong:

saet sat seat SET set sot SSET ssét sát sät sæt sét tot ßet ſAT ſet

But here it is with a standard Unicode sort:

use utf8;
use Unicode::Collate;
@names = qw[ sát sot sät sét sæt ssét sat tot ßet SET set seat ſAT ſet saet SSET ];
$coll = Unicode::Collate->new(upper_before_lower => 1);
@sorts = $coll->sort(@names);
print "@sorts\n";

producing the ‘correcter’ (read: infinitely preferable) version of:

saet sæt sät sat sát ſAT seat SET set sét ſet sot SSET ssét ßet tot

Locale Sorts

The Unicode::Collate module is pretty fast, so you should not hesitate to use it for your routine character sorting needs. But sometimes that just isn’t enough. That’s because different languages have different ideas of alphabets.

Latin (archaic): a b c d e f z h i k l m n o p q r s t v x
Latin (classic): a b c d e f g h i k l m n o p q r s t v x y z
Spanish (traditional): a b c ch d e f g h i j k l ll m n ñ o p q r rr s t u v x w y z
Spanish (recent): a b c d e f g h i j k l m n ñ o p q r s t u v x w y z
Catalan: a b c ç d e f g h i j k l m n o p q r s t u v x w y z
Welsh: a b c ch d dd e f ff g ng h i l ll m n o p ph r rh s t th u w y
Danish: a b c d e f g h i j k l m n o p q r s t u v w x y z æ ø å
Icelandic: a á b d ð e é f g h i í j k l m n o ó p r s t u ú v x y ý þ æ ö
Old English: a b c d e f ȝ/g h i k l m n o p q r s t v x y z & ⁊ ƿ þ ð æ
Middle English: a b c d e f g h i k l m n o p q r ſ/s t v x y z ȝ ƿ þ ð æ
Futhorc (transliterated): f u þ o r c ȝ w h n i j eo p x s t b e m l ŋ d œ a æ y ea io cw k st g
Greek: α β γ δ ε ζ η θ ι κ λ μ ν ξ ο π ρ σ/ς τ υ φ χ ψ ω
Cyrillic: а б в г д е ё ж з и й к л м н о п р с т у ф х ц ч ш щ ъ ы ь э ю я
Cherokee: Ꭰ Ꭱ Ꭲ Ꭳ Ꭴ Ꭵ Ꭶ Ꭷ Ꭸ Ꭹ Ꭺ Ꭻ Ꭼ Ꭽ Ꭾ Ꭿ Ꮀ Ꮁ Ꮂ Ꮃ Ꮄ Ꮅ Ꮆ Ꮇ Ꮈ Ꮉ Ꮊ Ꮋ Ꮌ Ꮍ Ꮎ Ꮏ Ꮐ Ꮑ Ꮒ Ꮓ Ꮔ Ꮕ Ꮖ Ꮗ Ꮘ Ꮙ Ꮚ Ꮛ Ꮜ Ꮝ Ꮞ Ꮟ Ꮠ Ꮡ Ꮢ Ꮣ Ꮤ Ꮥ Ꮦ Ꮧ Ꮨ Ꮩ Ꮪ Ꮫ Ꮬ Ꮭ Ꮮ Ꮯ Ꮰ Ꮱ Ꮲ Ꮳ Ꮴ Ꮵ Ꮶ Ꮷ Ꮸ Ꮹ Ꮺ Ꮻ Ꮼ Ꮽ Ꮾ Ꮿ Ᏸ Ᏹ Ᏺ Ᏻ Ᏼ

BTW, those are also good examples why “ever hardcoding [a-z] into your program is always wrong, sometimes.” It’s full of idiotic and even insulting assumptions. Note that all but the last three of these are actually considered Latin alphabets! That’s the same script as we use in English. In representing English text, I’ve variously had to deal with learnèd, Æneid, poﬅ, Laȝamon, résumé, 1ˢᵗ, MᶜKinley, Van Dĳke, Cañon City Colorado, œnology, ǲur, rôle, ⅷ, première, Bjørn, naïve, coöperate, façade, café, Merððyn, archæology, and even tschüß. Repeat the mantra: “Hardcoding [a-z] into your program is always wrong, sometimes.” Just Say No!

The Unicode::Collate::Locale module handles local sorting conventions. Just as English phonebooks and bookshelves have special ways of sorting names so that it doesn’t matter whether you’ve spelt something McBride or MacBride, the German-speaking world sorts their names such that Händel and Haendel are the same. That’s why without diacritics, one must obligatorily write über‑ as ueber‑ and Übermensch as Uebermensch. A locale sort knows to do this:

use utf8;
use Unicode::Collate::Locale;
@names = qw[ sát sot sät sét sæt ssét sat tot ßet SET set seat ſAT ſet saet SSET ];

$coll = Unicode::Collate::Locale->new(
            locale             => "de__phonebook",
            upper_before_lower => 1,
        );

@sorts = $coll->sort(@names);
print "@sorts\n";

now produces

saet sæt sät sat sát ſAT seat SET set sét ſet sot SSET ssét ßet tot
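To see the Händel/Haendel equivalence directly, here is a small sketch that compares the two spellings at primary strength only. It assumes the level => 1 option (base letters only, ignoring accents and case) and the collator's eq method; the variable names are just for illustration.

```perl
use strict;
use warnings;
use utf8;
use Unicode::Collate;
use Unicode::Collate::Locale;

# level => 1 compares primary weights only: base letters, no accents or case.
my $default   = Unicode::Collate->new( level => 1 );
my $phonebook = Unicode::Collate::Locale->new(
    locale => 'de__phonebook',
    level  => 1,
);

my $same_default   = $default->eq( 'Händel', 'Haendel' )   ? 'same' : 'different';
my $same_phonebook = $phonebook->eq( 'Händel', 'Haendel' ) ? 'same' : 'different';

print "default:   $same_default\n";
print "phonebook: $same_phonebook\n";
```

The default collator treats ä as an accented a, so the two names should differ at the e; the phonebook tailoring treats ä as ae, so they should compare the same.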

Se habla castellano

It’s remarkable how different from one’s own other countries’ locale conventions can be. In the Spanish locale ("es"), ñ is a letter that comes after n and before o. That means that the correct sort of

raña rastrillo radio rana rápido ráfaga ranúnculo

is

radio ráfaga rana ranúnculo raña rápido rastrillo

Say those all really fast with a fully-rolled rr to loosen your tongue. :)
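A quick way to check the ñ rule is to sort the same words with and without the locale. This is only a sketch, and the variable names are made up.

```perl
use strict;
use warnings;
use utf8;
use Unicode::Collate;
use Unicode::Collate::Locale;

my @words = qw( raña rastrillo radio rana rápido ráfaga ranúnculo );

# Default collation: ñ is just an n with a tilde, a secondary difference.
my @default_order = Unicode::Collate->new->sort(@words);

# Spanish collation: ñ is its own letter between n and o.
my @spanish_order = Unicode::Collate::Locale->new( locale => 'es' )->sort(@words);

binmode STDOUT, ':encoding(UTF-8)';
print "default: @default_order\n";
print "es:      @spanish_order\n";
```

Under the default collation raña should land right after rana; under "es" it should move past ranúnculo, since every ran- word sorts before any rañ- word.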

The "es__traditional" locale is a little different; historically, chocolate came after color in the Spanish dictionary, unlike the way it works in English. That’s because ch came after c and before d, while ll came after l and before m. That means that this sequence:

lástima laña llama ligante
cidra caliente color chocolate con churros
pero pera Perú perro periglo peste

sorts to

caliente cidra color con chocolate churros 
laña lástima ligante llama 
pera periglo pero perro Perú peste
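The difference between the two Spanish locales can be sketched by sorting the c/ch words both ways. The locale names are the ones mentioned above; everything else here is illustrative.

```perl
use strict;
use warnings;
use Unicode::Collate::Locale;

my $modern      = Unicode::Collate::Locale->new( locale => 'es' );
my $traditional = Unicode::Collate::Locale->new( locale => 'es__traditional' );

my @words = qw( cidra caliente color chocolate con churros );

# Modern Spanish treats "ch" as plain c followed by h.
my @modern_order = $modern->sort(@words);

# Traditional Spanish treats "ch" as a single letter after c.
my @traditional_order = $traditional->sort(@words);

print "es:              @modern_order\n";
print "es__traditional: @traditional_order\n";
```

With "es" the ch words should fall between caliente and cidra; with "es__traditional" they should move to the end, after con.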

Use a custom sort which first compares the items based on their lowercased representations (so that all variations of "jane" appear before variations of "jim"), then resolves ties by doing a default ASCII comparison (where uppercase comes before lowercase):

perl -e 'print join "\n", sort { lc $a cmp lc $b || $a cmp $b } qw( jim JANE jane JIM )'

Output:

JANE
jane
JIM
jim

Try:

@list = ("jane","JIM","JANE","jim");
# Join with newlines so the sorted names do not run together on output.
print join("\n", sort { uc $a cmp uc $b or $a cmp $b } @list), "\n";
