Case-insensitive storage and unicode compatibility

2023-03-26 15:17 Q&A author:

After I heard of someone at my work using String.toLowerCase() to store case-insensitive codes in a database for searchability, I had an epic fail moment thinking about the number of ways that it can go wrong:

Turkey test (in particular changing locales on the running computer)
Unicode version upgrades - I mean, who knows about this stuff? If I upgrade to Java 7, I have to reindex my data if I'm being case-insensitive?

What technologies are affected by Unicode versions?

Do I need to worry about Oracle or SQL Server (or other vendors) changing their Unicode versions, so that one of my locales no longer produces the same lowercase or uppercase conversion?

How do I manage this? I'm tempted by the "simplicity" of ensuring I use the database conversion, but when there's an upgrade it'll be the same sort of issue.

You do not want to store the lowercase version of a string "for searchability"!!

That is the wrong approach altogether. You are making unjust and incorrect assumptions about how Unicode casing works.

This is why Unicode defines a separate thing called a casefold for a string, distinct from the three different cases (lowercase, titlecase, and uppercase).

Here are ten different examples where you will do the wrong thing if you use the lowercase instead of the casefold:

ORIGINAL        CASEFOLD        LOWERCASE       TITLECASE       UPPERCASE
=============================================================================
eﬃcient         efficient       eﬃcient         Eﬃcient         EFFICIENT
ﬂour            flour           ﬂour            Flour           FLOUR
poſt            post            poſt            Poſt            POST
poﬅ             post            poﬅ             Poﬅ             POST
ﬅop             stop            ﬅop             Stop            STOP
tschüß          tschüss         tschüß          Tschüß          TSCHÜSS
weiß            weiss           weiß            Weiß            WEISS
WEIẞ            weiss           weiß            Weiß            WEIẞ
στιγμας         στιγμασ         στιγμας         Στιγμας         ΣΤΙΓΜΑΣ
ᾲ στο διάολο    ὰι στο διάολο   ᾲ στο διάολο    Ὰͅ Στο Διάολο   ᾺΙ ΣΤΟ ΔΙΆΟΛΟ

And yes, I know the plural of stigma is stigmata not stigmas; I am trying to show the final sigma issue. Both ς and σ are valid lowercase versions of the uppercase sigma, Σ. If you store “just the lowercase”, then you will get the wrong thing.
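If you want to watch the mismatch happen in Java, here is a minimal sketch. The class and method names are made up for illustration, and it assumes the stored "lowercase key" is produced with String.toLowerCase(Locale.ROOT). It only demonstrates the problem: pairs of spellings that share the same full uppercase form still end up with different lowercase keys. It does not implement casefolding, which the standard Java library does not offer as a String method (ICU4J's UCharacter.foldCase is one place to look, not shown here).

```java
import java.util.Locale;

class LowercaseKeyDemo {
    // Hypothetical "search key": the lowercase form the question describes storing.
    static String lowerKey(String s) {
        return s.toLowerCase(Locale.ROOT);
    }

    // True when both spellings share one full uppercase form (so they differ
    // only by case), yet their lowercase keys are different strings.
    static boolean missedByLowercase(String a, String b) {
        boolean sameUpper = a.toUpperCase(Locale.ROOT).equals(b.toUpperCase(Locale.ROOT));
        boolean sameLower = lowerKey(a).equals(lowerKey(b));
        return sameUpper && !sameLower;
    }

    public static void main(String[] args) {
        // Unicode escapes keep this source file ASCII-only.
        String[][] pairs = {
            {"wei\u00DF", "WEISS"},        // sharp s vs. SS
            {"e\uFB03cient", "EFFICIENT"}, // ffi ligature vs. FFI
            {"po\u017Ft", "POST"},         // long s vs. S
            // final sigma vs. medial sigma at the end of the word
            {"\u03C3\u03C4\u03B9\u03B3\u03BC\u03B1\u03C2",
             "\u03C3\u03C4\u03B9\u03B3\u03BC\u03B1\u03C3"},
        };
        for (String[] p : pairs) {
            System.out.println(p[0] + " / " + p[1]
                + " missed by lowercase key: " + missedByLowercase(p[0], p[1]));
        }
    }
}
```

Every pair above reports true: a lookup keyed on the lowercase form would not find one spelling when given the other.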

If you are using Java’s Pattern class, you must specify both CASE_INSENSITIVE and UNICODE_CASE, and you still will not get these right, because while Java uses full casemapping, it uses only simple casefolding. This is a problem.
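A small sketch of what that looks like in practice, using only java.util.regex. The helper name ciMatches is mine. With both flags set, one-to-one case pairs such as capital sigma and final sigma do match, but anything whose casefold is longer than one character (ß, the ﬃ ligature) does not.

```java
import java.util.regex.Pattern;

class PatternFoldDemo {
    // Whole-string match with both flags the text says you must specify.
    // The regex arguments used here contain no metacharacters.
    static boolean ciMatches(String regex, String input) {
        int flags = Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE;
        return Pattern.compile(regex, flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        // Uppercase Greek word vs. its lowercase spelling with a final sigma:
        // a one-to-one mapping, so simple casefolding is enough.
        String upperGreek = "\u03A3\u03A4\u0399\u0393\u039C\u0391\u03A3";
        String lowerGreek = "\u03C3\u03C4\u03B9\u03B3\u03BC\u03B1\u03C2";
        System.out.println(ciMatches(upperGreek, lowerGreek)); // true

        // One-to-many foldings are out of reach for simple casefolding.
        System.out.println(ciMatches("weiss", "wei\u00DF"));         // false
        System.out.println(ciMatches("efficient", "e\uFB03cient"));  // false
    }
}
```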

As for the Turkic languages, yes, it is true that there is a special casefold for Turkic. For example, İstanbul has a Turkic casefold of just istanbul instead of the i̇stanbul that you are supposed to get. Since I am sure those will not look right to you, I’ll spell it out with named characters for the non-ASCII; in plainer terms, "\N{LATIN CAPITAL LETTER I WITH DOT ABOVE}stanbul" has a Turkic casefold of plain "istanbul" rather than the "i\N{COMBINING DOT ABOVE}stanbul" that you normally get. Going the other way, a plain capital "I" has a Turkic casefold of "\N{LATIN SMALL LETTER DOTLESS I}" rather than "i".
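Java has no casefold method on String, but the same locale sensitivity shows up in plain lowercasing, which is exactly what bites you in the Turkey test. A minimal sketch, with a made-up class name and helper; the only assumption is that a Turkish locale is obtained via Locale.forLanguageTag("tr"):

```java
import java.util.Locale;

class TurkicLowercaseDemo {
    static final Locale TURKISH = Locale.forLanguageTag("tr");

    // Thin wrapper so the locale is always passed explicitly.
    static String lower(String s, Locale locale) {
        return s.toLowerCase(locale);
    }

    public static void main(String[] args) {
        // LATIN CAPITAL LETTER I WITH DOT ABOVE followed by "stanbul"
        String name = "\u0130stanbul";

        // Root locale: "i" + COMBINING DOT ABOVE + "stanbul" (9 chars)
        System.out.println(lower(name, Locale.ROOT).length());
        // Turkish locale: plain "istanbul" (8 chars)
        System.out.println(lower(name, TURKISH).length());

        // Plain ASCII input is affected too: Turkish gives a dotless i.
        System.out.println(lower("TITLE", Locale.ROOT).equals("title"));   // true
        System.out.println(lower("TITLE", TURKISH).equals("t\u0131tle"));  // true
    }
}
```

Calling toLowerCase() with no argument uses the default locale of the running JVM, so the stored value can change when the machine's locale does. Passing an explicit locale at least makes the result reproducible.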

Here are a couple more table rows if you’re writing a regression testing suite:

[ "Henry Ⅷ", "henry ⅷ", "henry ⅷ", "Henry Ⅷ", "HENRY Ⅷ",  ],
[ "I Work At Ⓚ",  "i work at ⓚ",  "i work at ⓚ", "I Work At Ⓚ", "I WORK AT Ⓚ", ],
[ "ʀᴀʀᴇ", "ʀᴀʀᴇ", "ʀᴀʀᴇ", "Ʀᴀʀᴇ", "ƦᴀƦᴇ",  ],
[ "Ԧԧ", "ԧԧ", "ԧԧ", "Ԧԧ", "ԦԦ",   ],
[ "
