开发者

Why does SQL Server consider N'㐢㐢㐢㐢' and N'㐢㐢㐢' to be equal?

We are testing our开发者_StackOverflow社区 application for Unicode compatibility and have been selecting random characters outside the Latin character set for testing.

On both Latin and Japanese-collated systems the following equality is true (U+3422):

N'㐢㐢㐢㐢' = N'㐢㐢㐢'

but the following is not (U+30C1):

N'チチチチ' = N'チチチ'

This was discovered when a test case using the first example (using U+3422) violated a unique index. Do we need to be more selective about the characters we use for testing? Obviously we don't know the semantic meaning of the above comparisons. Would this behavior be obvious to a native speaker?


Michael Kaplan has a blog post where he explains how Unicode strings are compared. It all comes down to the point that a string needs to have a weight, if it doesn't it will be considered equal to the empty string.

Sorting it all Out: The jury will give this string no weight

In SQL Server this weight is influenced by the defined collation. Microsoft has added appropriate collations for CJK Unified Ideographs in Windows XP/2003 and SQL Server 2005. This post recommends to use Chinese_Simplified_Pinyin_100_CI_AS or Chinese_Simplified_Stroke_Order_100_CI_AS:

You can always use any binary and binary2 collations although it wouldn't give you Linguistic correct result. For SQL Server 2005, you SHOULD use Chinese_PRC_90_CI_AS or Chinese_PRC_Stoke_90_CI_AS which support surrogate pair comparison (but not linguistic). For SQL Server 2008, you should use Chinese_Simplified_Pinyin_100_CI_AS and Chinese_Simplified_Stroke_Order_100_CI_AS which have better linguistic surrogate comparison. I do suggest you use these collation as your server/database/table collation instead of passing the collation name during comparison.

So the following SQL statement would work as expected:

select * from MyTable where N'' = N'㐀' COLLATE Chinese_Simplified_Stroke_Order_100_CI_AS;

A list of all supported collations can be found in MSDN:

SQL Server 2008 Books Online: Windows Collation Name


That character U+3422 is from the CJK Unified Ideographs tables, which are a relatively obscure (and politically loaded) part of the unicode standard. My guess is that SQL Server simply does not know that part - or perhaps even intentionally does not implement it due to political considerations.

Edit: looks like my guess was wrong and the real problem was that neither Latin nor Japanese collation define weights for that character.


If you look at the Unihan data page, the character appears to only have the "K-Source" field which corresponds to the South Korean government's mappings.

My guess is that MS SQL asks "is this character a Chinese character?" If so then use the Japanese sorting standard, discarding the character if the collation number isn't available - likely a SQL server-specific issue.

I very much doubt it's a political dispute as another poster suggested as the character doesn't even have a Taiwan or Hong Kong encoding mapping.

More technical info: The J-Source (the Japanese sorting order prescribed by the Japanese government) is blank as it probably was only used in classical Korean Hanja (chinese characters which are now only used in some contexts.)

The Japanese government's JIS sorting standards generally sort Kanji characters by the Japanese On reading (which is usually the approximated Chinese pronunciation when the characters were imported into Japan.) But this character probably isn't used much in Japanese and may not even have a Japanese pronunciation to associate with it, so hasn't been added to the data.

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜