What are the consequences of storing a C# string (UTF-16) in a SQL Server nvarchar (UCS-2) column?

2023-02-25 04:07 问答作者：

It seems that SQL Server uses Unicode UCS-2, a 2-byte fixed开发者_JS百科-length character encoding, for nchar/nvarchar fields. Meanwhile, C# uses Unicode UTF-16 encoding for its strings (note: Some people don't consider UCS-2 to be Unicode, but it encodes all the same code points as UTF-16 in the Unicode subset 0-0xFFFF, and as far as SQL Server is concerned, that's the closest thing to "Unicode" it natively supports in terms of character strings.)

While UCS-2 encodes the same basic code points as UTF-16 in the Basic Multilingual Plane (BMP), it doesn't reserve certain bit patterns that UTF-16 does to allow for surrogate pairs.

If I write a C# string to an SQL Server nvarchar (UCS-2) field and read it back, will this always return the same result?

It seems that while UTF-16 is a superset of UCS-2 in the sense that UTF-16 encodes more code points (e.g. above 0xFFFF), it's actually a sub-set of UCS-2 at the 2-byte level, since it's more restrictive.

To answer my own question, I suspect that if my C# string contains code points above 0xFFFF (represented by pairs of characters), these would be stored and retrieved just fine in the database, but if I tried to manipulated them in the database (e.g. perhaps calling TOUPPER or attempting to blank-out every other character), then I could run into some problems displaying the string later... unless SQL Server has functions that acknowledge surrogate pairs and effectively treat nchar/nvarchar strings as UTF-16.

It's all a bit of a fudge really.

First the similarities

The SQL Server nchar/nvarchar/ntext data types store text as a string of 2-byte characters. It doesn't really care what you put in them until you come to do searching and sorting (then it uses the appropriate Unicode collation sequence).
The CLR String data type also stores text as a string of 2-byte Chars. It also doesn't really care what you put in it until you come to do searching and sorting (then it uses the appropriate culture-specific methods).

Now the differences

.NET allows you to access the actual Unicode code points in a CLR string via the StringInfo class.
.NET has tons of support for encoding and decoding text data in a variety of encodings. When converting an arbitrary byte stream to a String, it will always encode the string as UTF-16 (with full multilingual plane support).

In short, as long as you treat both CLR and SQL Server string variables as whole blobs of text, then you can freely assign from one to the other with no loss of information. The underlying storage format is exactly the same, even though the abstractions layered on top are slightly different.

I don't expect that treating the text as UCS-2 would cause many problems.

Case conversions should not be a problem, because (AFAIK) there are no case mappings above the BMP (except the identity mapping, of course!), and, obviously, the surrogate characters are going to map to themselves.

Blanking every other character is just asking for trouble. In reality, doing these sorts of transformations without consideration of the character values is always a dangerous activity. I can see it happening legitimately with string truncations. But if any unmatched surrogates show up in the result, this itself is not a huge problem. Any system that receives such data—and cares—will probably just replace the unmatched surrogate with a replacement character, if it bothers to do anything about it at all.

Obviously, string length is going to be bytes/2 rather than number-of-characters, but number-of-characters is not a very useful value anyway, once you start plumbing the depths of the Unicode code charts. For example, you aren't going to get good results in monospaced display once you leave the ASCII range, because of combining characters, RTL languages, directional control characters, tags, and several kinds of space characters. The high code points are going to be the least of your problems.

Just to be on the safe side, you should probably store your cuneiform texts in a different column than the archeologist's names. :D

UPDATE now with empirical data!

I just ran a test to see what happens with case transformations. I created a string with the English word TEST in uppercase twice—first in Latin script, then in Deseret script. I applied a lower-case transformation to this string in .NET and in SQL Server.

The .NET version correctly lowercased all the letters in both scripts. The SQL Server version only lowercased the Latin characters and left the Deseret characters unchanged. This meets with expectations regarding the handling of UTF-16 verses UCS-2.

using System;
using System.Data.SqlClient;

class Program
{
    static void Main(string[] args)
    {
        string myDeseretText = "TEST\U00010413\U00010407\U0001041D\U00010413";
        string dotNetLower = myDeseretText.ToLower();
        string dbLower = LowercaseInDb(myDeseretText);

        Console.WriteLine("  Original: {0}", DisplayUtf16CodeUnits(myDeseretText));
        Console.WriteLine(".NET Lower: {0}", DisplayUtf16CodeUnits(dotNetLower));
        Console.WriteLine("  DB Lower: {0}", DisplayUtf16CodeUnits(dbLower));
        Console.ReadLine();
    }

    private static string LowercaseInDb(string value)
    {
        SqlConnectionStringBuilder connection = new SqlConnectionStringBuilder();
        connection.DataSource = "(local)";
        connection.IntegratedSecurity = true;
        using (SqlConnection conn = new SqlConnection(connection.ToString()))
        {
            conn.Open();
            string commandText = "SELECT LOWER(@myString) as LoweredString";
            using (SqlCommand comm = new SqlCommand(commandText, conn))
            {
                comm.CommandType = System.Data.CommandType.Text;
                comm.Parameters.Add("@myString", System.Data.SqlDbType.NVarChar, 100);
                comm.Parameters["@myString"].Value = value;
                using (SqlDataReader reader = comm.ExecuteReader())
                {
                    reader.Read();
                    return (string)reader["LoweredString"];
                }
            }
        }
    }

    private static string DisplayUtf16CodeUnits(string value)
    {
        System.Text.StringBuilder sb = new System.Text.StringBuilder();

        foreach (char c in value)
            sb.AppendFormat("{0:X4} ", (int)c);
        return sb.ToString();
    }
}

Output:

  Original: 0054 0045 0053 0054 D801 DC13 D801 DC07 D801 DC1D D801 DC13
.NET Lower: 0074 0065 0073 0074 D801 DC3B D801 DC2F D801 DC45 D801 DC3B
  DB Lower: 0074 0065 0073 0074 D801 DC13 D801 DC07 D801 DC1D D801 DC13

Just in case anyone has a Deseret font installed, here are the actual strings for your enjoyment:

  Original: TEST

继续阅读：character-encodingcodepointsql-serverucs2utf-16


                            更多精彩内容
                            C++获取对象真实地址的方法
C语言中文件操作Error处理的方法示例
C#实现MySQL中Clickhouse建表语句的转换工具
C#中获取程序执行时间的三种方法
C#快速实现Excel多类数字格式的设置

What are the consequences of storing a C# string (UTF-16) in a SQL Server nvarchar (UCS-2) column?

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？