Does an nvarchar always store each character in two bytes?
I had (perhaps naively) assumed that in SQL Server, an nvarchar would store each character in two bytes. But this does not always seem to be the case. The documentation out there suggests that some characters might take more bytes. Does someone have a definitive answer?
Yes, it uses 2 bytes per character. Use DATALENGTH to get the storage size; you can't use LEN, because LEN just counts the characters. See: The differences between LEN and DATALENGTH in SQL Server
DECLARE @n NVARCHAR(10)
DECLARE @v VARCHAR(10)
SELECT @n = N'A', @v = 'A'
SELECT DATALENGTH(@n),DATALENGTH(@v)
---------
2 1
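To make the LEN versus DATALENGTH distinction concrete, here is a small sketch in the same style as the snippet above. The variable name and the sample string are only illustrative.

```sql
DECLARE @s NVARCHAR(10)
SELECT @s = N'Hello'
-- LEN counts characters, DATALENGTH counts bytes
SELECT LEN(@s) AS char_count, DATALENGTH(@s) AS byte_count
-- char_count = 5, byte_count = 10
```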
Here is what Books Online says: http://msdn.microsoft.com/en-us/library/ms186939.aspx
Character data types that are either fixed-length, nchar, or variable-length, nvarchar, Unicode data and use the UNICODE UCS-2 character set.
nchar [ ( n ) ]
Fixed-length Unicode character data of n characters. n must be a value from 1 through 4,000. The storage size is two times n bytes. The ISO synonyms for nchar are national char and national character.
nvarchar [ ( n | max ) ]
Variable-length Unicode character data. n can be a value from 1 through 4,000. max indicates that the maximum storage size is 2^31-1 bytes. The storage size, in bytes, is two times the number of characters entered + 2 bytes. The data entered can be 0 characters in length. The ISO synonyms for nvarchar are national char varying and national character varying.
That said, Unicode compression was introduced in SQL Server 2008 R2, so ASCII characters might be stored as 1 byte when row or page compression is enabled on the table or index. You can read about Unicode compression here:
- SQL Server 2008 R2 : A quick experiment in Unicode Compression
- SQL Server 2008 R2 : Digging deeper into Unicode compression
- More testing of Unicode Compression in SQL Server 2008 R2
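If you want to try Unicode compression yourself, a rough sketch could look like the following. The table name and columns are hypothetical, and whether data compression is available depends on your SQL Server version and edition.

```sql
-- Hypothetical demo table; the name and columns are made up for illustration
CREATE TABLE dbo.UnicodeCompressionDemo (
    id   INT IDENTITY(1,1) PRIMARY KEY,
    name NVARCHAR(100) NOT NULL
)

INSERT INTO dbo.UnicodeCompressionDemo (name) VALUES (N'plain ascii text')

-- Space used before compression
EXEC sp_spaceused N'dbo.UnicodeCompressionDemo'

-- Unicode compression is applied as part of row (or page) compression
ALTER TABLE dbo.UnicodeCompressionDemo REBUILD WITH (DATA_COMPRESSION = ROW)

-- Space used after compression
EXEC sp_spaceused N'dbo.UnicodeCompressionDemo'
```

With a single row the reported sizes will not change visibly, since space is allocated in pages; load a realistic number of rows before comparing. As far as I know, DATALENGTH keeps reporting 2 bytes per character either way, because it returns the logical size of the value rather than its compressed on-disk size.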
Given that there are more than 65536 characters, it should be obvious that a character cannot possibly fit in just two octets (i.e. 16 bits).
SQL Server, like most of Microsoft's products (Windows, .NET, NTFS, …) uses UTF-16 to store text, in which a character takes up either two or four octets, although as @SQLMenace points out, current versions of SQL Server use compression to reduce that.
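A quick way to see the four-octet case is to build a supplementary character from its two UTF-16 code units and check its size. This is only a sketch; U+1F600 is an arbitrary example of a character outside the first 65536 code points.

```sql
DECLARE @s NVARCHAR(10)
-- U+1F600 as a UTF-16 surrogate pair: high surrogate 0xD83D, low surrogate 0xDE00
SELECT @s = NCHAR(0xD83D) + NCHAR(0xDE00)
SELECT DATALENGTH(@s) AS byte_count  -- 4
```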
My understanding of this issue is that SQL Server uses UCS-2 internally, but that its UCS-2 implementation has been hacked to support a subset of characters of up to 4 bytes in the GB18030 character set. These are stored as UCS-2 but are transparently converted by the database engine back to multibyte characters when queried.
Surrogate/supplementary characters aren't fully supported: the implementation of a number of SQL Server string functions doesn't support surrogate pairs, as detailed here.
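The effect on string functions can be checked with LEN. The sketch below assumes SQL Server 2012 or later, where collations with the _SC (supplementary characters) suffix are available; the two collation names are just examples.

```sql
DECLARE @s NVARCHAR(10)
-- One supplementary character (U+1F600) stored as two UTF-16 code units
SELECT @s = NCHAR(0xD83D) + NCHAR(0xDE00)
SELECT
    LEN(@s COLLATE Latin1_General_100_CI_AS)    AS len_without_sc,
    LEN(@s COLLATE Latin1_General_100_CI_AS_SC) AS len_with_sc
```

Under the non-SC collation I would expect LEN to count each surrogate separately and return 2, while the _SC collation should treat the pair as a single character and return 1.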