Why (and how) to split column using master..spt_values?
Subquestioning the answer to question "Split one column into mult开发者_运维知识库iple rows" which I re-wrote here as [ 1 ].
What is the (meaning of) Type = 'P'
and why to use undocumented master..spt_values for splitting a column? What is the benefit of it?
[ 1 ]
CREATE TABLE dbo.Table1
(
Col1 CHAR(1),
Col2 CHAR(1),
Col3 CHAR(1),
Col4 VARCHAR(50)
)
GO
INSERT INTO dbo.Table1 VALUES ('A','B','C','1,2,3')
GO
INSERT INTO dbo.Table1 VALUES ('D','E','F','6,7,8,9')
GO
SELECT
T.col1, RIGHT(LEFT(T.col4,Number-1),
CHARINDEX(',',REVERSE(LEFT(','+T.col4,Number-1))))
FROM
master..spt_values,
table1 T
WHERE
Type = 'P' AND Number BETWEEN 1 AND LEN(T.col4)+1 AND
(SUBSTRING(T.col4,Number,1) = ','
-- OR SUBSTRING(T.col4,Number,1) = '') --this does not work correctly anyway
Related question:
- What is the purpose of system table table master..spt_values and what are the meanings of its values?
Purpose
Why use undocumented master..spt-values
Sybase, and therefore its bastard son MS SQL, provide various features and functions for the product, that is implemented in system procedures (as opposed to the binaries like sqlserver, which are started as a service). These system procedures procedures are written in SQL code and named sp_%.
Except for some secret internals, they have the same limitations and needs as any other SQL code. They are part of the Sybase ASE or SQL Server product. As such, they are not required to document it; and the internal bits cannot be reasonably labelled as "undocumented".
master..spt_values
contains all the various bits and pieces that the said system procedures need, in an SQL table, to produce the various reports. The sp
means system procedure; spt
means tables for system procedures; and of course values
is the content.
Lookup Tables
What is the (meaning of) Type = 'P'
People often describe spt_values
as "de-normalised", but that is the incorrect term. The correct term is folded, or packed. It is 26 or so logical Lookup tables, each beautifully Normalised, folded into one physical table, with a Type
column to differentiate the logical tables.
Now in a normal database, that would be a gross error (just look at the answers for "one lookup table or many"). But in a server catalogue, it is desirable, it replaces 26 physical tables.
"L" stands for for LockType Lookup; "V" stands for DeviceType Lookup (V is short for Device throughout the server); etc. Type "P2" contains bitwise ordinals, for the expansion of bits that are packed into an INT.
A set of consecutive numbers within known bounds, that is available in the form of an SQL table is required, in order to perform a Projection, which many of the system procedures have to do. Type "P" is a list of consecutive numbers between 0 and 2047.
The term Projection is used here as the technically precise meaning, the natural logical sense, not the relational algebra meaning, which is unnatural.
There is therefore just one purpose for spt_values,
to contain 26 folded, otherwise separate, Reference tables, and one Projection table.
Expansion
The ordinary use of spt_values
then, is as an ordinary Lookup or Reference or ENUM
table. First, the Lookup values:
SELECT * -- list Genders
FROM Gender
It is used in the same way that Person has a GenderCode that needs to be expanded (very expanded, these freaky days):
SELECT P.*, -- list Person
G.Name -- expand GenderCode to Name
FROM Person P
JOIN Gender G
ON P.GenderCode = G.GenderCode
Eg. sp_lock
produces a report of active locks, displaying lock types as string names. But master..syslocks
contains lock types as numbers, it does not contain those names; and if it did, it would be a badly denormalised table ! If you execute the query (Sybase ASE code, you will have to convert):
SELECT * -- list LockTypes
FROM master..spt_values
WHERE type = "L"
you will notice 66 LockType numbers and names in the Lookup table. That allows sp_lock
to execute simple code like Person::Gender above:
SELECT spid, -- list Active Locks
DB_NAME(dbid),
OBJECT_NAME(id, dbid),
v.name, -- expand lock name
page,
row
FROM master..syslocks L,
master..spt_values LT
WHERE L.type = LT.number --
AND type = "L" -- LockType Lookup table
ORDER by 1, 2, 3, 4, 5, 6 -- such that perusal is easy
Projection
What is the (meaning of) Type = 'P' ?
What is Projection and how is it used ?
Say, for example, instead of the active locks produced by the query above, you wanted a list of all 66 LockTypes, showing the number of active locks (or Null). You don't need a cursor, or a WHILE
loop. We could Project the LockType Lookup table, through the count of active locks:
SELECT LT.name, -- list LockTypes
[Count] = ( -- with count
SELECT COUNT(*)
FROM master..syslocks
WHERE type = LT.number
)
FROM master..spt_values LT
WHERE type = "L"
There are several methods, that is just one. Another method is to use a Derived Table instead of the Subquery. But you still need the Projection.
That is typically what spt_values
is used for, either Expansion or Projection. Now that you know it is there, you can use it too. It is safe (in the master
database) and used by virtually all the system procedures, which means the system procedures cannot run without it.
for splitting a column?
Ah, you do not understand the "Split one CSV column into multiple rows" code.
Forget about
spt_values
for a moment, and examine that code again. It just needs a list of consecutive numbers, so that in can step through the list of values in the CSV column, byte by byte. The code is activated only for each byte that is a comma, or end-of-string.Where to get a set of consecutive numbers in the form of an SQL table, rather than CREATing one from scratch and INSERTing into it? Why,
master..spt_values
of course. If you know it is there.(You can learn a bit about the internals of ASE or SQL Server, just by reading the code of the system stored procedures.)
Note that any CSV field in one column is a gross Normalisation error, it breaks 2NF (contains repeating values) and 1NF (not atomic). Note, that is not packed or folded, it is a repeating group, it is un-normalised. One of the many negative consequences of such a gross error is, instead of using simple SQL to navigate the repeating group as rows, one has to use complex code to determine and extract the content of the un-normalised CSV field. Here
spt_values P
provides a vector for that complex code, making it easier.
What is the benefit of it?
I think I have answered that. If you did not have it, every system procedure that requires a list of Numbers would have to CREATE a temp table; and INSERT the rows into it; before running its code. Of course, not having to perform those steps, makes the system procedures much faster.
Now, when you need to perform a Projection, eg. calendar dates in the future, or whatever, you can use spt_values
, instead of having to create your own temp table each time (or create your own private permanent table and maintain it).
Many common solutions for splitting strings in TSQL require a list of numbers; in this case, someone is using the spt_values table to provide them. By inspection, this query returns a list of 2048 consecutive integers:
select number from master..spt_values where type = 'P'
I assume that the original query writer used spt_values for the integers because a) it is 'guaranteed' to be available and therefore the query will always work, and b) it avoids a lengthy explanation about alternative ways to get the integers.
The major disadvantage is that the table is undocumented and therefore using it is potentially confusing and also somewhat risky (at least in principle, an upgrade or service pack might change the table data or structure, or even remove it completely).
There are many alternative ways to get a list of numbers without using undocumented tables (I use a table-valued function):
SQL, Auxiliary table of numbers
I know this is an older post but thought I'd add an update. Tally Table and cteTally table based splitters all have a major problem. They use concatenated delimiters and that kills their speed when the elements get wider and the strings get longer.
I've fixed that problem and wrote an article about it which may be found at he following URL. http://www.sqlservercentral.com/articles/Tally+Table/72993/
The new method blows the doors off of all While Loop, Recursive CTE, and XML methods for VARCHAR(8000).
I'll also tell you that a fellow by the name of "Peter" made an improvement even to that code (in the discussion for the article). The article is still interesting and I'll be updating the attachments with Peter's enhancements in the next day or two. Between my major enhancement and the tweek Peter made, I don't believe you'll find a faster T-SQL-Only solution for splitting VARCHAR(8000). I've also solved the problem for this breed of splitters for VARCHAR(MAX) and am in the process of writing an article for that, as well.
Now it is working fine
SELECT T.col1, RIGHT(LEFT(T.col4,Number-1),CHARINDEX(',',REVERSE(LEFT(','+T.col4,Number-1))))
FROM
master..spt_values,
table123 T
WHERE
Type = 'P' AND Number BETWEEN 1 AND LEN(T.col4)+1 AND
(SUBSTRING(T.col4,Number,1) = ','
OR SUBSTRING(T.col4,Number,1) = '')
In SQL Server 2016 now we have new functionality String_Split which we can use to split columns.
For example, here is the script:
DECLARE @String NVARCHAR(1000) = 'abc,def,ghi,jkl,mno,pqr,stu,vw,xyz';
SELECT * FROM STRING_SPLIT(@String,',');
Upon execution of above script, it will return us following result.
Well, that's it. Here is resource by Kathi, where she has compared performance.
精彩评论