Optimizing Numbers Table Creation on SQL Server?
Running SQL Server Express 2008. I created a "numbers table" for some utility functions. Since the table population is part of an automated build, this is taking an inordinate amount of time each time the thing is deployed.
At the risk of "over-optimizing", can anyone comment on how I can make this go as fast as possible? Maybe playing with the index fill f开发者_如何转开发actor or when the PK is created?
IF EXISTS (SELECT * FROM dbo.sysobjects
WHERE id = OBJECT_ID(N'Numbers') AND OBJECTPROPERTY(id, N'IsUserTable') = 1)
BEGIN
drop TABLE [Numbers]
end
CREATE TABLE [Numbers]
(
[Number] [int]
, CONSTRAINT [Index_Numbers] PRIMARY KEY CLUSTERED
(
[number] ASC
) ON [PRIMARY]
)
ON [PRIMARY]
Declare @cnt int
Select @cnt=0
SET NOCOUNT ON
while (@cnt<10000)
BEGIN
INSERT INTO NUMBERS(NUMBER) SELECT @cnt
SELECT @cnt=@cnt+1
end
SQL, Auxiliary table of numbers by Jeff Moden. This one was the fastest
--===== Itzik's CROSS JOINED CTE method
WITH E00(N) AS (SELECT 1 UNION ALL SELECT 1),
E02(N) AS (SELECT 1 FROM E00 a, E00 b),
E04(N) AS (SELECT 1 FROM E02 a, E02 b),
E08(N) AS (SELECT 1 FROM E04 a, E04 b),
E16(N) AS (SELECT 1 FROM E08 a, E08 b),
E32(N) AS (SELECT 1 FROM E16 a, E16 b),
cteTally(N) AS (SELECT ROW_NUMBER() OVER (ORDER BY N) FROM E32)
SELECT N
INTO #Tally4
FROM cteTally
WHERE N <= 1000000;
GO
Got to do some tests wrapping things in functions and using different scenarios to fill a table.
Options
Mikael's proposed solution can be improved by removing the WHERE clause WHERE N <= @count
and adding a TOP clause TOP (@count)
. That reduces it's execution time almost six-fold for me in the following:
This is the fastest algorithm! Now you know it :o)
create function dbo.Numbers2(@count bigint)
RETURNS TABLE RETURN
--===== Itzik's CROSS JOINED CTE method
WITH E00(N) AS (SELECT 1 UNION ALL SELECT 1),
E02(N) AS (SELECT 1 FROM E00 a, E00 b),
E04(N) AS (SELECT 1 FROM E02 a, E02 b),
E08(N) AS (SELECT 1 FROM E04 a, E04 b),
E16(N) AS (SELECT 1 FROM E08 a, E08 b),
E32(N) AS (SELECT 1 FROM E16 a, E16 b),
cteTally(N) AS (SELECT ROW_NUMBER() OVER (ORDER BY N) FROM E32)
SELECT TOP (@count) N
FROM cteTally
Compared to the slower adaptation:
create function dbo.Numbers3(@count bigint)
RETURNS TABLE RETURN
--===== Itzik's CROSS JOINED CTE method
WITH E00(N) AS (SELECT 1 UNION ALL SELECT 1),
E02(N) AS (SELECT 1 FROM E00 a, E00 b),
E04(N) AS (SELECT 1 FROM E02 a, E02 b),
E08(N) AS (SELECT 1 FROM E04 a, E04 b),
E16(N) AS (SELECT 1 FROM E08 a, E08 b),
E32(N) AS (SELECT 1 FROM E16 a, E16 b),
cteTally(N) AS (SELECT ROW_NUMBER() OVER (ORDER BY N) FROM E32)
SELECT N
FROM cteTally
WHERE N <= @count
The solution proposed in https://stackoverflow.com/a/14386663/481812 has similar performance to Numbers2:
create function dbo.Numbers(@count bigint)
RETURNS TABLE RETURN
with byte (n) as (
-- shortened, see the other answer for full code
Tests
- Inserting into a temp heap is fastest (600),
- inserting into temp table with PK clustered index slower (3000) and finally
- filling a physical table with index the slowest (10000). Additional comparison is made with
- filling a temp table with identity PK, as Gregory suggested, from the same CTE style rows generation is pretty much the same (3000) as clustered temp #2 in performance.
- Gregory's proposal was too slow to consider (21000), though. To amaze you,
- the original example would not be realistic to measure with a million rows; 10000 was used instead, which is 100 times less than in the other tests, to bring it down to reasonable time (15000). Assuming it would scale proportionally - best case scenario - it should theoretically complete 1m rows in 25 minutes (1500000).
Ben's suggestion (tried) would probably make a difference if data were not sorted - here we're inserting sorted data, hence always appending, no performance gain building the index after.
Test group 1 - temp heap inserts
CREATE TABLE #Numbers ([Number] [int])
insert #Numbers select n from dbo.Numbers(1000000)
drop table #Numbers
-- 480 - 900 ms, avg 600 ms
CREATE TABLE #Numbers ([Number] [int])
insert #Numbers select n from dbo.Numbers2(1000000)
drop table #Numbers
-- 440 - 800 ms, avg 550 ms
-- just in case you wondered how it scales, 30 times more went in in 14000,
-- Numbers and Numbers2 scale linearly
CREATE TABLE #Numbers ([Number] [int])
insert #Numbers select n from dbo.Numbers3(1000000)
drop table #Numbers
-- 2700 - 4000 ms, avg 3200 ms
Test group 2 - inserting in temp cluster
CREATE TABLE #Numbers ([Number] [int] primary key)
insert #Numbers select n from dbo.Numbers(1000000)
drop table #Numbers
-- 1900 - 4000 ms, avg 3000 ms
CREATE TABLE #Numbers ([Number] [int] primary key)
insert #Numbers select n from dbo.Numbers2(1000000)
drop table #Numbers
-- 1900 - 4000 ms, avg 3000 ms
CREATE TABLE #Numbers ([Number] [int] primary key)
insert #Numbers select n from dbo.Numbers3(1000000)
drop table #Numbers
-- 4000 - 7000 ms, avg 5000 ms
Test group 3 - inserting in physical table cluster
CREATE TABLE PNumbers ([Number] [int] primary key)
insert PNumbers select n from dbo.Numbers(1000000)
drop table PNumbers
-- 7000 - 12000 ms, avg 10000 ms
CREATE TABLE PNumbers ([Number] [int] primary key)
insert PNumbers select n from dbo.Numbers2(1000000)
drop table PNumbers
-- 7000 - 12000 ms, avg 10000 ms
CREATE TABLE PNumbers ([Number] [int] primary key)
insert PNumbers select n from dbo.Numbers3(1000000)
drop table PNumbers
-- 7000 - 12000 ms, avg 10000 ms
Test 4 - filling temp clustered identity column from rows generated
CREATE TABLE #Numbers ([Number] [int] identity primary key, x bit)
INSERT INTO #Numbers(x) select 1 from dbo.Numbers2(1000000)
drop table #Numbers
-- 2000 - 4000 ms, avg 3000 ms
Test 5 - filling temp clustered identity column in a loop
CREATE TABLE #Numbers ([Number] [int] identity primary key)
declare @cnt int = 0
while (@cnt<1000000)
BEGIN
INSERT INTO #Numbers DEFAULT values
set @cnt = @cnt+1
END
drop table #Numbers
-- 20000 - 22000 ms, avg 21000 ms
Conclusions
The Numbers(n) and Numbers2(n) options perform consistently the same (best). Numbers3(n) is a way slower due to the WHERE clause. Other tested scenarios are too slow to consider.
Inserting into an unindexed temp heap is fastest. As soon as you add any index (tried) to the column (either PK clustered or secondary non unique), it's not about the number generation anymore but rather about index inserts (and possibly IO). The worst is inserting into a physical table. This must be because temp tables are not necessarily written to disk (tempdb).
Proposal
According to measurements, you'll gain by
- using CTE + JOIN + ROW_NUMBER + TOP based solution as opposed to loops or recursive CTE,
- not using an index - using a heap, and
- by using a temporary table rather than physical table.
You may avoid the physical or temp table altogether and just use the function, but then in your subsequent queries you'll be chancing the optimizer as you'll be flying without statistics on the numbers. You may then solve this with query hints which is not nice, but works... Though, looking at the execution plan, row numbers at least for dbo.Numbers are estimated correctly.
One simple thing you can do is delay the creation of the index until after all the numbers are populated. In other words, remove it from CREATE TABLE and add a CREATE INDEX statement after your loop. This will prevent the index from being updated on each insert.
I am not sure I agree with the architecture, from a high level, but since you are merely sticking a single field, with numbers from 0 to 9999, you can alter the table definition so it is an auto-increment seeded to start at 0. It is then a matter of inserting the default value into the table. That should be extremely fast.
You can also prototype the table with the proper rows and then insert the 10,000 by "copying over" the rows. This means another table that does "nothing", which may or may not solve the problem.
Just two thoughts.
精彩评论