How can I create a series of months to join sparse data to?
I think this is a pretty common issue, but I don't know what the process is called, so I'll describe it with an example. The concept is that I want to join a sparse dataset to a complete series, such as the days of the week, months of the year, or any ordered set (for example, for ranking). Empty positions in the sparse data will show as NULL alongside the complete series.
Let's say I run the following query in SQL Server to find out monthly sales.
SELECT
YEAR([timestamp]),
MONTH([timestamp]),
COUNT(*)
FROM table1
WHERE YEAR([timestamp]) = YEAR(GETDATE())
GROUP BY
YEAR([timestamp]),
MONTH([timestamp])
ORDER BY
YEAR([timestamp]) DESC,
MONTH([timestamp]) DESC;
If, however, sales only occurred in May and August of this year, for example, then the return result would look like this:
2010 August 1234
2010 May 5678
I want my return result set to look like this:
2010 January
2010 February
2010 March
2010 April
2010 May 1234
2010 June
2010 July
2010 August 5678
2010 September
2010 October
2010 November
2010 December
The only way I know to do this is this:
SELECT
YEAR(GETDATE()),
month_index.month_name,
sales_data.sales
FROM (
SELECT 'January' as month_name, 1 as month_number
UNION
SELECT 'February', 2
UNION
SELECT 'March', 3
UNION
SELECT 'April', 4
UNION
SELECT 'May', 5
UNION
SELECT 'June', 6
UNION
SELECT 'July', 7
UNION
SELECT 'August', 8
UNION
SELECT 'September', 9
UNION
SELECT 'October', 10
UNION
SELECT 'November', 11
UNION
SELECT 'December', 12
) as month_index
LEFT JOIN (
SELECT
YEAR([timestamp]) AS year_name,
MONTH([timestamp]) AS month_name,
COUNT(*) AS sales
FROM table1
WHERE YEAR([timestamp]) = GETDATE()
GROUP BY
YEAR([timestamp]),
MONTH([timestamp])
) AS sales_data
ON month_index.month_name = sales_data.month_name
ORDER BY
month_index.month_number DESC;
开发者_高级运维Is there a better way to create complete date and alphanumeric series to join data onto? And what is this called?
Thanks!
try something like this:
DECLARE @StartDate datetime
,@EndDate datetime
SELECT @StartDate=DATEADD(month,-6,DATEADD(month,DATEDIFF(month,0,GETDATE()),0) )
,@EndDate=GETDATE()
;with AllDates AS
(
SELECT @StartDate AS DateOf
UNION ALL
SELECT DateAdd(month,1,DateOf)
FROM AllDates
WHERE DateOf<@EndDate
)
SELECT * FROM AllDates
output:
DateOf
-----------------------
2009-12-01 00:00:00.000
2010-01-01 00:00:00.000
2010-02-01 00:00:00.000
2010-03-01 00:00:00.000
2010-04-01 00:00:00.000
2010-05-01 00:00:00.000
2010-06-01 00:00:00.000
2010-07-01 00:00:00.000
(8 row(s) affected)
Queries like this are one of the main reasons that many experienced DBAs or database programmers keep a calendar table in their databases.
I'm with KM, that for SQL Server 2005+ you could use a recursive CTE:
WITH months AS (
SELECT DATENAME(mm, '2010-01-01') AS month_name,
MONTH('2010-01-01') AS month_number,
CAST('2010-01-01' AS DATETIME) AS dt
UNION ALL
SELECT DATENAME(mm, DATEADD(mm, 1, m.dt)),
MONTH(DATEADD(mm, 1, m.dt)),
DATEADD(mm, 1, m.dt)
FROM months m
WHERE DATEADD(mm, 1, m.dt) <= '2010-12-01')
SELECT x.month_name,
y.*
FROM months x
LEFT JOIN your_table y ON MONTH(y.date) = x.month_number
After all, the last time KM & chatted about this - we found a recursive CTE to be slightly more efficient than using a numbers table.
I like this approach to build the months table:
SELECT
DATENAME(mm, date_val) AS month_name,
MONTH(date_val) AS month_number,
date_val as dt
FROM (
SELECT DATEADD(mm, number, '2010-01-01') AS date_val
FROM master.dbo.spt_values
WHERE type = 'P'
AND number BETWEEN 0 AND 11
) months
Based on my tests, it's faster than a CTE. I'm running SQL Server 2008 Express.
Here are the test results, using SET STATISTICS IO ON and SET STATISTICS TIME ON
CTE:
(12 row(s) affected)
Table 'Worktable'. Scan count 2, logical reads 73, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
(1 row(s) affected)
SQL Server Execution Times:
CPU time = 15 ms, elapsed time = 64 ms.
SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 0 ms.
Subquery:
(12 row(s) affected)
Table 'spt_values'. Scan count 1, logical reads 2, physical reads 2, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
(1 row(s) affected)
SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 4 ms.
SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 0 ms.
Although your original question is asking what this is called. I don't know a name for it; maybe something like "left outer join against a series?"
One additional part to add: when you join against the months table, or even when do your original query, it's typically recommended to avoid using a function like YEAR([timestamp]) on the left side of your WHERE clause.
So this code:
SELECT
YEAR([timestamp]),
MONTH([timestamp]),
COUNT(*)
FROM table1
WHERE YEAR([timestamp]) = YEAR(GETDATE())
GROUP BY
YEAR([timestamp]),
MONTH([timestamp])
...will cause an index scan (assuming timestamp is indexed) because of the YEAR([timestamp]) must be evaluted for every row. On a 1m+ row table, this will mean poor performance.
So you will typically see a recommendation like this instead:
SELECT
YEAR([timestamp]),
MONTH([timestamp]),
COUNT(*)
FROM #table1
WHERE [timestamp] >= DATEADD(YY, DATEDIFF(YY, 0, GETDATE()), 0) -- First day of this year
AND [timestamp] < DATEADD(YY, DATEDIFF(YY, 0, GETDATE()) + 1, 0) -- First day of next year
GROUP BY
YEAR([timestamp]),
MONTH([timestamp])
This will use an index seek (again, assuming timestamp is an indexed column) and result in fewer logical reads and thus a faster response. This can be confirmed by checking the execution plan.
how about making a new table called Months: then populating it with data you can join to?
精彩评论