Statistics and Cardinality Estimation - Why am I seeing this result?
I came across this little issue while trying to solve a more complex problem, and I'm at the end of my rope trying to figure out the optimizer. So, let's say I have a table called `MyTable` that can be defined like this:
CREATE TABLE MyTable (
GroupClosuresID int identity(1,1) not null,
SiteID int not null,
DeleteDateTime datetime null
, CONSTRAINT PK_MyTable PRIMARY KEY (GroupClosuresID, SiteID))
This table has 286,685 rows in it, and running DBCC SHOW_STATISTICS('MyTable','PK_MyTable') yields:
Name        Updated             Rows    Rows Sampled  Steps  Density   Average key length  String Index  Filter Expression  Unfiltered Rows
----------  ------------------  ------  ------------  -----  --------  ------------------  ------------  -----------------  ---------------
PK_MyTable  Aug 10 2011 1:00PM  286685  286685        18     0.931986  8                   NO            NULL               286685

(1 row(s) affected)
All density   Average Length  Columns
------------  --------------  -----------------------
3.743145E-06  4               GroupClosuresID
3.488149E-06  8               GroupClosuresID, SiteID

(2 row(s) affected)
RANGE_HI_KEY RANGE_ROWS EQ_ROWS DISTINCT_RANGE_ROWS AVG_RANGE_ROWS
------------ ------------- ------------- -------------------- --------------
1 0 8 0 1
129 1002 7 127 7.889764
242 826 6 112 7.375
531 2010 6 288 6.979167
717 1108 5 185 5.989189
889 822 4 171 4.807017
1401 2044 4 511 4
1763 1101 3 361 3.049861
14207 24780 1 12443 1.991481
81759 67071 1 67071 1
114457 31743 1 31743 1
117209 2047 1 2047 1
179109 61439 1 61439 1
181169 1535 1 1535 1
229410 47615 1 47615 1
235846 2047 1 2047 1
275456 39442 1 39442 1
275457 0 1 0 1
Now I run a query on this table with no additional indexes or statistics having been created.
SELECT GroupClosuresID FROM MyTable WHERE SiteID = 1397 AND DeleteDateTime IS NULL
Two new statistics objects now appear: one for the SiteID column and the other for the DeleteDateTime column. Here they are, respectively (note: some non-relevant information has been excluded):
Name                       Updated             Rows    Rows Sampled  Steps  Density     Average key length  String Index  Filter Expression  Unfiltered Rows
-------------------------  ------------------  ------  ------------  -----  ----------  ------------------  ------------  -----------------  ---------------
_WA_Sys_00000002_7B0C223C  Aug 10 2011 1:15PM  286685  216605        200    0.03384706  4                   NO            NULL               286685

(1 row(s) affected)
All density   Average Length  Columns
------------  --------------  -------
0.0007380074  4               SiteID

(1 row(s) affected)
RANGE_HI_KEY RANGE_ROWS EQ_ROWS DISTINCT_RANGE_ROWS AVG_RANGE_ROWS
------------ ------------- ------------- -------------------- --------------
.
.
.
1397 59.42782 16005.02 5 11.83174
.
.
.
Name                       Updated             Rows    Rows Sampled  Steps  Density    Average key length  String Index  Filter Expression  Unfiltered Rows
-------------------------  ------------------  ------  ------------  -----  ---------  ------------------  ------------  -----------------  ---------------
_WA_Sys_00000006_7B0C223C  Aug 10 2011 1:15PM  286685  216605        201    0.7447883  0.8335911           NO            NULL               286685

(1 row(s) affected)
All density   Average Length  Columns
------------  --------------  --------------
0.0001065871  0.8335911       DeleteDateTime

(1 row(s) affected)
RANGE_HI_KEY RANGE_ROWS EQ_ROWS DISTINCT_RANGE_ROWS AVG_RANGE_ROWS
----------------------- ------------- ------------- -------------------- --------------
NULL 0 255827 0 1
.
.
.
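(Incidentally, a quick way to find these auto-created _WA_Sys_* statistics objects is to query the catalog views. A minimal sketch, assuming SQL Server 2005+ and that MyTable lives in the current database:)

SELECT s.name AS stats_name, s.auto_created, c.name AS column_name
FROM sys.stats AS s
JOIN sys.stats_columns AS sc
    ON s.object_id = sc.object_id AND s.stats_id = sc.stats_id
JOIN sys.columns AS c
    ON sc.object_id = c.object_id AND sc.column_id = c.column_id
WHERE s.object_id = OBJECT_ID('MyTable')
  AND s.auto_created = 1;  -- auto-created stats get the _WA_Sys_* naming pattern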
The execution plan generated for the query I ran above gives me no surprises. It consists of a simple Clustered Index Scan with 14282.3 estimated rows and 15676 actual rows. From what I've learned about statistics and cost estimation, using the two histograms above we can multiply the selectivity of SiteID (16005.02 / 286685) times the selectivity of DeleteDateTime (255827 / 286685) to get a composite selectivity of 0.0498187307480119. Multiplying that times the total number of rows (286685) gives us the exact same thing the optimizer did: 14282.3.
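To make that arithmetic concrete, here is the same calculation as a runnable T-SQL check (the EQ_ROWS values and row count come straight from the histograms above):

SELECT (CAST(16005.02 AS float) / 286685)   -- selectivity of SiteID = 1397 (EQ_ROWS / total rows)
     * (CAST(255827 AS float) / 286685)     -- selectivity of DeleteDateTime IS NULL
     * 286685 AS EstimatedRows;             -- = 14282.3, matching the plan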
But here is where I get confused. I create an index with CREATE INDEX IX_MyTable ON MyTable (SiteID, DeleteDateTime), which creates its own statistics object:
Name        Updated             Rows    Rows Sampled  Steps  Density     Average key length  String Index  Filter Expression  Unfiltered Rows
----------  ------------------  ------  ------------  -----  ----------  ------------------  ------------  -----------------  ---------------
IX_MyTable  Aug 10 2011 1:41PM  286685  286685        200    0.02749305  8.822645            NO            NULL               286685

(1 row(s) affected)
All density   Average Length  Columns
------------  --------------  ---------------------------------------
0.0007107321  4               SiteID
7.42611E-05   4.822645        SiteID, DeleteDateTime
3.488149E-06  8.822645        SiteID, DeleteDateTime, GroupClosuresID

(3 row(s) affected)
RANGE_HI_KEY RANGE_ROWS EQ_ROWS DISTINCT_RANGE_ROWS AVG_RANGE_ROWS
------------ ------------- ------------- -------------------- --------------
.
.
.
1397 504 15686 12 42
.
.
.
When I run the same query as before (SELECT GroupClosuresID FROM MyTable WHERE SiteID = 1397 AND DeleteDateTime IS NULL), I still get 15676 rows returned, but my estimated row count is now 181.82.
I've tried manipulating the numbers to figure out where this estimate comes from, but I just can't get it. I have to assume it is related to the density values for IX_MyTable.
Any help would be greatly appreciated. Thanks!!
EDIT: Here is the execution plan for that last query execution.
This one took some digging!
It's the product of:
- the NULL density in your date field (from your first set of stats): 255827 / 286685 = 0.892363
- ...times the density of the first field (SiteID) in your new index: 0.0007107321
The formula is:
0.0007107321 * 286685 = 203.7562   -- est. rows for your SiteID value, assuming an even distribution of values
255827 / 286685 = 0.892363         -- probability of a NULL across all rows
203.7562 * 0.892363 = 181.8245
I'm guessing that since the row count in this instance doesn't actually affect anything, the optimizer took the easiest route and just multiplied the probabilities together.
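Expressed as a runnable check (the density comes from the IX_MyTable density vector, the NULL count from the DeleteDateTime histogram above):

SELECT CAST(0.0007107321 AS float)        -- "All density" of SiteID in IX_MyTable
     * 286685                             -- total rows: ~203.7562 rows per distinct SiteID
     * (CAST(255827 AS float) / 286685)   -- probability that DeleteDateTime IS NULL
       AS EstimatedRows;                  -- = 181.8245, matching the plan's estimate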
Just wanted to write about it, but JNK was first.
Basically, the hash function now calculates results for two columns, and the hash-function result for SiteID = 1397 AND DeleteDateTime IS NULL matches approximately 181 rows.
http://en.wikipedia.org/wiki/Hash_table#Hash_function