Fastest Way to Count Distinct Values in a Column, Including NULL Values

2023-04-03 20:02 问答作者：

The Transact-Sql Count Distinct operation counts all non-null values in a column. I need to count the number of distinct values per column in a set of tables, including null values (so if there is a null in the column, the result should be (Select Count(Distinct COLNAME) From TABLE) + 1.

This is going to be repeated over every column in every table in the DB. Includes hundreds of tables, some of which have over 1M rows. Because this needs to be done over every single column, adding Indexes for every column is not a good option.

This will be done as part of an ASP.net site, so integration with code logic is also ok (i.e.: this doesn't have to be completed as part of one query, though if that can be done with good performance, then even better).

What is the most efficient way to do this?

Update After Testing

I tested the different methods from the answers given on a good representative table. The table has 3.2 million records, dozens of columns (a few with indexes, most without). One column has 3.2 million unique values. Other columns range from all Null (one value) to a max of 40K unique values. For each method I performed four tests (with multiple attempts at each, averaging the results): 20 columns at one time, 5 columns at one time, 1 column with many values (3.2M) and 1 column with a small number of values (167). Here are the results, in order of fastest to slowest

Count/GroupBy (Cheran)
CountDistinct+SubQuery (Ellis)
dense_rank (Eriksson)
Count+Max (Andriy)

Testing Results (in seconds):

   开发者_运维问答Method          20_Columns   5_Columns   1_Column (Large)   1_Column (Small)
1) Count/GroupBy      10.8          4.8            2.8               0.14       
2) CountDistinct      12.4          4.8            3                 0.7         
3) dense_rank        226           30              6                 4.33 
4) Count+Max          98.5         44             16                12.5

Notes:

Interestingly enough, the two methods that were fastest (by far, with only a small difference in between then) were both methods that submitted separate queries for each column (and in the case of result #2, the query included a subquery, so there were really two queries submitted per column). Perhaps because the gains that would be achieved by limiting the number of table scans is small in comparison to the performance hit taken in terms of memory requirements (just a guess).
Though the dense_rank method is definitely the most elegant, it seems that it doesn't scale well (see the result for 20 columns, which is by far the worst of the four methods), and even on a small scale just cannot compete with the performance of Count.

Thanks for the help and suggestions!

SELECT COUNT(*)
FROM (SELECT ColumnName
      FROM TableName
      GROUP BY ColumnName) AS s;

GROUP BY selects distinct values including NULL. COUNT(*) will include NULLs, as opposed to COUNT(ColumnName), which ignores NULLs.

I think you should try to keep the number of table scans down and count all columns in one table in one go. Something like this could be worth trying.

;with C as
(
  select dense_rank() over(order by Col1) as dnCol1,
         dense_rank() over(order by Col2) as dnCol2
  from YourTable
)
select max(dnCol1) as CountCol1,
       max(dnCol2) as CountCol2
from C

Test the query at SE-Data

A development on OP's own solution:

SELECT
  COUNT(DISTINCT acolumn) + MAX(CASE WHEN acolumn IS NULL THEN 1 ELSE 0 END)
FROM atable

Run one query that Counts the number of Distinct values and adds 1 if there are any NULLs in the column (using a subquery)

Select Count(Distinct COLUMNNAME) +
       Case When Exists 
                 (Select * from TABLENAME Where COLUMNNAME is Null) 
            Then 1 Else 0 End
From TABLENAME

You can try:

count(
distinct coalesce(
    your_table.column_1, your_table.column_2
    -- cast them if you want replace value from column are not same type
    )
) as COUNT_TEST

Function coalesce help you combine two columns with replace not null values.

I used this in mine case and success with correctly result.

Not sure this would be the fastest but might be worth testing. Use case to give null a value. Clearly you would need to select a value for null that would not occur in the real data. According to the query plan this would be a dead heat with the count(*) (group by) solution proposed by Cheran S.

    SELECT 
        COUNT( distinct
                 (case when [testNull] is null then 'dbNullValue' else [testNull] end)
             )
    FROM [test].[dbo].[testNullVal]

With this approach can also count more than one column

    SELECT 
        COUNT( distinct
                 (case when [testNull1] is null then 'dbNullValue' else [testNull1] end)
             ),
        COUNT( distinct
                 (case when [testNull2] is null then 'dbNullValue' else [testNull2] end)
             )
    FROM [test].[dbo].[testNullVal]

继续阅读：asp.net database sql sql-server tsql

Fastest Way to Count Distinct Values in a Column, Including NULL Values

Update After Testing

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？

Update After Testing

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集 河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？