SQL Distinct keyword bogs down performance?

2023-02-17 23:17 问答作者：

I have received a SQL query that makes use of the distinct keyword. When I tried running the query it took at least a minute to join two tables with hundreds of thousands of records and actually return something.

I then took out the distinction and it came back in 0.2 seconds. Does the distinct keyword really make things that bad?

Here's the query:

SELECT DISTINCT
    c.username, o.orderno, o.totalcredits, o.totalrefunds,
    o.recstatus,开发者_开发技巧 o.reason 
FROM management.contacts c 
    JOIN management.orders o ON (c.custID = o.custID)
WHERE o.recDate > to_date('2010-01-01', 'YYYY/MM/DD')

Yes, as using DISTINCT will (sometimes according to a comment) cause results to be ordered. Sorting hundreds of records takes time.

Try GROUP BY all your columns, it can sometimes lead the query optimiser to choose a more efficient algorithm (at least with Oracle I noticed significant performance gain).

Distinct always sets off alarm bells to me - it usually signifies a bad table design or a developer who's unsure of themselves. It is used to remove duplicate rows, but if the joins are correct, it should rarely be needed. And yes there is a large cost to using it.

What's the primary key of the orders table? Assuming it's orderno then that should be sufficient to guarantee no duplicates. If it's something else, then you may need to do a bit more with the query, but you should make it a goal to remove those distincts! ;-)

Also you mentioned the query was taking a while to run when you were checking the number of rows - it can often be quicker to wrap the entire query in "select count(*) from ( )" especially if you're getting large quantities of rows returned. Just while you're testing obviously. ;-)

Finally, make sure you have indexed the custID on the orders table (and maybe recDate too).

Purpose of DISTINCT is to prune duplicate records from the result set for all the selected columns.

If any of the selected columns is unique after join you can drop DISTINCT.
If you don't know that, but you know that the combination of the values of selected column is unique, you can drop DISTINCT.

Actually, normally, with properly designed databases you rarely need DISTINCT and in those cases that you do it is (?) obvious that you need it. RDBMS however can not leave it to chance and must actually build an indexing structure to establish it.

Normally you find DISTINCT all over the place when people are not sure about JOINs and relationships between tables.

Also, in classes when talking about pure relational databases where the result should be a proper set (with no repeating elements = records) you can find it quite common for people to stick DISTINCT in to guarantee this property for purposes of theoretical correctness. Sometimes this creeps in into production systems.

You can try to make a group by like this:

  SELECT c.username, 
         o.orderno, 
         o.totalcredits, 
         o.totalrefunds,
         o.recstatus, 
         o.reason
    FROM management.contacts c,
         management.orders o
   WHERE c.custID = o.custID
     AND o.recDate > to_date('2010-01-01', 'YYYY-MM-DD')
GROUP BY c.username, 
         o.orderno, 
         o.totalcredits, 
         o.totalrefunds,
         o.recstatus, 
         o.reason

Also verify if you have index on o.recDate

继续阅读：plsql sql

SQL Distinct keyword bogs down performance?

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

王昌瑞《潜梦追凶》剧组庆生新锐演员未来可期？

Is it allowed to ask users to enter credit card details for own payment method?

Escaping "<" in Perl-generated XML

imessage会显示已读吗？

微信重新建群怎么建？