How to optimize GROUP BY in a table with a huge number of records

2023-02-09 22:10 Q&A author:

I have a Person table with a huge number of records (about 16 million), and I need to find all persons with the same last name, first letter of first name, and birth year. In other words, I want to show presumed duplicate persons in the UI so that users can analyze them and decide whether they are the same person or not.

Here is the query I wrote:

SELECT * 
FROM Person INNER JOIN
(
    SELECT SUBSTRING(firstName, 1, 1) firstNameF,lastName,YEAR(birthDate) birthYear
    FROM Person
    GROUP BY SUBSTRING(firstName, 1,1),lastName,YEAR(birthDate)
    HAVING count(*) > 1
) as dupPersons 
ON  SUBSTRING(Person.firstName, 1, 1) = dupPersons.firstNameF
    AND Person.lastName = dupPersons.lastName
    AND YEAR(Person.birthDate) = dupPersons.birthYear
ORDER BY Person.lastName, Person.firstName

As I am not an SQL expert, I want to know: is this a good way to do it? Is there a more optimized way?

EDIT

Note that I can cut the data down, which may help with optimization.

For example, if I cut the data to 2, it could return two groups of persons:

Johan Smith |
Jane Smith  | have the same last name and first-name initial
Jack Smith  |

Mark Tween  | have the same last name and first-name initial
Mac Tween   |

If the performance of the GROUP BY is not adequate, you could try using an INNER JOIN:

SELECT  *
FROM    Person p1
        INNER JOIN Person p2 ON p2.PersonID > p1.PersonID
WHERE   SUBSTRING(p2.Firstname, 1, 1) = SUBSTRING(p1.Firstname, 1, 1) 
        AND p2.LastName = p1.LastName
        AND YEAR(p2.BirthDate) = YEAR(p1.BirthDate)
ORDER BY
        p1.LastName, p1.FirstName

Well, if you're not an expert, the query you wrote says to me that you're at least pretty competent. When we look at whether a query is "optimized", there are two immediate parts to that: 1. Whether the query on its own has something notably wrong with it: a bad join, keyword misuse, exploding result set size, superstitions about NOT IN, etc. 2. The context that the query operates within: DB specifics, task specifics, etc.

Your query passes #1, no problem. I would have written it differently: aliased the Person table, used LEFT(P.FirstName, 1) instead of SUBSTRING, and used a CTE (WITH clause) instead of a subquery. But these aren't optimization issues. Maybe I'd use WITH(READUNCOMMITTED) if the results weren't sensitive to dirty reads. Without any further context, your query doesn't look like a bomb waiting to go off.
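For illustration, here is a rough sketch of what those stylistic changes might look like when applied to the query from the question. It is only one reading of the suggestions above (table alias, LEFT, CTE); the CTE name DupKeys and the column alias FirstInitial are made up for this example, and it should return the same rows as the original.

```sql
-- Sketch only: same logic as the original query, restyled.
-- DupKeys and FirstInitial are illustrative names, not from the question.
WITH DupKeys AS
(
    SELECT  LEFT(P.FirstName, 1) AS FirstInitial,
            P.LastName,
            YEAR(P.BirthDate)    AS BirthYear
    FROM    Person AS P
    GROUP BY LEFT(P.FirstName, 1), P.LastName, YEAR(P.BirthDate)
    HAVING  COUNT(*) > 1            -- keep only keys shared by 2+ rows
)
SELECT  P.*
FROM    Person AS P
        INNER JOIN DupKeys AS D
            ON  LEFT(P.FirstName, 1) = D.FirstInitial
            AND P.LastName           = D.LastName
            AND YEAR(P.BirthDate)    = D.BirthYear
ORDER BY P.LastName, P.FirstName;
```

As noted, this is a readability change; the optimizer will most likely treat it much like the subquery version.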

As for #2, you should probably switch to specifics, like "I have to run this every week. It takes 17 minutes. How can I get it down to under a minute?" Then people will ask you what your plan looks like, what indexes you have, etc.

Things I'd want to know:

How long does it already take to run?
What's your runtime window? (User & app tolerance for query time.)
Is this run once a day? Week? Month? Quarter?
Do you have the permission to create tables, change current tables, or alter indexes?
Maybe based on having run it, what's the ratio of duplicates you're expecting to find? 5%? 90%?
How stable is the matching criteria requirement?

Example scenario: if this were a run-on-command feature that will be in my app indefinitely and gets run weekly, with 10% or fewer of the records expected to be duplicates, with the ability to change the DB however I'd like, with firm (not fluctuating) duplicate matching criteria, and I wanted to cut it from 90s to 5s, I'd create a dedicated BirthYear column (possibly a persisted computed column off of BirthDate) and an index on LastName ASC, BirthYear ASC, FirstName ASC. If too many of those stipulations change, I might go in a different direction entirely.
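A minimal sketch of that setup could look like the following. It assumes BirthDate is a date or datetime column (if it is stored as text, YEAR() would need an implicit conversion and the column probably could not be persisted), and the index name is just a placeholder.

```sql
-- Sketch only: assumes Person.BirthDate is a date/datetime column.
-- Adds a persisted computed column holding the birth year.
ALTER TABLE Person
    ADD BirthYear AS YEAR(BirthDate) PERSISTED;
GO

-- Index in the column order suggested above.
-- IX_Person_LastName_BirthYear_FirstName is a placeholder name.
CREATE NONCLUSTERED INDEX IX_Person_LastName_BirthYear_FirstName
    ON Person (LastName ASC, BirthYear ASC, FirstName ASC);
GO
```

GO is the batch separator used by SSMS and sqlcmd, so the index is created in a separate batch after the column exists. The duplicate query would then compare BirthYear directly instead of calling YEAR(BirthDate); whether that actually gets you to the target time is something to check against the execution plan.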

You can try something like this and see the difference on the execution plans, or benchmark the results on performance:

;WITH DupPersons AS
(
    SELECT *, COUNT(1) OVER(PARTITION BY SUBSTRING(firstName, 1, 1), lastName, YEAR(birthDate)) Quant
    FROM Person
)

SELECT *
FROM DupPersons
WHERE Quant > 1
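One simple way to benchmark the variants is to turn on the session statistics and run each query in turn. This is a sketch, not a full benchmarking method: timings will vary with caching, so it is worth running each variant more than once.

```sql
-- Report CPU/elapsed time and logical reads for each statement in this session.
SET STATISTICS TIME ON;
SET STATISTICS IO ON;

-- Variant under test: the windowed COUNT from above.
-- Swap in the GROUP BY or self-join version to compare.
WITH DupPersons AS
(
    SELECT  *,
            COUNT(1) OVER (PARTITION BY SUBSTRING(firstName, 1, 1),
                                        lastName,
                                        YEAR(birthDate)) AS Quant
    FROM    Person
)
SELECT  *
FROM    DupPersons
WHERE   Quant > 1;

SET STATISTICS TIME OFF;
SET STATISTICS IO OFF;
```

The numbers appear in the Messages tab in SSMS. Comparing logical reads is usually more stable than comparing elapsed time.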

Of course, it would also help to know your table definition and the indexes you created. It may help to add a computed column for the year of birthDate and create an index on it, and to do the same for the first letter of firstName.
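As a rough sketch of that idea, the following adds both computed columns and then groups and joins on them directly. The column names FirstInitial and BirthYear and the index name are hypothetical, a single composite index is used here rather than one index per column (that is my choice, not something stated above), and birthDate is assumed to be a date or datetime column.

```sql
-- Sketch only: column and index names are illustrative.
-- COL_LENGTH returns NULL when the column does not exist yet.
IF COL_LENGTH('Person', 'FirstInitial') IS NULL
    ALTER TABLE Person ADD FirstInitial AS LEFT(firstName, 1) PERSISTED;
IF COL_LENGTH('Person', 'BirthYear') IS NULL
    ALTER TABLE Person ADD BirthYear AS YEAR(birthDate) PERSISTED;
GO

-- One composite index covering the whole duplicate key.
CREATE NONCLUSTERED INDEX IX_Person_DupKey
    ON Person (lastName, FirstInitial, BirthYear);
GO

-- Same duplicate search, now without function calls in GROUP BY or ON.
SELECT  P.*
FROM    Person AS P
        INNER JOIN
        (
            SELECT  lastName, FirstInitial, BirthYear
            FROM    Person
            GROUP BY lastName, FirstInitial, BirthYear
            HAVING  COUNT(*) > 1
        ) AS D
            ON  P.lastName     = D.lastName
            AND P.FirstInitial = D.FirstInitial
            AND P.BirthYear    = D.BirthYear
ORDER BY P.lastName, P.firstName;
```

With the key columns stored and indexed, the grouping can potentially be satisfied from the index alone, but check the execution plan to confirm it is actually used.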

Tags: sql sql-server-2008
