开发者

How to optimize group by in table with huge number of records

I have a Person table with huge number of records(for about 16 million), and have a requirement to find all persons, with same lastname, first letter of firstname and birthyear, in other worlds I want to show assuming duplicate persons in UI for users to analyze and decide are there a same pe开发者_StackOverflow社区rson or not.

Here is the query I write

SELECT * 
FROM Person INNER JOIN
(
    SELECT SUBSTRING(firstName, 1, 1) firstNameF,lastName,YEAR(birthDate) birthYear
    FROM Person
    GROUP BY SUBSTRING(firstName, 1,1),lastName,YEAR(birthDate)
    HAVING count(*) > 1
) as dupPersons 
ON SUBSTRING(Person.firstName,1,1) = dupPersons.firstNameF and Person.lastName = dupPersons.lastName and YEAR(Person.birthDate) = dupPersons.birthYear
order by Person.lastName,Person.firstName

but as I am not SQL expert, want too know, is this good way to do that? are there more optimized way?

EDIT

Note that I can cut data, which can have contribution in optimization

for example if I want to cut data by 2 it could return two persons

Johan Smith |
Jane Smith  | have same lastname and first name inita 
Jack Smith  |

Mark Tween  | have same lastname and first name inita 
Mac Tween   |


If the performance using a GROUP BY is not adequate, You could try using an INNER JOIN

SELECT  *
FROM    Person p1
        INNER JOIN Person p2 ON p2.PersonID > p1.PersonID
WHERE   SUBSTRING(p2.Firstname, 1, 1) = SUBSTRING(p1.Firstname, 1, 1) 
        AND p2.LastName = p1.LastName
        AND YEAR(p2.BirthDate) = YEAR(p1.BirthDate)
ORDER BY
        p1.LastName, p1.FirstName        


Well, if you're not an expert, the query you wrote says to me that you're at least pretty competent. When we look at whether a query is "optimized", there are two immediate parts to that: 1. The query just on its own has something notably wrong with it - a bad join, keyword misuse, exploding result set size, supersitions about NOT IN, etc. 2. The context that the query operates within - DB specifics, task specifics, etc.

Your query passes #1, no problem. I would have written it differently - aliased the Person table, used LEFT(P.FirstName, 1) instead of SUBSTRING, and used a CTE (WITH-clause) instead of a subquery. But these aren't optimization issues. Maybe I'd use WITH(READUNCOMMITTED) if the results weren't sensitive to dirty reads. Out of any further context, your query doesn't look like a bomb waiting to go off.

As for #2 - You should probably switch to specifics. Like "I have to run this every week. It takes 17 minutes. How can I get it down to under a minute?" Then people will ask you what your plan looks like, what indexes you have, etc.

Things I'd want to know:

  • How long does it already take to run?
  • What's your runtime window? (User & app tolerance for query time.)
  • Is this run once a day? Week? Month? Quarter?
  • Do you have the permission to create tables, change current tables, or alter indexes?
  • Maybe based on having run it, what's the ratio of duplicates you're expecting to find? 5%? 90%?
  • How stable is the matching criteria requirement?

Example scenario: If this was a run-on-command feature, it will be in my app indefinitely, it will get run weekly, with 10% or fewer records expected to be duplicates, with ability to change the DB how I'd like, if the duplicate matching criteria is firm (not fluctuating), and I wan to cut it from 90s to 5s, I'd create a dedicated BirthYear column (possibly a persisted computed column off of BirthDate), and an index on LastName ASC, BirthYear ASC, FirstName ASC. If too many of those stipulations change, I might to a different direction entirely.


You can try something like this and see the difference on the execution plans, or benchmark the results on performance:

;WITH DupPersons AS
(
    SELECT *, COUNT(1) OVER(PARTITION BY SUBSTRING(firstName, 1, 1), lastName, YEAR(birthDate)) Quant
    FROM Person
)

SELECT *
FROM DupPersons
WHERE Quant > 1

Of course, it would also help to know your table definition and the indexes you created. I think that maybe it can help to add a computed column with the year of birthdate and create an index on it, the same with the first letter of firstname.

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜