开发者

MySQL duplicates -- how to specify when two records actually AREN'T duplicates?

I have an interesting problem, and my logic isn't up to the task.

We have a table with that sometimes develops duplicate records (for process reasons, and this is unavoidable). Take the following example:

id  FirstName  LastName  PhoneNumber   email
--  ---------  --------  ------------  --------------
 1  John       Doe       123-555-1234  jdoe@gmail.com
 2  Jane       Smith     123-555-1111  jsmith@foo.com
 3  John       Doe       123-555-4321  jdoe@yahoo.com
 4  Bob        Jones     123-555-5555  bob@bar.com
 5  John       Doe       123-555-0000  jdoe@hotmail.com
 6  Mike       Roberts   123-555-9999  roberts@baz.com
 7  John       Doe       123-555-1717  wally@domain.com

We find the duplicates this way:

SELECT c1.* 
FROM `clients` c1
INNER JOIN (
    SELECT `FirstName`, `LastName`, COUNT(*)
    FROM `clients`
    GROUP BY `FirstName`, `LastName`
    HAVING COUNT(*) > 1
) AS c2
ON c1.`FirstName` = c2.`FirstName`
AND c1.`LastName` = c2.`LastName`

This generates the following list of duplicates:

id  FirstName  LastName  PhoneNumber   email
--  ---------  --------  ------------  --------------
 1  John       Doe       123-555-1234  jdoe@gmail.com
 3  John       Doe       123-555-4321  jdoe@yahoo.com
 5  John       Doe       123-555-0000  jdoe@hotmail.com
 7  John       Doe       123-555-1717  wally@domain.com

As you can see, based on FirstName and LastName, all of the records are duplicates.

At this point, we actually make a phone call to the client to clear up potential duplicates.

After doing so, we learn (for example) that records 1 and 3 are real duplicates, but records 5 and 7 are actually two different people altogether.

So we merge any extraneously linked data from records 1 and 3 into record 1, remove record 3, and leave records 5 and 7 alone.

Now here's were the problem comes in:

The next time we re-run the "duplicates" query, it will contain the following rows:

id  FirstName  LastName  PhoneNumber   email
--  ---------  --------  ------------  --------------
 1  John       Doe       123-555-4321  jdoe@gmail.com
 5  John       Doe       123-555-0开发者_运维百科000  jdoe@hotmail.com
 7  John       Doe       123-555-1717  wally@domain.com

They all appear to be duplicates, even though we've previously recognized that they aren't.

How would you go about identifying that these records aren't duplicates?

My first though it to build a lookup table identifying which records aren't duplicates of each other (for example, {1,5},{1,7},{5,7}), but I have no idea how to build a query that would be able to use this data.

Further, if another duplicate record shows up, it may be a duplicate of 1, 5, or 7, so we would need them all to show back up in the duplicates list so the customer service person can call the person in the new record to find out which record he may be a duplicate of.

I'm stretched to the limit trying to understand this. Any brilliant geniuses out there that would care to take a crack at this?


Interesting problem. Here's my crack at it.

How about if we approach the problem from a slightly different perspective.

Consider that the system is clean for a start i.e all records currently in the system are either with Unique First + Last name combinations OR the same first + last name ones have already been manually confirmed to be different people.

At the point of entering a NEW user in the system, we have an additional check. Can be implemented as an INSERT Trigger or just another procedure called after the insert is successfully done.

  1. This Trigger / Procedure matches the FIRST + LAST name combination of "Inserted"record with all existing records in the table.
  2. For all the matching First + Last names, it will create an entry in a matching table (new table) with NewUserID, ExistingMatchingRecordsUserID

From an SQL perspective,

TABLE MatchingTable
COLUMNS 1. NewUserID 2. ExistingUserID
Constraint : Logical PK = NewUserID + ExistingMatchingRecordsUserID

INSERT INTO MATCHINGTABLE VALUES ('NewUserId', userId)
SELECT userId FROM User  u where u.firstName = 'John' and u.LastName = 'Doe'

All entries in MatchingTable need resolution.

When say an Admin logs into the system, the admin sees the list of all entries in MatchingTable

eg: New User John Doe - (ID 345) - 3 Potential matches John Doe - ID 123 ID 231 / ID 256

The admin will check up data for 345 against data in 123 / 231 and 256 and manually confirm if duplicate of ANY / None If Duplicate, 345 is deleted from User Table (soft / hard delete - whatever suits you) If NOT, the entries for ID 354 are just removed from MatchingTable (i would go with hard deletes here as this is like a transactional temp table but again anything is fine).

Additionally, when entries for ID 354 are removed from MatchingTable, all other entries in MatchingTable where ExistingMatchingRecordsUserID = 354 are automatically removed to ensure that unnecessary manual verification for already verified data is not needed.

Again, this could be a potential DELETE trigger / Just logic executed additionally on DELETE of MatchingTable. The implementation is subject to preference.


At the expense of adding a single byte per row to your table, you could add a manually_verified BOOL column, with a default of FALSE. Set it to TRUE if you have manually verified the data. Then you can simply query where manually_verified = FALSE.

It's simple, effective, and matches what is actually happening in the business processes: you manually verify the data.

If you want to go a step further, you might want to store when the row was verified and who verified it. Since this might be annoying to store in the main table, you could certainly store it in a separate table, and LEFT JOIN in the verification data. You could even create a view to recreate the appearance of a single master table.

To solve the problem of a new duplicate being added: you would check non-verified data against the entire data set. So that means your main table, c1, would have the condition manually_verified = FALSE, but your INNER JOINed table, c2, does not. This way, the unverified data will still find all potential duplicate matches:

SELECT * FROM table t1
INNER JOIN table t2 ON t1.name = t2.name AND t1.id <> t2.id
WHERE t1.manually_verified = FALSE

The possible matches for the duplicates will be in the joined table.

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜