开发者

SQL Query Help - Duplicate Removal

wasn't sure whether to put this in开发者_如何学Go Software or here, so I figured I'd start here I know this will be a straightforward answer from you SQL geniuses...

I have a table, it contains contacts that I import on a daily basis. I will have an ASP.NET front end for user interaction. From this table, my intention is to send them all mailers - but only one to each address. So my end result is a user enters a date (which corresponds to teh date imported) and they are given a resultant grid that has all the unique addresses associated to that date. I only want to send a mailer to that address once - many times my original imported list will contain multiple businesses at the same address.

Table: ContactTable Fielsd: ID, CompanyName, Address, City, State, Zip, Phone

I can use the SELECT DISTINCT clause, but I need all the data associated to it (company name, etc.)

I have over 262000 Records in this table.

If I select a sample date of 1/10/2011, I get 2401 records. SELECT DISTINCT Address from the same date gives me 2092 records. This is workable, I would send those 2092 people a mailer.

Secondly, I'd have to be able to historically check if a mailer was already sent to that address as well. I would not want to send another mailer to the same business tomorrow either.

What's my best way?


I would start with creating a table to lookup sent mailers.

ID | DateSent
-------------

Every time you send a mailer you are going to want to insert the ID, and the DateTime into it, this way when you go to pull the mailers you can look against this table to see if the mailer has been sent within whatever your specified time frame of mailing is. You can extend this if you have multiple types of mailers to include the mailer type.

Plain Old SQL

SELECT a.ID, a.CompanyName, b.Address, b.City, b.State, b.Zip, a.Phone
FROM a.ContactTable
RIGHT JOIN (SELECT DISTINCT Address, City, State, Zip
            FROM ContactTable) b
ON a.ID = b.ID

This sub-query is like creating a temp table SELECTing only the DISTINCT addresses, then joining it to the rest of the info.

To add the lookup against your new table add the following

SELECT a.ID, a.CompanyName, b.Address, b.City, b.State, b.Zip, a.Phone
FROM a.ContactTable
RIGHT JOIN (SELECT DISTINCT Address, City, State, Zip
            FROM ContactTable) b
ON a.ID = b.ID
RIGHT JOIN SentMailer c
ON a.ID = c.ID 

WHERE DATEDIFF(mm, c.DateSent, GETDATE()) > 12 --gives you everything that hasn't been sent a mailer within the last year

Edit

Without the data being standardized it's hard to get quality results. I've found in the past the more creative I have to get with my queries is a flag to bad table structure or data collection. I think you should still create a lookup table for ID/DateSent to manage the time frames for sending.

Edit

Yes, I'm basically looking for the unique address, city, state, zip. I would only require one instance for each address so we would be able to send a mailer to that address. At this point, Company name would not be required.

If this is the case you can simply do the following:

SELECT DISTINCT Address, City, State, Zip, Phone
FROM ContactTable

Keep in mind this won't scrub entries like Main Street vs Main St.


RogueSpear, I work in the address verification (and thus de-duplication) field for SmartyStreets, where we deal with this scenario a lot and tackle the challenge.

If you're getting daily lists from a company and have hundreds of thousands of records, then removing duplicate addresses using stored procedures or mere queries won't be enough to match the varying possibilities of each address. There are services which do this, and I'd point you to CASS-Certified vendors which provide that.

You can flag duplicates in a table using something like CASS-Certified Scrubbing, or you can prevent duplicates at point-of-entry with an API like LiveAddress. Anyway, I'd be happy to personally help you with any other address questions.


I would select, then remove, the duplicates like this:

SELECT  a.ID, a.PurgedID, a.CAMPAIGNTYPE, a.COMPANY, a.DBANAME, a.COADDRESS, a.COCITY, a.COSTATE, a.COZIP, a.FIRSTNAME1, a.DIALERPHONENUM, a.Purged FROM PurgeReportDetail a
WHERE EXISTS (
    SELECT * FROM PurgeReportDetail b WHERE
    b.COADDRESS = a.COADDRESS
    AND b.COCITY = a.COCITY
    AND b.COSTATE = a.COSTATE
    AND b.COZIP = a.COZIP
    AND b.id <> a.id
    ) -- This clause will only include rows with duplicate columns noted
AND a.ID IN (
    SELECT TOP 1 c.ID from PurgeReportDetail c 
    WHERE c.COADDRESS = a.COADDRESS
    AND c.COCITY = a.COCITY
    AND c.COSTATE = a.COSTATE
    AND c.COZIP = a.COZIP
    ORDER BY c.ID -- If you want the *newest* entry to be saved, add "DESC" here
    ) -- This clause gets the top 1 ID value for each matching set

or something like this. This will keep the first ID of the redundant address, just replace the SELECT with DELETE when ready.

EDIT: Of course this will only work on exact matches.

EDIT2: If you wanted to only check where you hadn't sent mailers, you should join both to a table of sent mailers from a specified date range

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜