开发者

In SQL Server what is most efficient way to compare records to other records for duplicates with in a given range of values?

We have an SQL Server that gets daily imports of data files from clients. This data is interrelated and we are always scrubbing it and having to look for suspect duplicate records between these files.

Finding and tagging suspect records can get pretty complicated. We use logic that requires some field values to be the same, allows some field values to differ, and allows a range to be specified for how different certain field values can be. The only way we've found to do it is by using a cursor based process, and it places a heavy burden on the database.

So I wanted to ask if there's a more efficient way to do this. I've heard it said that there's almost always a more efficient way to replace cursors with clever JOINS. But I have to admit I'm having a lot of trouble with this one.

For a concrete example suppose we have 1 table, an "orders" table, with the following 6 fields.

(order_id, customer_id, product_id, quantity, sale_date, price)

We want to look through the records to find suspect duplicates on the following example criteria. These get increasingly harder.

  1. Records that have the same product_id, sale_date, and quantity but different customer_id's should be marked as suspect duplicates for review
  2. Records that have the same customer_id, product_id, quantity and have sale_dates within five days of each other should be marked as suspect duplicates for review
  3. Records that have the same customer_id, product_id, but different quantities within 20 units, and sales dates within five days of each other should be considered suspect.

Is it possible to satisfy each one of these cr开发者_如何学运维iteria with a single SQL Query that uses JOINS? Is this the most efficient way to do this?


If this gets much more involved, then you might be looking at a simple ETL process to do the heavy carrying for you: the load to the database should be manageable in the sense that you will be loading to your ETL environment, running tranformations/checks/comparisons and then writing your results to perhaps a staging table that outputs the stats you need. It sounds like a lot of work, but once it is setup, tweaking it is no great pain.

On the other hand, if you are looking at comparing vast amounts of data, then that might entail significant network traffic.


I am thinking efficient will mean adding index to the fields you are looking into the contents of. Not sure offhand if a megajoin is what you need, or just to list off a primary key of the suspect records into a hold table to simply list problems later. I.e. do you need to know why each record is suspect in the result set

You could

-- Assuming some pkid (primary key) has been added 1.

select pkid,order_id, customer_id product_id, quantity, sale_date from orders o join orders o2 on o.product_id=o2.productid and o.sale_date=o2.sale_date and o.quantity=o2.quantity and o.customerid<>o2.customerid

then keep joining up more copies of orders, I suppose


You can do this in a single Case statement. In this below scenario, the value for MarkedForReview will tell you which of your three Tests (1,2, or 3) triggered the review. Note that I have to check for the conditions of the third test before the second test.

With InputData As
    (
    Select order_id, product_id, sale_date, quantity, customer_id
        , Case 
            When O.sale_date = O2.sale_date Then 1 
            When Abs(DateDiff(d, O.sale_date, O2.sale_date)) <= 5 
                And Abs( O.quantity - O2.quantity ) <= 20 Then 3
            When Abs(DateDiff(d, O.sale_date, O2.sale_date)) <= 5  Then 2
            Else 0
            End As MarkedForReview
    From Orders As O
        Left Join Orders As O2
            On O2.order_id <> O.order_id
                And O2.customer_id = O.customer_id
                And O2.product_id = O.product_id
    )
Select order_id, product_id, sale_date, quantity, customer_id
From InputData
Where MarkedForReview <> 0

Btw, if you are using something prior to SQL Server 2005, you can achieve the equivalent query using a derived table. Also note that you can return the id of the complementary order that triggered the review. Both orders that trigger a review will obviously be returned.

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜