Finding similar names in multiple tables

I have multiple tables with different customer names. I am trying to find out how many times the same name is in a table. The challenge here is that someone could have entered the name as "John Smith" or "Smith, John".

There are 40,000 rows in each table and over 40 different tables. I am trying to query somehow without knowing the names but still return like names.

Basically I need to group like names without using a statement like:

WHERE cust_name LIKE '%john%'

How can you query multiple table columns using the contents of other table columns when the data within may not be in the same format? How would you best 'clean' the data to remove commas, spaces, etc?


Well, you have fuzzy logic available in SSIS. I've used fuzzy grouping successfully to find duplicates, although you will want to match on more than name, since many people share the same name. I've done the match using name, address, phone, and email. Fuzzy grouping allows you to use multiple fields for matching.
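
If SSIS isn't available, the multi-field idea can be roughed out in plain Python. This is only a sketch: it swaps SSIS fuzzy grouping for the standard library's difflib.SequenceMatcher, and the field names (name, email, phone) and the sample records are made up for illustration.

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity between two strings, from 0.0 to 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def record_score(r1: dict, r2: dict, fields=("name", "email", "phone")) -> float:
    """Average the per-field similarity so no single field decides the match."""
    return sum(similarity(r1[f], r2[f]) for f in fields) / len(fields)

# Hypothetical records: a and b are the same person with a typo in the name,
# c is a different person who happens to share the name.
a = {"name": "John Smith", "email": "jsmith@example.com", "phone": "555-0100"}
b = {"name": "Jon Smith", "email": "jsmith@example.com", "phone": "555-0100"}
c = {"name": "John Smith", "email": "john.smith@other.org", "phone": "444-7777"}

print(record_score(a, b))  # high: every field agrees or nearly agrees
print(record_score(a, c))  # lower: only the name agrees
```

Any cut-off for "same customer" would have to be tuned against your own data.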


This really isn't a database problem. The real problem is coming up with an algorithm that will take a name and convert it into a standard format. This is hard to do and really depends on what your source data looks like. I would look through your source data, try to come up with some patterns to look for, and then use regular string manipulation to change them all into the same format.
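
As a rough illustration of that kind of string manipulation, here is a minimal Python sketch. It assumes only the two formats from the question ("John Smith" and "Smith, John"); normalize_name and the sample list are hypothetical, and real data will need more patterns.

```python
import re
from collections import Counter

def normalize_name(raw: str) -> str:
    """Return a lowercase 'first last' key for 'First Last' or 'Last, First'."""
    name = raw.strip()
    if "," in name:
        # "Smith, John" -> "John Smith": swap the parts around the first comma.
        last, first = name.split(",", 1)
        name = f"{first} {last}"
    # Drop punctuation, then collapse runs of whitespace and lowercase.
    name = re.sub(r"[^\w\s]", "", name)
    return " ".join(name.split()).lower()

# Hypothetical sample values; in practice these would come from the tables.
names = ["John Smith", "Smith, John", "  john   smith ", "Jane Doe"]
counts = Counter(normalize_name(n) for n in names)
print(counts)
```

Grouping on the normalized key rather than the raw column is what lets "John Smith" and "Smith, John" be counted together.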


Name matching can be very tricky business. Not only do you need to worry about "John Smith" vs. "Smith, John", but you usually need to worry about Katherine vs. Catherine vs. Kate vs. Kathy vs. Cathy. I'm sure that there are third-party data mining solutions for something like this, although none that I can recommend.

If you know that your names are only in the form "FirstName LastName" and "LastName, FirstName" then you could try something like this:

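SELECT
    CASE
        -- "LastName, FirstName": move the part after the comma to the front
        WHEN name LIKE '%,%'
            -- + 2 skips the comma and the single space that follows it
            THEN SUBSTRING(name, CHARINDEX(',', name) + 2, LEN(name)) + ' ' +
                 SUBSTRING(name, 1, CHARINDEX(',', name) - 1)
        -- already "FirstName LastName"
        ELSE name
    END AS name
FROM customers  -- placeholder table name; substitute your own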

The string functions depend on your specific RDBMS (the ones above are SQL Server's). Also, this is pretty brittle: it relies on the exact format, with exactly one space after the comma. You'll need to tweak it if you want something more robust than that.

I would also suggest that you add a view over the forty tables as a UNION ALL so that you can work with them all at once. Maybe hard-code something into the view so that you know which table each row came from.
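
Here is a small sketch of that view, run against an in-memory SQLite database from Python so it is self-contained. The table names (customers_a, customers_b), the view name, and the source_table column are all made up; the real view would have one branch per table, forty in total.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers_a (cust_name TEXT);
    CREATE TABLE customers_b (cust_name TEXT);
    INSERT INTO customers_a VALUES ('John Smith'), ('Jane Doe');
    INSERT INTO customers_b VALUES ('Smith, John');

    -- One branch per table; the string literal tags each row with its origin.
    CREATE VIEW all_customers AS
        SELECT 'customers_a' AS source_table, cust_name FROM customers_a
        UNION ALL
        SELECT 'customers_b' AS source_table, cust_name FROM customers_b;
""")

rows = conn.execute(
    "SELECT source_table, cust_name FROM all_customers "
    "ORDER BY source_table, cust_name"
).fetchall()
for row in rows:
    print(row)
```

UNION ALL rather than UNION matters here: UNION would silently drop duplicate rows, which are exactly what you are trying to count.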

Finally, you could look into using soundex, but implementing that might be difficult if you don't have experience with it.
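
To give an idea of what soundex does, here is a sketch of the commonly described American Soundex rules in Python. It assumes plain ASCII names, and a database's built-in SOUNDEX function may differ in edge cases, so treat it as an illustration rather than a drop-in replacement.

```python
# Letter-to-digit table for American Soundex; vowels, h, w and y have no code.
CODES = {}
for digit, group in enumerate(["bfpv", "cgjkqsxz", "dt", "l", "mn", "r"], start=1):
    for ch in group:
        CODES[ch] = str(digit)

def soundex(word: str) -> str:
    """Return the 4-character Soundex code, or '' if the word has no letters."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0].upper()
    prev = CODES.get(letters[0], "")  # the first letter's code still blocks repeats
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate two letters with the same code
        code = CODES.get(ch, "")
        if code and code != prev:
            result += code
        prev = code  # a vowel resets prev, so the same code can repeat after it
    return (result + "000")[:4]  # pad with zeros, then cut to four characters

for name in ["Smith", "Smyth", "Robert", "Rupert", "Katherine", "Catherine"]:
    print(name, soundex(name))
```

Note that Soundex keeps the first letter as-is, so Katherine and Catherine still get different codes (K365 and C365). It helps with spelling variants like Smith and Smyth, not with differing initials or nicknames.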
