Addresses stored in SQL Server have many small variations (errors)
I have a table in my database which stores packing slips and their information. I'm trying to query that table and get each unique address. I've come close, but I still have many near misses and I'm looking for a way to exclude these near duplicates from my select.
Sample Data
CompanyCode  CompanyName                     Addr1                 City       State  Zip
10033        UNITED DIE CUTTING & FINISHIN   3610 HAMILTON AVE     CLEVELAND  Ohio   44114
10033        UNITED DIE CUTTING & FINISHING  3610 HAMILTON AVE     CLEVELAND  Ohio   44114
10033        UNITED DIE CUTTING & FINISHING  3610 HAMILTON AVE.    CLEVELAND  Ohio   44114
10033        UNITED DIE CUTTING & FINISHING  3610 HAMILTON AVENUE  CLEVELAND  Ohio   44114
10033        UNITED DIECUTTING & FINISHING   3610 HAMILTON AVE     CLEVELAND  Ohio   44144
10033        UNITED FINISHING                3610 HAMILTON AVE     CLEVLAND   Ohio   44114
10033        UNITED FINISHING & DIE CUTTING  3610 HAMILTON AVE     CLEVELAND  Ohio   44114
All I want is 1 record. Is there some way I can get the "average" record? Meaning, if most of the records say CLEVELAND instead of CLEVLAND, I want my 1 record to say CLEVELAND. Is there any way to pare this data down to what I'm looking for?
Desired Output
CompanyCode  CompanyName                     Addr1                 City       State  Zip
10033        UNITED DIE CUTTING & FINISHING  3610 HAMILTON AVE     CLEVELAND  Ohio   44114
You are going to struggle. Personally I think I'd consider having a process that tries to update the data in the database and correct it.
You could change the system that inputs the data (or, if that's not possible, have an external process run over the data once it's in the db) so that it does something like the following:
- Checks values against known lists of things like towns, states and countries to catch typos.
- Looks for known regular mistakes and abbreviations, e.g. "Avenue"/"Ave." or "Street"/"St.", and normalises the values.
- Does this kind of validation in the input system itself and/or provides the users with an address search/validation UI that allows them to search for an address given some known values (zip/postal code etc). You can buy data like this from various suppliers depending on where you are in the world.
If this all works (I doubt you'll get 100% unless you provide a mechanism whereby those things that can't be auto-corrected are flagged for human intervention), then your reporting is as simple as SELECT DISTINCT.
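The first two checks in the list above could be sketched as an external clean-up pass along these lines. This is only a minimal Python sketch under stated assumptions: the abbreviation map, the list of known cities, the 0.8 similarity cutoff and the function names are all made up for illustration, and a real process would load its reference data from a purchased or official source.

```python
import difflib

# Hypothetical lookup tables; real ones would come from reference data.
STREET_ABBREVIATIONS = {"AVENUE": "AVE", "AVE.": "AVE", "STREET": "ST", "ST.": "ST"}
KNOWN_CITIES = ["CLEVELAND", "COLUMBUS", "CINCINNATI"]


def normalise_street(addr):
    """Replace known spellings of street suffixes with one canonical form."""
    words = addr.upper().split()
    return " ".join(STREET_ABBREVIATIONS.get(w, w) for w in words)


def correct_city(city, cutoff=0.8):
    """Return the closest known city, or None to flag the value for human review."""
    matches = difflib.get_close_matches(city.upper(), KNOWN_CITIES, n=1, cutoff=cutoff)
    return matches[0] if matches else None
```

With these tables, `normalise_street("3610 HAMILTON AVENUE")` and `normalise_street("3610 HAMILTON AVE.")` both give `3610 HAMILTON AVE`, and `correct_city("CLEVLAND")` gives `CLEVELAND`. A value that matches nothing closely enough comes back as `None`, so it can be queued for a person to check rather than silently "corrected".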
...
This is a one-time job, I hope? It's impossible unless you can explain (in SQL terms) why the first record is the one you need. As a temporary solution I'd suggest the following query:
select C1.*
from Company C1
join (
    -- one representative name per company; min() is an arbitrary but deterministic choice
    select CompanyCode, min(CompanyName) as CompanyNameSelected
    from Company
    group by CompanyCode
) C2
    on C1.CompanyCode = C2.CompanyCode
   and C1.CompanyName = C2.CompanyNameSelected;
You could use any aggregate function instead of min (one that returns a CompanyName, of course), or even write your own stored function, but one thing is required: you have to explain in query language why record #1 is better than record #2.
You need to use subqueries in the select statement. Try this query:
select CompanyCode,
(select top 1 CompanyName from Table1 where CompanyCode=X.CompanyCode
group by CompanyName order by count(*) desc) CompanyName,
(select top 1 Addr1 from Table1 where CompanyCode=X.CompanyCode
group by Addr1 order by count(*) desc) Addr1,
(select top 1 City from Table1 where CompanyCode=X.CompanyCode
group by City order by count(*) desc) City,
(select top 1 State from Table1 where CompanyCode=X.CompanyCode
group by State order by count(*) desc) State,
(select top 1 Zip from Table1 where CompanyCode=X.CompanyCode
group by Zip order by count(*) desc) Zip
from Table1 X
group by CompanyCode
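To see what this query computes, here is the same per-column "most frequent value" idea applied to the sample rows from the question, as a small Python sketch. The `most_common_per_column` helper and the tuple layout are illustrative assumptions, not part of the SQL above, and since every sample row shares one CompanyCode the grouping step is left out.

```python
from collections import Counter

COLUMNS = ("CompanyCode", "CompanyName", "Addr1", "City", "State", "Zip")

# The seven sample rows from the question, typos included.
ROWS = [
    ("10033", "UNITED DIE CUTTING & FINISHIN", "3610 HAMILTON AVE", "CLEVELAND", "Ohio", "44114"),
    ("10033", "UNITED DIE CUTTING & FINISHING", "3610 HAMILTON AVE", "CLEVELAND", "Ohio", "44114"),
    ("10033", "UNITED DIE CUTTING & FINISHING", "3610 HAMILTON AVE.", "CLEVELAND", "Ohio", "44114"),
    ("10033", "UNITED DIE CUTTING & FINISHING", "3610 HAMILTON AVENUE", "CLEVELAND", "Ohio", "44114"),
    ("10033", "UNITED DIECUTTING & FINISHING", "3610 HAMILTON AVE", "CLEVELAND", "Ohio", "44144"),
    ("10033", "UNITED FINISHING", "3610 HAMILTON AVE", "CLEVLAND", "Ohio", "44114"),
    ("10033", "UNITED FINISHING & DIE CUTTING", "3610 HAMILTON AVE", "CLEVELAND", "Ohio", "44114"),
]


def most_common_per_column(rows):
    """Pick the most frequent value in each column independently."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*rows))


print(dict(zip(COLUMNS, most_common_per_column(ROWS))))
```

On this sample the result is the desired row: UNITED DIE CUTTING & FINISHING, 3610 HAMILTON AVE, CLEVELAND, Ohio, 44114. Because each column is chosen independently, the combined row is not guaranteed to exist in the table as an actual record.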