I need an address matching algorithm
I have looked around online for this but haven't found much. Basically, I need to compare a bunch of addresses to see if they match. The addresses could be written in all sorts of different ways, for example: 1345 135th st NE, 1345 NE 135TH ST, etc. They could also be in different languages. Before I attempt to write a parsing and matching algorithm on my own, does anyone know of any libraries or ways I could easily do this? My friend thought of using the Google or Bing Maps web service: pass them the address, get back the geo-coordinates, and compare using the coordinates instead of string matching. But then I have to call a web service thousands of times for all these addresses I have, which is not very elegant ;) Any help would be nice :)
I don't think that this is a REGEX type of problem. You are looking at converting to a comparable format first.
There are several web services / products available that will standardize an address for you. Bing for "USPS Address Standardization API" and you will find a ton of information. Once the address is standardized, the comparison should be straightforward.
http://www.bing.com/search?q=usps+address+standardization+api&go=&form=QBRE&qs=n&sk=&sc=1-32
Alternatively you can GeoCode the address to get a set of coordinates and then compare those.
http://code.google.com/apis/maps/documentation/geocoding/
US addresses can (usually) be uniquely represented by a 12-digit number called the delivery point (DPBC). This number consists of the full 9-digit ZIP Code and a 3 digit delivery point number. This is what is used to form barcodes on mail pieces to speed up delivery. Using a service that is CASS-Certified can provide the 12-digit delivery point and even flag duplicates for you.
In the interest of full disclosure I work for SmartyStreets, which was formerly Qualified Address, which was mentioned in the other answer by Mowgli.
We provide an API that can be queried as well as a batch processing service (which will flag duplicates as explained above).
Keep in mind that even the 12-digit DPBC doesn't always uniquely identify a particular address. This happens frequently when a particular street block, or 9-digit ZIP code, has a long stretch of homes that have similar primary numbers. In these cases, it's best to use a CASS service to standardize and validate the addresses, then hash them for convenient comparisons. (But as said, duplicates will already be flagged by some CASS services.)
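As a rough sketch of the "standardize, then hash" idea: assuming each address has already been standardized by some service, the cleaned fields can be joined into one canonical string and hashed, so that duplicates collapse to the same key. The field names (`delivery_line`, `city`, `state`, `zip9`) and the sample records are made up for illustration and are not the output format of any particular service.

```python
import hashlib

# Hypothetical field names; substitute whatever your standardization service returns.
FIELDS = ("delivery_line", "city", "state", "zip9")

def address_key(standardized: dict) -> str:
    """Return a stable hash for an already-standardized address record."""
    # Trim and uppercase each field so trivial differences do not change the key.
    canonical = "|".join(standardized.get(f, "").strip().upper() for f in FIELDS)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

# Invented sample records, for illustration only.
a = {"delivery_line": "1345 NE 135th St", "city": "Seattle", "state": "WA", "zip9": "98125-0000"}
b = {"delivery_line": "1345 NE 135TH ST ", "city": "seattle", "state": "wa", "zip9": "98125-0000"}
c = {"delivery_line": "1347 NE 135th St", "city": "Seattle", "state": "WA", "zip9": "98125-0000"}

print(address_key(a) == address_key(b))  # same address, same key
print(address_key(a) == address_key(c))  # different house number, different key
```

Storing the key alongside each record turns duplicate detection into a plain equality lookup (a set or a database index).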
Update: SmartyStreets now provides international address verification.
I wouldn't consider this a regex problem.
One free tool that could be helpful is usaddress, a Python library for parsing addresses. It performs pretty well on all sorts of address formats because it uses a probabilistic approach rather than a regex approach (although it is made for US addresses and may not work well on addresses in other languages): http://usaddress.readthedocs.org/en/latest/
Parsing addresses won't solve your problem 100%, but comparing two addresses, especially addresses with varying formats, will be much easier if the addresses are split into their respective components (so that you can compare street # against street #, city against city, etc.).
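To illustrate comparing component against component, here is a minimal sketch. It assumes the two addresses from the question have already been parsed into dictionaries; the keys (`number`, `street`, `street_type`, `directional`) are hypothetical stand-ins, not the labels any specific parser emits, so you would map your parser's output onto them first.

```python
def normalize(value: str) -> str:
    """Lowercase and strip whitespace/punctuation so 'ST.' equals 'st'."""
    return value.strip().strip(".,").lower()

def compare_components(a: dict, b: dict,
                       keys=("number", "street", "street_type", "directional")) -> dict:
    """Return, per component, whether the two parsed addresses agree."""
    return {k: normalize(a.get(k, "")) == normalize(b.get(k, "")) for k in keys}

# Hand-written stand-ins for parser output of
# "1345 135th st NE" and "1345 NE 135TH ST".
addr_a = {"number": "1345", "street": "135th", "street_type": "st", "directional": "NE"}
addr_b = {"number": "1345", "street": "135TH", "street_type": "ST", "directional": "NE"}

result = compare_components(addr_a, addr_b)
print(result)
print(all(result.values()))
```

Because the directional is stored as its own component, it no longer matters whether it appeared before or after the street name in the raw string.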
Then, to compare records, you can use dedupe - another free python library. http://dedupe.readthedocs.org/en/latest/
I found 2 options.
Firstly, maybe, instead of taking any input, you let the users choose from a limited number of options, similar to how facebook deals with addresses. If you use an autocomplete api, as they type, the possible addresses will be narrowed down by the api. Here is one from google:
http://code.google.com/p/geo-autocomplete/
Secondly, address finding & qualifying services (but they aren't free):
https://www.craftyclicks.co.uk/
https://smartystreets.com/ (Previously Qualified Address)
https://www.alliescomputing.com/ (Previously offered World Addresses)
There is an open source python library for record deduplication / entity resolution that can be applied to address matching: Dedupe.
It's free and can be run on a laptop, as opposed to a huge server.
This requires intelligence to do correctly; computers aren't intelligent.
A simple algorithm could tell you which addresses have something in common, for example, "1345 135th st NE" and "1345 NE 135TH ST" have the number "1345" in common.
You would then have fewer to compare yourself. It would also reduce the number you geolocate.
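A minimal sketch of that kind of pre-filtering, assuming standalone numbers (such as the house number) are a good enough grouping key: only addresses that share a number become candidate pairs for closer inspection or geolocation. The function name and the sample list are illustrative.

```python
import re
from collections import defaultdict
from itertools import combinations

def candidate_pairs(addresses):
    """Return index pairs of addresses that share at least one standalone number."""
    blocks = defaultdict(list)
    for idx, addr in enumerate(addresses):
        # \b\d+\b matches "1345" but not the digits inside "135th".
        for num in set(re.findall(r"\b\d+\b", addr)):
            blocks[num].append(idx)
    pairs = set()
    for members in blocks.values():
        pairs.update(combinations(members, 2))
    return sorted(pairs)

addresses = [
    "1345 135th st NE",
    "1345 NE 135TH ST",
    "22 Rue de Berri",
    "283 Green Garden street",
]
print(candidate_pairs(addresses))  # [(0, 1)]
```

With four addresses there are six possible pairs, but only one survives the filter, so only that pair needs a more expensive comparison.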
This is definitely not a REGEX problem. It's 2018 and we have more advanced methods at hand. Both R and Python offer solutions for this type of problem:
In R: https://cran.r-project.org/web/packages/RecordLinkage/index.html
In python: https://recordlinkage.readthedocs.io/en/latest/about.html
1. Using address string similarity
Because addresses can be written in many different ways, it is useful to apply fuzzy matching and calculate the similarity of the address strings. I have used the fuzzywuzzy Python library for this task. It has functions that score the similarity of strings based on Levenshtein distance.
from fuzzywuzzy import fuzz
addr1 = "USA AZ 850020 Phoenix Green Garden street, 283"
addr2 = "850020, USA AZ Phoenix Green Garden, 283, 3a"
addr3 = "Canada VC 9830 Vancouver Dark Ocean street, 283"
addr_similarity12 = fuzz.token_set_ratio(addr1, addr2)
addr_similarity13 = fuzz.token_set_ratio(addr1, addr3)
print(f"Address similarity 1 <-> 2: {addr_similarity12}")
print(f"Address similarity 1 <-> 3: {addr_similarity13}")
Output will be:
Address similarity 1 <-> 2: 96
Address similarity 1 <-> 3: 55
Indeed, addresses 1 and 2 are almost the same, while addresses 1 and 3 are different. The important task is choosing an appropriate threshold that will indicate address equality.
2. Using Google Maps Geocoding API
Geocoding is the process of converting addresses (like "1600 Amphitheatre Parkway, Mountain View, CA") into geographic coordinates (like latitude 37.423021 and longitude -122.083739). And then it's possible to calculate numerical "distance" between two addresses.
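Once both addresses have coordinates, the "distance" can be computed with the Haversine formula. The sketch below is a plain-Python version; the Earth radius is the usual mean value, and the 25 m cutoff in `same_place` is an arbitrary assumption you would tune for your data.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two (lat, lon) points in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlmb = radians(lon2 - lon1)
    h = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(h))

def same_place(p, q, max_m=25.0):
    """Treat two geocoded points as the same address if they are within max_m metres."""
    return haversine_m(*p, *q) <= max_m

mountain_view = (37.423021, -122.083739)  # coordinates quoted above
paris = (48.8728822, 2.3054154)           # coordinates from the sample response in this answer

print(round(haversine_m(*mountain_view, *paris) / 1000), "km")
print(same_place(mountain_view, mountain_view))
print(same_place(mountain_view, paris))
```

A distance cutoff alone cannot distinguish units within one building, so it is usually combined with a component comparison.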
One way to solve this problem is to convert both addresses to the same format. An easy way to do this with the Google Maps Geocoding API is to simply pass both addresses to the API and compare the output. The output of the Geocoding API looks something like this:
FORMAT OF GOOGLE'S GEOCODING API RESPONSE (for reference):
{'results': [{'address_components': [{'long_name': '22',
'short_name': '22',
'types': ['street_number']},
{'long_name': 'Rue de Berri',
'short_name': 'Rue de Berri',
'types': ['route']},
{'long_name': 'Paris',
'short_name': 'Paris',
'types': ['locality', 'political']},
{'long_name': 'Département de Paris',
'short_name': 'Département de Paris',
'types': ['administrative_area_level_2', 'political']},
{'long_name': 'Île-de-France',
'short_name': 'IDF',
'types': ['administrative_area_level_1', 'political']},
{'long_name': 'France',
'short_name': 'FR',
'types': ['country', 'political']},
{'long_name': '75008', 'short_name': '75008', 'types': ['postal_code']}],
'formatted_address': '22 Rue de Berri, 75008 Paris, France',
'geometry': {'location': {'lat': 48.8728822, 'lng': 2.3054154},
'location_type': 'ROOFTOP',
'viewport': {'northeast': {'lat': 48.8743208802915,
'lng': 2.306719730291501},
'southwest': {'lat': 48.8716229197085, 'lng': 2.304021769708497}}},
'place_id': 'ChIJWxDbRsFv5kcRRcfu62JSRog',
'plus_code': {'compound_code': 'V8F4+55 Paris, France',
'global_code': '8FW4V8F4+55'},
'types': ['establishment', 'lodging', 'point_of_interest']}],
'status': 'OK'}
Notice how Google has provided the different components of the address, such as street number, locality, etc. Now you can do a weighted/fuzzy matching between these components. It's up to you whether you want all of them to match or to apply rules, for example that the street number must always match while for the other components it's okay if 4 out of 5 match. You can also consider the distance between the coordinates (note: use the Haversine function and not just Euclidean distance; reference: https://towardsdatascience.com/calculating-distance-between-two-geolocations-in-python-26ad3afe287b ). You can then compute a weighted score, which should be greater than a threshold for the two addresses to be considered the same place.
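A minimal sketch of such a weighted score, using only the standard library. It reads the `address_components` list in the shape shown above; the weights, the hard rule on the street number, and the use of `difflib.SequenceMatcher` as the fuzzy measure are all assumptions to adjust for your data. The sample records are trimmed by hand from the response above, with `result_b` and `result_c` being invented variations.

```python
from difflib import SequenceMatcher

# Assumed weights (they sum to 1.0); tune them for your own data.
WEIGHTS = {"street_number": 0.4, "route": 0.3, "locality": 0.1, "postal_code": 0.2}

def components(result: dict) -> dict:
    """Map each component type to its long_name for one geocoding result."""
    out = {}
    for comp in result["address_components"]:
        for kind in comp["types"]:
            out.setdefault(kind, comp["long_name"])
    return out

def similarity(a: str, b: str) -> float:
    """Case-insensitive fuzzy similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def match_score(r1: dict, r2: dict) -> float:
    c1, c2 = components(r1), components(r2)
    # Hard rule: the street number must match exactly.
    if c1.get("street_number") != c2.get("street_number"):
        return 0.0
    score = 0.0
    for key, weight in WEIGHTS.items():
        if key in c1 and key in c2:  # a component missing on either side scores 0
            score += weight * similarity(c1[key], c2[key])
    return score

def make_result(number, route, city, postcode):
    """Build a trimmed-down result dict in the shape shown above."""
    return {"address_components": [
        {"long_name": number, "types": ["street_number"]},
        {"long_name": route, "types": ["route"]},
        {"long_name": city, "types": ["locality", "political"]},
        {"long_name": postcode, "types": ["postal_code"]},
    ]}

result_a = make_result("22", "Rue de Berri", "Paris", "75008")
result_b = make_result("22", "rue de berri", "Paris", "75008")  # invented variation
result_c = make_result("24", "Rue de Berri", "Paris", "75008")  # invented variation

print(match_score(result_a, result_b))
print(match_score(result_a, result_c))
```

Two addresses would then be treated as the same place when the score exceeds a threshold of your choosing, optionally combined with a Haversine distance check on the `geometry.location` values.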