MySQL street address fuzzy search
Does anyone know of a good script (or good strategy) for doing a fuzzy street address search on a MySQL database? The key issues are:
- capitalization (easy -- just use LCASE)
- punctuation (could use REPLACE; not sure it there are more efficient op开发者_如何学Gotions)
- abbreviations (this is the tough one -- street = st, etc.)
I want to be able to match: 123 Main st, unit B = 123 Main Street Unit b
While it's not perfect and can be quite slow, you'll want to use a regular expression via REGEXP().
Here's a first-pass regular expression to match most cases (as well as your example):
(?isx) # search across multiple lines and ignore case
( # full match
( # st number - what about number words like one or two?
\d+
)
\s+ # whitespace
( # street name (one or more words)
[a-z]+
(?:
\s+
[a-z]+
)*
)
\s+ # whitespace
( # street type
al(?:y\.?|ley) # aly, aly. or alley
|
ave(?:\.|nue)? # ave, ave., or avenue
|
b(?lvd\.?|oulevard) # blvd, blvd. or boulevard
|
c(?:t\.?|ourt) # ct, ct. or court
|
cir(?:\c\.?|cle)? # cir, circ, circ. or circle
|
cres(?:\.|cent)? # cres, cres. or crescent
|
dr(?:\.|ive)? # dr, dr. or drive
|
exp(?:y\.?|ressway) # expy, expy. or expressway
|
f(?:wy\.?|reeway) # fwy, fwy. or freeway
|
g(?:rdns\.?|ardens) # grdns, grdns. or gardens
|
h(?:wy\.?|ighway) # hwy, hwy. or highway
|
l(?n\.?|ane) # ln, ln. or land
|
m(?:nr\.?|anor) # mnr, mnr. or manor
|
m(?:trwy\.?|otorway) # mtrwy, wtrwy. or motorway
|
pl(?:\.|ace)? # pl, pl. or place
|
r(?:d\.?|oad) # rd, rd. or road
|
st(?:\.|reet)? # st, st. or street
|
t(?:pk\.?|urnpike) # tpk, tpk. or turnpike
|
ter(?:\r?\.?|race) # ter, ter., terr, terr. or terrace
|
tr(?:l.\?|ail) # trl, trl. or trail
|
pike|park|walk|loop|bay|close|gate|highlands
|
row|way|oval|dell|rise|vale|byway|lawn
)
\,? # optional comma
\s+ # whitespace
( # optional number, unit, apt or floor
(
\# # number
|
unit # unit
|
num(?:\.|ber) # num, num. or number
|
ap(?:t\.?|artment) # apt, apt. or apartment
|
fl(?:\.|oor)? # fl, fl. or floor
)
\s+
\d+
)?
)
Which will return:
$1 - full match
$2 - house number
$3 - street name
$4 - street type
$5 - unit or apt number (optional)
To use this in mysql, you'll need to strip out all the comments (from '#' to eol), remove the first line (switch options), and collapse everything to a single line without any whitespace.
Doing a good job of finding matches is hopeless unless you standardize first. Once you pass your addresses through standardization software, searches are trivial because exact matches are sufficient.
you can use address data from Experian QAS. its on a per click license. The fuzz search works nice on the data and addressees are more accurate.
What you're describing is address standardization as Mike pointed out. I work in this field for SmartyStreets (where addresses is what we do).
The USPS has certain guidelines called Publication 28 which describes the expected format of addresses. If you intend to use the addresses at all you'll want to standardize according to that format and, in most cases, perform an exact string comparison of the standardized strings.
The LiveAddress API does such a thing, but if you already have a list of addresses to standardize, you could look into a list processing service.
精彩评论