Detect/Parse Mailing Addresses in Text
Are there any open source/commercial libraries out there that can detect mailing addresses in text, just like how Apple's Mail app underlines addresses on the Mac/iPhone开发者_Python百科.
I've been doing a little online research and the ideas seem to be either to use Google, Regex or a full on NLP package such as Stanford's NLP, which usually are pretty massive. I doubt iPhone has a 500MB NLP package in there, or connects to Google every time you read an email. Which makes me to believe there should be an easier way. Too bad UIDataDetectors is not open source.
I know this question has been asked before, but there were no conclusive answers, so here's my try.
As for Python you can try Pyap: https://pypi.python.org/pypi/pyap
It currently supports US and Canadian addresses
Parsing addresses isn't a science. At my office we have been dealing with address parsing for years and the problem is that there aren't any rules about what constitutes a valid address. We use the USPS address database for cleaning addresses which is actually pretty fast and way more accurate than we were ever able to get on our own. It gets us 98% accuracy where as before we got about 90% cleaned addresses.
The bigger problem with address parsing tends to be that people don't input the address the same way. The same address might be in all the following forms.
128 E Beaumont St
128 East Beaumont Street
128 E Bmt St
128 Beaumont Street
128 Highway 88
The third one looks totally wrong but people will type that sometimes. Sometimes a street is also a highway. There are a bunch of possibilities. Just try to catch 90% and you accept that is as good as it gets for address parsing.
Extractiv provides commercial NLP powered by Language Computer Corporation that can parse entities and relations in either uploaded documents or from web crawls. The former service utilizes a REST API. I dropped this URL in, and it extracts 4/5 of the addresses. Note, having them strung like that together makes them especially difficult.
Search for "address" in this JSON output: http://rest.extractiv.com/extractiv/?url=https://stackoverflow.com/questions/5099684/detect-parse-mailing-addresses-in-text&output_format=json
One of them:
{
"id": 11,
"len": 17,
"offset": 1557,
"text": "128 E Beaumont St",
"type": "ADDRESS"
},
(Note: if you use the HTML output, which is more for demos, it filters out non-sentence content, which is why I showed the JSON instead).
Disclaimer: I work at Extractiv.
Update: Extractiv is no more.
You can actually get extremely high accuracy as Drew mentioned by extracting the addresses and then comparing them against the USPS data. Getting a DVD from the USPS yearly will certainly work but doesn't factor in the addresses that change. For that, you would want a more up-to-date version. The USPS publishes it's updated address data (in proprietary format) monthly so that would be a good source of authoritative addresses.
On top of that, using an address validation service (after you extract the address data) will standardize the addresses for you and then check them for deliverability and/or vacancy status. As Drew mentioned, the same address can be written in many different ways that still work. However, the USPS will always use the standardized format.
In order to do what you are looking for programmatically, you'll definitely want an API, although list processing services are also available.
SmartyStreets has a free address validation API called LiveAddress that will standardize, verify, and then validate any US postal address. In the interest of full disclosure, I'm the founder of SmartyStreets.
精彩评论