How to parse postal addresses from HTML (high tolerance, low strictness)

I am looking for ideas on how to extract postal addresses from various web sources. I'm using HtmlAgilityPack to convert the HTML to an XDocument (C# 4.0).

I'm not looking to break the address down into components; I just want the address as a whole. I'm willing to accept a fairly high level of inaccuracy.

The addresses will potentially come from AU, UK, CA and US sites.

This answer provides a good regex solution


It looks like the regex solution (provided above) will get you a fair number of the addresses. You mentioned that you are willing to accept a fairly high level of inaccuracy, but you don't necessarily have to. Depending on how clean you can get the data, you can follow up with some address list cleanup, or "scrubbing" as it is sometimes called. That means taking a malformed address (how malformed depends on how badly it was scraped from the HTML) and running it first through a standardization engine and then through a verification engine. Many times, this will take an undeliverable address and return a fully qualified, deliverable one.

I'm speaking of USPS (USA) addresses because that is what I have experience with, but I'm sure some other countries have similar services. These scrubbing services can be either real-time or batch, depending on your needs, and most of them are relatively quick as well. Hope this helps.
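To make the "loose regex first, scrub later" idea concrete, here is a minimal sketch. It is written in Python rather than the asker's C#, and everything in it is an assumption for illustration: the names `html_to_text`, `extract_candidates` and `ADDRESS_RE` are hypothetical, the pattern is not the one from the linked answer, and it only targets US-style addresses (AU, UK and CA postcodes would each need their own pattern). The pattern uses only basic constructs, so it should port to .NET's `Regex` with little change.

```python
import re
from html import unescape

# Hypothetical street-type suffixes; extend the list for your own sources.
STREET_SUFFIX = r"(?:Street|St|Avenue|Ave|Road|Rd|Boulevard|Blvd|Drive|Dr|Lane|Ln|Way|Court|Ct)"

# Deliberately loose, US-style only: house number, one to four street-name
# words, a street suffix, then an optional "City, ST 12345[-6789]" tail.
ADDRESS_RE = re.compile(
    r"\b\d{1,6}\s+"
    r"(?:[A-Za-z0-9.'-]+\s+){1,4}?"
    + STREET_SUFFIX
    + r"\b\.?"
    r"(?:,?\s+[A-Za-z .'-]+?,\s*[A-Z]{2}\s+\d{5}(?:-\d{4})?)?"
)


def html_to_text(html):
    """Crude tag stripping: replace tags with spaces, decode entities, collapse whitespace."""
    text = re.sub(r"<[^>]+>", " ", html)
    text = unescape(text)
    return re.sub(r"\s+", " ", text).strip()


def extract_candidates(html):
    """Return de-duplicated address candidates in order of first appearance."""
    seen = set()
    candidates = []
    for match in ADDRESS_RE.finditer(html_to_text(html)):
        candidate = match.group(0).strip(" ,.")
        key = candidate.lower()
        if key not in seen:
            seen.add(key)
            candidates.append(candidate)
    return candidates


SAMPLE_HTML = (
    "<div>123 Main St,<br>Springfield, IL 62704</div>"
    "<p>Find us: 123 Main St, Springfield, IL 62704</p>"
)

if __name__ == "__main__":
    for address in extract_candidates(SAMPLE_HTML):
        print(address)
```

The candidates this produces will include false positives and partial matches, which is acceptable here: the point is to hand a rough list to whichever standardization and verification service you choose, and let that step reject or repair the bad ones.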

I work for an address verification company called SmartyStreets.
