开发者

Database with inconsistencies

(I'm using SQLite3)

My table has fields like

Name
Zipcode
State
City
State
StreetAddress

The issue is that in the name field, people entered data that is inconsistently formatted

For example,

Name (single column)
====================
John Smith MD
Jason Young M.D
Charlie Horse MD.
Kim Powell PH.d
Sandy Fox Phd M.D.

I want to clean up the MD to be saved as M.D. and Phd/PH.d to be PhD (pl开发者_JS百科ease remember that the name is 1 field).

The street address also has issues like multiple spaces between words instead of 1.

Can someone please explain what I can do to format the data? I could do it manually but it would take too long!


You shouldn't simply accept data from a user interface and immediately persist it in a database without validating and scrubbing. You're safer from SQL injection that way, too.

I don't know what object(s) you have between your view and persistence tiers, but I'd suggesting adding a validation and binding layer in your controller or service layer, preferably the controller.


You can try using Google Refine (a free downloadable program) http://code.google.com/p/google-refine/ Great for all sorts of data cleaning...


For the address, you may want to validate it against your region/country specific address database.

  • Open Source Address Scrubber?
  • -

I want to clean up the MD to be saved as M.D. and Phd/PH.d to be PHd (please remember that the name is 1 field).

Find all the variations of MD and PHD and run a series of updates (replace function):

update tbl set name = replace(name, 'M.D.', 'MD')
update tbl set name = replace(name, 'MD.', 'MD')
update tbl set name = replace(name, 'M.D', 'MD')
update tbl set name = replace(name, 'M.D', 'MD')
update tbl set name = replace(name, ' MD', ' MD')  -- fix case, hope no name starts with "MD"
update tbl set name = replace(name, 'PHD', 'PhD')  -- fix case
update tbl set name = replace(name, 'PH.D', 'PhD')  -- fix case
update tbl set name = replace(name, 'PH-D', 'PhD')  -- fix case
etc for any other variants you can think of or encounter

The street address also has issues like multiple spaces between words instead of 1

Replace two spaces with one, multiple times.. after removing linebreaks. To more quickly collapse series of spaces, we use 8->1 twice, then 4->1, then 2->1

update tbl set address = replace(address, '\r', ' ')
update tbl set address = replace(address, '\n', ' ')
update tbl set address = replace(address, '        ', ' ')
update tbl set address = replace(address, '        ', ' ')
update tbl set address = replace(address, '    ', ' ')
update tbl set address = replace(address, '    ', ' ')
update tbl set address = replace(address, '  ', ' ')
update tbl set address = replace(address, '  ', ' ')


This is a series of string translations. Most of them are easy, and regex's may not be required for many of them.

Simple algorithm:

1) Split the name on spaces

2) discard empty values, these are the results of two or more spaces

3) Take each value, convert to upper case, remove periods, dashes or anything else and look for it in a table of known suffixes, like "MD", "PHD", and replace it with the desired value.

4) From step 3, anything that does not match, leave as-is

Obviously this requires you to build a list of known suffixes like MD, etc. This is not that bad. Large professional packages that do this in Direct Marketing handle it that way.

5) Recombine the resulting values

There is no definition of correct for formatting names, it is a question of how far you want to go for how much uniformity you require.

Adding to that, Duffy's advice is sound.


All of the above are sound and probably necessary. I'd like to add that it is also preferable to limit the input on fields as much as possible. For example the suffixes field should be limited to pre-validated values by using a drop-down list. This simplifies the data entry and data verification.


Regular expressions and string operations will may work for finite subsets of address data but there are lots of exception cases in the address industry. The USPS is the authority here (at least for US addresses), so why not utilize a service that is certified with the USPS?

For the addresses, try out this batch processing service:

http://www.smartystreets.com/products/cass-certified-scrubbing

SmartyStreets also offers an API with a free subscription for low-usage users. I'm a software developer at SmartyStreets and have helped with both of these services.

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜