I need help splitting addresses (number, addition, etc)
Apologies for the fuzzy title...
My problem is this: I have a SQL Server table persons with about 100,000 records. Every person has an address, something like "Nieuwe Prinsengracht 12 - III". The customer now wants to separate the street from the number and addition (so each address becomes two or three fields). The problem is that we cannot be sure of the format the current address is in; it could also simply be something like "Velperweg 30".
The only thing we do know about it is that it's a piece of text, followed by a number, possibly followed by some more text (which can contain a number).
A possible solution would be to do this with regexes, but I would much (much, much) rather do this using a query. Is there any way to use regexes in a query? Or do you have any other suggestions how to solve such a problem?
Something like this maybe?
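SELECT
    -- Street: everything before the first digit. Appending '0' guarantees a match,
    -- so addresses without any digit no longer produce a negative SUBSTRING length.
    rtrim(substring([address_field], 1, patindex('%[0-9]%', [address_field] + '0') - 1)) as [STREET],
    -- Number and addition: everything from the first digit onward (empty if there is none).
    substring([address_field], patindex('%[0-9]%', [address_field] + '0'), len([address_field])) as [NUMBER_ADDITION]
FROM
    [table]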
It relies on the assumption that the [street] field will not contain any numbers, and the [number_addition] field will begin with a number.
SQL Server and T-SQL are rather limited in their processing prowess. If you're really serious about heavy lifting with regexes and the like, your best bet is probably to create an assembly in C# or VB.NET that does all the tricky regex business, deploy it into SQL-CLR, and use its functions from T-SQL.
"Pure" T-SQL cannot really handle much string manipulation: there is SUBSTRING, CHARINDEX and the like, but that's about it.
In answer to your question "Is there any way to use regexes in a query?": yes, there is, but it needs a little .NET knowledge. Create a CLR assembly with a user-defined function that does your regex work. Visual Studio 2008 has a template project for this. Deploy it to your SQL Server and call it from your query.
Name and Address parsing and standardization is probably one of the most difficult problems we can encounter as programmers for precisely the reasons you've mentioned.
I assume that address parsing is not the main business of whoever you work for. My advice is to buy a solution rather than build one of your own.
I am familiar with this company. Your address examples do not appear to be US or Canadian, so I don't know whether their products would be useful, but they may be able to point you to another vendor.
Other than being a user of their products, I am not affiliated with them in any way.
This sounds like the common "take a piece of complex text that could look like anything and make it look like what we now want it to look like" problem. These tend to be very hard to do using only T-SQL (which does not have native regex functionality). You will probably have to work with complex code outside of the database to solve this problem.
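As a rough illustration of what such code outside the database might look like, here is a minimal Python sketch. It assumes only the shape described in the question (text, then a number, then optional extra text). The function name `split_address` and the choice to drop a separating " - " before the addition are assumptions for this example, not something from the original data.

```python
import re

# Assumed shape (from the question): street text, a number, optional addition.
# \D+? takes as few non-digit characters as possible, so the street ends
# right before the first number; [\s-]* skips a separator such as " - ".
ADDRESS_RE = re.compile(r"^(\D+?)\s*(\d+)[\s-]*(.*)$")


def split_address(address):
    """Return (street, number, addition); unmatched input comes back unsplit."""
    cleaned = address.strip()
    match = ADDRESS_RE.match(cleaned)
    if match is None:
        # No number found, or the address starts with a digit: leave it
        # for manual review instead of guessing.
        return cleaned, "", ""
    street, number, addition = match.groups()
    return street, number, addition.strip()


print(split_address("Nieuwe Prinsengracht 12 - III"))
print(split_address("Velperweg 30"))
```

Addresses that start with a digit (such as "17 Septemberplein 5") or contain no number at all are returned unsplit, which makes them easy to pick out for manual handling.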
TGnat is correct. Address standardization is complicated.
I've encountered this problem before.
If your customer doesn't want to spring for the custom software, develop a simple GUI that allows a person to take an address and split it manually. You'd delete the address row with the old format and insert the row with the new address format.
It wouldn't take long for typists familiar with your addresses to manually make 100,000 changes. Of course, it's up to the customer if he wants to spend the money on custom software or typists.
But you shouldn't be stuck with the data cleaning bill, either.
I realize that this is an old question, but for future reference I decided to add an answer using a regex (also so I don't forget it myself). Today I ran into a similar problem in Excel, where I also had to split addresses into street and house number. In the end, I copied the column to SublimeText (a shareware text editor) and used a regex to do the job (CTRL-H, enable regex):
FIND: ^('?\d?\d?\d?['-\.a-zA-Z ]*)(\d*).*$
REPLACE FOR THE HOUSE NUMBER: $2
REPLACE FOR THE STREET NAME: $1
Some notes:
- Some addresses started with a quote, e.g. 't Hofje, so I needed to add '?
- Some addresses contained digits at the start, e.g. 17 Septemberplein or 2e Molendwarsstraat, so I added \d?\d?\d?
- Some addresses contained a - (e.g. Willem-Alexanderlaan) or a ', so both are allowed in the character class
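For anyone who prefers a script over an editor, the same idea can be sketched in Python with the standard `re` module. The function name `split_street_number` is made up for this example, and the pattern is a slight restatement of the one above: `\d{0,3}` replaces `\d?\d?\d?`, and the characters `'`, `.` and `-` are listed explicitly in the character class instead of via the range `'-\.`.

```python
import re

# Group 1: optional leading quote, up to three leading digits, then letters,
# spaces, quotes, dots and hyphens (the street). Group 2: the house number.
# Anything after the number (the addition) is matched by .* and discarded.
STREET_NUMBER_RE = re.compile(r"^('?\d{0,3}['.a-zA-Z -]*)(\d*).*$")


def split_street_number(address):
    """Return (street, house_number) for a single-line address string."""
    match = STREET_NUMBER_RE.match(address)
    if match is None:
        # Only expected for input containing line breaks.
        return address, ""
    return match.group(1).strip(), match.group(2)


for example in ["'t Hofje 7", "17 Septemberplein 5", "2e Molendwarsstraat 14a"]:
    print(split_street_number(example))
```

Like the editor version, this drops the addition (the "a" in "14a"); a third capture group around the trailing `.*` would keep it.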