Regex: Parse streetname/number
C#/.NET 2.0
I need to parse a string containing the street name and the house number into two separate values.
in: "Streetname 1a" out: "streetname" "1a"
in: "Street name 1a" out: "street name" "1a"
in: "Street name 1 a" out: "street name" "1 a"
My first choice was to split the string where I found a " " char but that will not work for the second case.
result[0] = trimmedInput.Substring(0, splitPosition).Trim();
result[1] = trimmedInput.Substring(splitPosition + 1).Trim();
What is the best way to do this? Can I use regular expressions?
Thanks
^(.+)\s(\S+)$
should do the trick
EDIT: this will work if the house number can't have spaces in it. Otherwise this problem can't be solved programmatically, since the program will never know the semantics of the string tokens.
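A minimal sketch of how the pattern above could be applied in C#. The class and method names (LastTokenSplitter, Split) are made up for illustration. Because (.+) is greedy, the split lands on the last whitespace character, which is exactly why a house number containing a space is not handled.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of any library.
public static class LastTokenSplitter
{
    // Greedy (.+) grabs as much as it can, so \s matches the LAST whitespace.
    private static readonly Regex Pattern = new Regex(@"^(.+)\s(\S+)$");

    // Returns { street, number }, or null when there is no whitespace to split on.
    public static string[] Split(string input)
    {
        Match m = Pattern.Match(input.Trim());
        if (!m.Success)
        {
            return null;
        }
        return new string[] { m.Groups[1].Value.Trim(), m.Groups[2].Value };
    }
}
```

With this sketch, "Street name 1a" splits into "Street name" and "1a", while "Street name 1 a" splits into "Street name 1" and "a", which illustrates the limitation mentioned in the EDIT.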
House addresses are messy and inconsistent. I worked with address data and honestly, if you don't have the data in normalized form, you're basically screwed.
^(.+)\s(\d+(\s*[^\d\s]+)*)$
will cover some more cases, but a pattern like that is a can of worms if I ever saw one.
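For illustration, here is a rough sketch of the second pattern in use. The names (DigitAnchoredSplitter, TrySplit) are hypothetical. The number group must start with digits and may be followed by whitespace-separated non-digit tokens, so "1 a" stays together.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of any library.
public static class DigitAnchoredSplitter
{
    // Group 1: street name. Group 2: digits plus optional trailing non-digit tokens.
    private static readonly Regex Pattern =
        new Regex(@"^(.+)\s(\d+(\s*[^\d\s]+)*)$");

    // Returns true and fills street/number when the input matches the pattern.
    public static bool TrySplit(string input, out string street, out string number)
    {
        street = null;
        number = null;
        Match m = Pattern.Match(input.Trim());
        if (!m.Success)
        {
            return false;
        }
        street = m.Groups[1].Value.Trim();
        number = m.Groups[2].Value;
        return true;
    }
}
```

Under these assumptions all three inputs from the question split as requested, but an input without any digits (for example "Street name") is rejected rather than split.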
You have to define the pattern you're looking for more clearly, assuming there even is one. There need to be some general observations you can make that will always hold:
- A street address consists of a name and a number.
- The name always appears first.
- The name consists of one or more words, separated by spaces.
- The number is a number followed by an optional letter.
From a comment, the last point isn't strictly true because the number & letter portion of the street number can be separated by whitespace.
If you can't guarantee the order of the street name & number, and also that the words in the street name do not contain numbers, then I'm not really sure that anything is going to help you.
The following regex should cover most cases:
Regex reggie = new Regex(@"^(?<name>\w[\s\w]+?)\s*(?<num>\d+\s*[a-z]?)$", RegexOptions.IgnoreCase);
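A short sketch of reading the named groups from that regex. The wrapper class and method names (NamedGroupParser, TryParse) are invented for this example. Note that the name group only allows word characters and whitespace, so street names containing hyphens, dots or apostrophes would not match.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical wrapper around the regex shown above.
public static class NamedGroupParser
{
    private static readonly Regex Reggie = new Regex(
        @"^(?<name>\w[\s\w]+?)\s*(?<num>\d+\s*[a-z]?)$",
        RegexOptions.IgnoreCase);

    // Returns true and fills name/num when the input matches.
    public static bool TryParse(string input, out string name, out string num)
    {
        name = null;
        num = null;
        Match m = Reggie.Match(input.Trim());
        if (!m.Success)
        {
            return false;
        }
        // Named groups are read by name rather than by index.
        name = m.Groups["name"].Value.Trim();
        num = m.Groups["num"].Value;
        return true;
    }
}
```

With this sketch, "Street name 1 a" yields the name "Street name" and the number "1 a".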
As Dyppl stated, street addresses are messy. But, if your address data represents US addresses and you have the complete address (including city, state, and/or ZIP Code) you could use an address verification service to parse (and verify!) and standardize the components. I work for SmartyStreets, an address verification provider. Here's a quick C# example I wrote a while back that calls our LiveAddress API:
https://github.com/smartystreets/LiveAddressSamples/blob/master/c-sharp/street-address.cs
Here's the resulting output for that example (notice that the street name and primary number are parsed in the "components" section):
[
  {
    "input_index": 0,
    "candidate_index": 0,
    "delivery_line_1": "3214 N University Ave",
    "last_line": "Provo UT 84604-4405",
    "delivery_point_barcode": "846044405140",
    "components": {
      "primary_number": "3214",
      "street_predirection": "N",
      "street_name": "University",
      "street_suffix": "Ave",
      "city_name": "Provo",
      "state_abbreviation": "UT",
      "zipcode": "84604",
      "plus4_code": "4405",
      "delivery_point": "14",
      "delivery_point_check_digit": "0"
    },
    "metadata": {
      "record_type": "S",
      "county_fips": "49049",
      "county_name": "Utah",
      "carrier_route": "C016",
      "congressional_district": "03",
      "latitude": 40.27586,
      "longitude": -111.6576,
      "precision": "Zip9"
    },
    "analysis": {
      "dpv_match_code": "Y",
      "dpv_footnotes": "AABBR1",
      "dpv_cmra": "Y",
      "dpv_vacant": "N",
      "ews_match": false
    }
  }
]
We provide an absolutely free subscription for low-usage users. Here's a link that explains all the fields:
http://wiki.smartystreets.com/liveaddress_api_users_guide#json-responses
EDIT: included latitude/longitude fields (newly released).
First, try to find the number by using String.LastIndexOf() to split at a possible position. Afterwards, check whether this last group contains any digits, for example with splittedValue.Any(c => Char.IsDigit(c)); (note that Any() is a LINQ extension method, so it needs .NET 3.5 or later). If you find digits within this last group, you can be fairly sure that the split was correct, but there may be addresses out there that don't match this behaviour.
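The LastIndexOf idea could look roughly like the sketch below. The names (LastSpaceSplitter, TrySplit, ContainsDigit) are hypothetical, and a plain loop stands in for Any() so that the code also compiles against .NET 2.0, where LINQ is not available.

```csharp
using System;

// Hypothetical helper, not part of any library.
public static class LastSpaceSplitter
{
    // Splits at the last space and accepts the split only if the tail has a digit.
    public static bool TrySplit(string input, out string street, out string number)
    {
        street = null;
        number = null;
        if (input == null) return false;

        string trimmed = input.Trim();
        int pos = trimmed.LastIndexOf(' ');
        if (pos < 0) return false;

        string tail = trimmed.Substring(pos + 1);
        if (!ContainsDigit(tail)) return false;

        street = trimmed.Substring(0, pos).Trim();
        number = tail;
        return true;
    }

    // Plain loop instead of LINQ's Any().
    private static bool ContainsDigit(string value)
    {
        foreach (char c in value)
        {
            if (Char.IsDigit(c)) return true;
        }
        return false;
    }
}
```

In this sketch "Street name 1a" splits into "Street name" and "1a", whereas "Street name 1 a" is rejected because the last token "a" contains no digit. The caller then has to decide how to handle such inputs.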
Update
If you really have such noisy data that must be normalized, I think you can't do anything better than what @Dyppl said: use a complicated regular expression and keep evolving it as you find samples that don't work.
This assumes all your "addresses" are formatted in at least one of the ways mentioned above.
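string address = "Streetname 1a";
// Strip the leading run of non-digits; what remains is the house number ("1a").
string number = Regex.Replace(address, "^[^0-9]+", "");
// Everything before the number is the street name ("Streetname ").
string street = address.Substring(0, address.Length - number.Length);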
Then trim both values.