C# Regex Replace weird behavior with multiple captures and matching at the end of string?

I'm trying to write something that formats Brazilian phone numbers, but I want it to match from the end of the string rather than the beginning, so that it transforms input strings according to the following pattern:

"5135554444" -> "(51) 3555-4444"
"35554444" -> "3555-4444"
"5554444" -> "555-4444"

Since the beginning portion is what usually changes, I thought of building the match using the $ sign so it would start at the end and then capture backwards (or so I thought), replacing with the desired final format, and afterwards just getting rid of the parentheses "()" in front if they were empty.

This is the C# code:

s = "5135554444";
string str = Regex.Replace(s, @"\D", ""); //Get rid of non digits, if any
str = Regex.Replace(str, @"(\d{0,2})(\d{0,4})(\d{1,4})$", "($1) $2-$3");
return Regex.Replace(str, @"^\(\) ", ""); //Get rid of empty () at the beginning

The return value was as expected for a 10-digit number, but anything shorter produced some strange behavior. These were my results:

"5135554444" -> "(51) 3555-4444"
"35554444" -> "(35) 5544-44"
"5554444" -> "(55) 5444-4"

It seems to ignore the $ at the end when matching, except that if I test with fewer than 7 digits it goes like this:

"554444" -> "(55) 444-4"
"54444" -> "(54) 44-4"
"4444" -> "(44) 4-4"

Notice that the third capture group keeps only its "minimum" {n} number of digits, always captured from the end, while the first two groups capture from the beginning, as if the last group were non-greedy from the end and just took the minimum... is this weird, or is it just me?

Now, if I change the pattern, so instead of {1,4} on the third capture I use {4} these are the results:

str = Regex.Replace(str, @"(\d{0,2})(\d{0,4})(\d{4})$", "($1) $2-$3");

"5135554444" -> "(51) 3555-4444" //As expected
"35554444" -> "(35) 55-4444" //The last four are as expected, but "35" as $1?
"54444" -> "(5) -4444" //Again "4444" in $3, why nothing in $2 and "5" in $1?

I know this is probably some stupidity of mine, but wouldn't it be more reasonable, if I want to capture at the end of the string, for all the preceding capture groups to be captured in reverse order?

I would have thought that "54444" would turn into "5-4444" in this last example, but it does not...

How would one accomplish this?

(I know there may be a better way to accomplish the very same thing using a different approach, but what I'm really curious about is why this particular behavior of the Regex seems odd. So the answer to this question should focus on explaining why the last capture is anchored at the end of the string and why the others are not, as demonstrated in this example. I'm not particularly interested in the actual phone number formatting problem, but in understanding the Regex syntax)...

Thanks...


So you want the third part to always have four digits, the second part zero to four digits, and the first part zero to two digits, but only if the second part contains four digits?

Use

^(\d{0,2}?)(\d{0,4})(\d{4})$

As a C# snippet, commented:

resultString = Regex.Replace(subjectString, 
  @"^             # anchor the search at the start of the string
    (\d{0,2}?)    # match as few digits as possible, maximum 2
    (\d{0,4})     # match up to four digits, as many as possible
    (\d{4})       # match exactly four digits
    $             # anchor the search at the end of the string", 
   "($1) $2-$3", RegexOptions.IgnorePatternWhitespace);

By adding a ? to a quantifier (??, *?, +?, {a,b}?) you make it lazy, i.e. you tell it to match as few characters as possible while still allowing an overall match to be found.
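To see the difference in isolation, here is a minimal sketch. It is not from the original post: the class name LazyVsGreedyDemo, the helper Groups, and the simplified two-group patterns are made up for illustration. It runs the same input through a greedy and a lazy first group.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class; the names are illustrative only.
public static class LazyVsGreedyDemo
{
    // Returns "group1|group2" for the first match, or "no match".
    public static string Groups(string pattern, string input)
    {
        Match m = Regex.Match(input, pattern);
        return m.Success ? m.Groups[1].Value + "|" + m.Groups[2].Value : "no match";
    }

    public static void Main()
    {
        // Greedy: the first group takes as many digits as it can (two).
        Console.WriteLine(Groups(@"^(\d{0,2})(\d*)$", "123456"));   // 12|3456

        // Lazy: the first group takes as few digits as it can (zero),
        // because the rest of the pattern can still match everything.
        Console.WriteLine(Groups(@"^(\d{0,2}?)(\d*)$", "123456"));  // |123456
    }
}
```

In both cases the engine works left to right. The only thing the ? changes is whether the first group starts by trying its longest or its shortest option.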

Without the ? in the first group, what would happen when trying to match 123456?

First, the \d{0,2} matches 12.

Then, the \d{0,4} matches 3456.

Then, the \d{4} doesn't have anything left to match, so the regex engine backtracks until that's possible again. After four steps, the \d{4} can match 3456. The \d{0,4} gives up everything it had matched greedily for this.

Now an overall match has been found, so there is no need to try any more combinations. Therefore, the first and third groups will contain parts of the match, and the second group is left empty.
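The walkthrough above can be checked with a small sketch. The class BacktrackingDemo and its Capture helper are hypothetical names, not part of the original answer. It prints the three groups for 123456, once with the greedy first group and once with the lazy one.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class; the names are illustrative only.
public static class BacktrackingDemo
{
    // Returns the three captured groups wrapped in brackets, e.g. "[12][][3456]".
    public static string Capture(string pattern, string input)
    {
        Match m = Regex.Match(input, pattern);
        if (!m.Success) return "no match";
        return "[" + m.Groups[1].Value + "][" + m.Groups[2].Value + "][" + m.Groups[3].Value + "]";
    }

    public static void Main()
    {
        // Greedy first group: it keeps "12", and the second group has to
        // give back everything so that \d{4} can match "3456".
        Console.WriteLine(Capture(@"^(\d{0,2})(\d{0,4})(\d{4})$", "123456"));   // [12][][3456]

        // Lazy first group: it starts empty, so the second group only has
        // to give back two digits before \d{4} can match "3456".
        Console.WriteLine(Capture(@"^(\d{0,2}?)(\d{0,4})(\d{4})$", "123456"));  // [][12][3456]
    }
}
```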


You have to tell it that it's OK if the first matching groups aren't there, but not the last one:

(\d{0,2}?)(\d{0,4}?)(\d{1,4})$

Matches your examples properly in my testing.
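For reference, here is a rough sketch of the whole pipeline from the question with this pattern dropped in. The class name PhoneFormatter and the method Format are assumptions made for illustration; the three Regex.Replace calls mirror the ones in the question.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical wrapper around the three Replace calls from the question.
public static class PhoneFormatter
{
    public static string Format(string s)
    {
        // Get rid of non-digits, if any.
        string str = Regex.Replace(s, @"\D", "");
        // Lazy first and second groups, greedy last group anchored at the end.
        str = Regex.Replace(str, @"(\d{0,2}?)(\d{0,4}?)(\d{1,4})$", "($1) $2-$3");
        // Get rid of an empty "() " at the beginning.
        return Regex.Replace(str, @"^\(\) ", "");
    }

    public static void Main()
    {
        Console.WriteLine(Format("5135554444")); // (51) 3555-4444
        Console.WriteLine(Format("35554444"));   // 3555-4444
        Console.WriteLine(Format("5554444"));    // 555-4444
    }
}
```

Note that this only covers the three inputs from the question. Inputs with four digits or fewer leave the second group empty, so the result starts with a stray "-" that would need its own cleanup.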
