
regex substring C#

I need help figuring out a regular expression.

I have

string = "STATE changed from [Fixed] to [Closed], CLOSED DATE added [Fri Jan 14 09:32:19 MST 2011], NOTES changed from [CLOSED[]<br />] to [TEST CLOSED <br />]"

I need to grab NOTES changed from [CLOSED[]<br />] to [TEST CLOSED <br />] and store the values CLOSED[] and TEST CLOSED in two string variables.

So far I got to:

Regex NotesChanged = new Regex(@"NOTES changed from \[(\w*|\W*)\] to \[([\w-|\W-]*)\]");

which matches only if "NOTES changed from" is at the start of the string, there is no '[]' inside the '[ ]' (but I have "[CLOSED[]]"), and there is no "<br />". Any ideas on what to change in the regex?

Thanks, Sharma


If "<br />" is going to be there every time, you can use one of my favourite patterns (and it's worth memorizing). The pattern is:

delim[^delim]*delim

The pattern above will match a delimiter, followed by anything except the delimiter as many times as possible, then the delimiter again.

Here is the regular expression I would be tempted to use:

NOTES changed from \[([^<]*)[^\]]*\] to \[([^<]*)[^\]]*\]

In English:

  • Grabs the opening [
  • Capture #1 all characters until the < (assuming the br tag is always there)
  • Reads until the closing ]
  • Repeat for second capture zone
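As a rough illustration of the steps above, here is a minimal C# sketch that applies this pattern to the string from the question. The class name BrDelimitedNotes and the Extract helper are hypothetical names chosen for the example, and the sketch assumes the <br /> tag is always present, as this answer does.

```csharp
using System;
using System.Text.RegularExpressions;

static class BrDelimitedNotes
{
    // Pattern from the answer: capture up to the first '<', then skip to the closing ']'.
    static readonly Regex NotesChanged = new Regex(
        @"NOTES changed from \[([^<]*)[^\]]*\] to \[([^<]*)[^\]]*\]");

    // Hypothetical helper: returns { from, to }, or null when there is no NOTES entry.
    public static string[] Extract(string input)
    {
        Match m = NotesChanged.Match(input);
        if (!m.Success) return null;
        // Trim because the capture keeps any space that precedes "<br />".
        return new[] { m.Groups[1].Value.Trim(), m.Groups[2].Value.Trim() };
    }

    static void Main()
    {
        string input = "STATE changed from [Fixed] to [Closed], "
            + "CLOSED DATE added [Fri Jan 14 09:32:19 MST 2011], "
            + "NOTES changed from [CLOSED[]<br />] to [TEST CLOSED <br />]";
        string[] notes = Extract(input);
        Console.WriteLine(notes[0]); // CLOSED[]
        Console.WriteLine(notes[1]); // TEST CLOSED
    }
}
```

The Trim call is there because the second capture would otherwise end with the space that sits before the <br /> tag.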


This is kind of weird...

(\w*|\W*)

That's a capturing group that matches either a run of word characters or a run of non-word characters (zero or more of each), but never a mix of the two.

What you want to do if you have matching brackets is to create a pattern which doesn't consume the delimiter.

\[([^\]]+)\]

That will match any occurrence of [with some text in it] where the matched text is the first group in the match.
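To see what this simple pattern does on the string from the question, a small sketch along these lines may help. The class name BracketFields and the Find helper are hypothetical. Note how the nested [] cuts the fourth field short, which is the problem addressed next.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class BracketFields
{
    // Hypothetical helper: collects group 1 of every [ ... ] occurrence.
    public static List<string> Find(string input)
    {
        var fields = new List<string>();
        foreach (Match m in Regex.Matches(input, @"\[([^\]]+)\]"))
            fields.Add(m.Groups[1].Value);
        return fields;
    }

    static void Main()
    {
        string input = "STATE changed from [Fixed] to [Closed], "
            + "CLOSED DATE added [Fri Jan 14 09:32:19 MST 2011], "
            + "NOTES changed from [CLOSED[]<br />] to [TEST CLOSED <br />]";
        // Prints, one per line:
        //   Fixed
        //   Closed
        //   Fri Jan 14 09:32:19 MST 2011
        //   CLOSED[              (cut short at the first ']')
        //   TEST CLOSED <br />
        foreach (string field in Find(input))
            Console.WriteLine(field);
    }
}
```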

Since you have the same type of delimiter nested within the string itself, it gets a bit trickier, and you need to use lookahead or some sort of alternation.

((?:[^\[\]]|\[\])*)

This can be further improved, but one problem remains that this pattern cannot solve: deeper nesting such as [[[]]]. The .NET regex engine has no recursive pattern syntax (its closest tool is balancing groups), so with this approach you need to either hard-code a maximum depth or apply the regular expression several times.
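Here is a minimal sketch of the alternation idea applied to the NOTES entry, assuming an empty [] is the only nesting that occurs. The class name OneLevelNotes and the helpers Extract and Clean are hypothetical, and stripping the <br /> tag afterwards is one possible clean-up step, not something the pattern does itself.

```csharp
using System;
using System.Text.RegularExpressions;

static class OneLevelNotes
{
    // Inside the outer brackets, allow any non-bracket character or a literal "[]".
    const string Field = @"\[((?:[^\[\]]|\[\])*)\]";

    static readonly Regex NotesChanged =
        new Regex("NOTES changed from " + Field + " to " + Field);

    // Hypothetical helper: returns { from, to } with the <br /> tag removed, or null.
    public static string[] Extract(string input)
    {
        Match m = NotesChanged.Match(input);
        if (!m.Success) return null;
        return new[] { Clean(m.Groups[1].Value), Clean(m.Groups[2].Value) };
    }

    // Drops the literal "<br />" tag and surrounding whitespace.
    static string Clean(string value)
    {
        return value.Replace("<br />", "").Trim();
    }

    static void Main()
    {
        string input = "NOTES changed from [CLOSED[]<br />] to [TEST CLOSED <br />]";
        string[] notes = Extract(input);
        Console.WriteLine(notes[0]); // CLOSED[]
        Console.WriteLine(notes[1]); // TEST CLOSED
    }
}
```

With deeper nesting, for example [A[[]]], this pattern finds no match at all, which is the limitation described above.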

A fairly exhaustive way of doing this would be

\[((?:[^\[\]]*)(?:(?=\[)(?:[^\]]*)\])?([^\]]*))\]


Try adding "\[|\]" to the alternation in the first bracket group, and repeat the group as a whole so that it can match a mix of characters.

Regex NotesChanged = new Regex(@"NOTES changed from \[((?:\w|\W|\[|\])*?)\] to \[([\w-|\W-|\[|\]]*)\]");


I believe you can use balancing group definitions to match the nested brackets. I believe these are .NET specific, at least in that particular implementation flavor. There's an example on that page, which I've adapted to your input here:

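using System;
using System.Text.RegularExpressions;

class Program {
    static void Main (string[] args) {
        var input = "STATE changed from [Fixed] to [Closed], CLOSED DATE added [Fri Jan 14 09:32:19 MST 2011], NOTES changed from [CLOSED[]] to [TEST CLOSED ]";
        // 'open' is pushed on each "[", and 'close-open' pops it on each "]" (a .NET balancing group).
        var regex = new Regex(@"NOTES changed from (((?'open'\[)[^\[\]]*)+((?'close-open'\])[^\[\]]*)+)*");

        // Print every match; one is expected for this input.
        foreach (var match in regex.Matches(input)) {
            Console.WriteLine(match);
        }
    }
}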

This prints NOTES changed from [CLOSED[]] to [TEST CLOSED ] for me. Note that in my adaptation I left off the part of the expression that makes the match fail if the square brackets are not properly balanced, in order to reduce my example to the bare minimum that would satisfy your request... the expression is already pretty unpleasantly complex.

EDIT: Just saw your question got edited a bit while I was posting. The parts of the regex I've supplied here that match "anything but [ and ]" should be able to be replaced with capture groups for the substrings you need to extract.
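One way the capture-group idea from the EDIT could look is sketched below. This is not the expression from the answer: it is a variant built on the usual .NET balancing-group idiom, with the balance check included. The class name BalancedNotes, the helpers Extract and Clean, and the group names from, to and open are all choices made for this example.

```csharp
using System;
using System.Text.RegularExpressions;

static class BalancedNotes
{
    // Field content that may contain nested, balanced [ ] pairs.
    // "open" acts as a stack: pushed on '[', popped on ']';
    // (?(open)(?!)) fails the match if any '[' is left unclosed.
    const string Balanced =
        @"(?>[^\[\]]+|(?<open>\[)|(?<-open>\]))*(?(open)(?!))";

    static readonly Regex NotesChanged = new Regex(
        @"NOTES changed from \[(?<from>" + Balanced + @")\]"
        + @" to \[(?<to>" + Balanced + @")\]");

    // Hypothetical helper: returns { from, to } with the <br /> tag removed, or null.
    public static string[] Extract(string input)
    {
        Match m = NotesChanged.Match(input);
        if (!m.Success) return null;
        return new[] { Clean(m.Groups["from"].Value), Clean(m.Groups["to"].Value) };
    }

    // Drops the literal "<br />" tag and surrounding whitespace.
    static string Clean(string value)
    {
        return value.Replace("<br />", "").Trim();
    }

    static void Main()
    {
        string input = "STATE changed from [Fixed] to [Closed], "
            + "CLOSED DATE added [Fri Jan 14 09:32:19 MST 2011], "
            + "NOTES changed from [CLOSED[]<br />] to [TEST CLOSED <br />]";
        string[] notes = Extract(input);
        Console.WriteLine(notes[0]); // CLOSED[]
        Console.WriteLine(notes[1]); // TEST CLOSED
    }
}
```

Unlike a fixed-depth pattern, this should also handle deeper nesting such as [a[b[c]]], since the stack grows with each opening bracket.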


If you have the luxury of fixing the regex with specific keywords or phrases, the following would work:

NOTES changed from (?:(?:\[)?([A-Z]+\[\]))<br />\] to \[([A-Z]+\s+[A-Z]+)

The above would match the string NOTES changed from [CLOSED[]<br />] to [TEST CLOSED and put CLOSED[] and TEST CLOSED into 2 separate groups.

Update

In fact you can make this even shorter (and a bit more non-specific) by using the . specifier:

NOTES changed from (?:(?:\[)?([A-Z]+\[\])).+\[([A-Z]+\s+[A-Z]+)

This means it will match like the above, only instead of being specific about matching the <br /> tags etc in between it will match regardless of what is in between.
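A quick way to compare the two keyword-specific patterns from this answer is sketched below. The class name KeywordNotes, the constant names Specific and Loose, and the Extract helper are hypothetical. Both patterns assume upper-case words, as the answer does.

```csharp
using System;
using System.Text.RegularExpressions;

static class KeywordNotes
{
    // First pattern from the answer: spells out the <br /> tag between the fields.
    public const string Specific =
        @"NOTES changed from (?:(?:\[)?([A-Z]+\[\]))<br />\] to \[([A-Z]+\s+[A-Z]+)";

    // Shorter pattern from the update: ".+" skips whatever sits between the fields.
    public const string Loose =
        @"NOTES changed from (?:(?:\[)?([A-Z]+\[\])).+\[([A-Z]+\s+[A-Z]+)";

    // Hypothetical helper: returns { group 1, group 2 }, or null when nothing matches.
    public static string[] Extract(string pattern, string input)
    {
        Match m = Regex.Match(input, pattern);
        if (!m.Success) return null;
        return new[] { m.Groups[1].Value, m.Groups[2].Value };
    }

    static void Main()
    {
        string input = "STATE changed from [Fixed] to [Closed], "
            + "CLOSED DATE added [Fri Jan 14 09:32:19 MST 2011], "
            + "NOTES changed from [CLOSED[]<br />] to [TEST CLOSED <br />]";
        foreach (string pattern in new[] { Specific, Loose })
        {
            string[] notes = Extract(pattern, input);
            Console.WriteLine(notes[0]); // CLOSED[]
            Console.WriteLine(notes[1]); // TEST CLOSED
        }
    }
}
```

Because ".+" is greedy, the shorter pattern backtracks from the end of the string to the last "[" that is followed by two upper-case words, so it relies on the NOTES entry being the last such field in the input.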
