How to match multiple sub-strings with a regular expression, even if they're optional?
Note: This is .NET regular expressions.
I have a bunch of text, from which I need to extract specific lines. The lines I care about have the following forms:
type Name(type arg1, type arg2, type arg3)
To match this, I came up with the following regular expression:
^(\w+)\s+(\w+)\s*\(\s*((\w+)\s+(\w+)(:?,\s+)?)*\s*\)$
This confusing mess produces a Match object that looks like this:
Group 0: type Name(type arg1, type arg2, type arg3)
Capture 0: type Name(type arg1, type arg2, type arg3)
Group 1: type
Capture 0: type
Group 2: Name
Capture 0: Name
Group 3: type arg3
Capture 0: type arg1,
Capture 1: type arg2,
Capture 2: type arg3
Group 4: type
Capture 0: type
Capture 1: type
Capture 2: type
Group 5: arg3
Capture 0: arg1
Capture 1: arg2
Capture 2: arg3
Group 6:
Capture 0: ,
Capture 1: ,
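For reference, a dump like the one above can be produced by walking `Match.Groups` and, for each group, its `Captures` collection. This is only a minimal sketch; the `MatchDump` class and `Describe` method are names made up for illustration. It relies on the .NET behaviour that a group inside a repeated construct records one capture per iteration in which it actually participated, while `Group.Value` holds only the last one.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper: lists every group of a match and every capture of each group.
public static class MatchDump
{
    public static List<string> Describe(Match m)
    {
        var lines = new List<string>();
        for (int g = 0; g < m.Groups.Count; g++)
        {
            Group group = m.Groups[g];
            // Group.Value is the text of the group's last capture.
            lines.Add($"Group {g}: {group.Value}");
            for (int c = 0; c < group.Captures.Count; c++)
            {
                // Captures holds one entry per successful iteration of the group.
                lines.Add($"  Capture {c}: {group.Captures[c].Value}");
            }
        }
        return lines;
    }
}
```

Calling `MatchDump.Describe(Regex.Match(line, pattern))` and printing each returned string gives the same Group/Capture layout as the listing.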
However, this is not the full input. Some of these lines might look like this:
type Name(type arg1, type[] arg2, type arg3)
Note the brackets before arg2.
So, I modified my regular expression:
^(\w+)\s+(\w+)\s*\(\s*((\w+)\s*(\[\])?\s+(\w+)(:?,\s+)?)*\s*\)$
This produces a Match like this:
Group 0: type Name(type arg1, type[] arg2, type arg3)
Capture 0: type Name(type arg1, type[] arg2, type arg3)
Group 1: type
Capture 0: type
Group 2: Name
Capture 0: Name
Group 3: type arg3
Capture 0: type arg1,
Capture 1: type[] arg2,
Capture 2: type arg3
Group 4: type
Capture 0: type
Capture 1: type
Capture 2: type
Group 5: []
Capture 0: []
Group 6: arg3
Capture 0: arg1
Capture 1: arg2
Capture 2: arg3
Group 7:
Capture 0: ,
Capture 1: ,
Group 5 does, in fact, contain the brackets. However, its only capture has index 0, which does not tell me which repetition of the argument group it belongs to (here, the second one).
Is there some way to correlate this capture to the appropriate group, or am I barking up the wrong tree?
An alternative way to implement this, I guess, would be to parse the arguments in the input separately. But surely there's a way to do it this way, isn't there?
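One possible way to correlate the captures, offered as a sketch rather than a definitive answer: every `Capture` in .NET exposes `Index` and `Length`, so a capture of the optional bracket group can be attributed to whichever capture of the enclosing argument group contains its position. The code below assumes the `(:?` in the pattern was meant to be the non-capturing `(?:`, which leaves groups 1 to 6 numbered as in the question; the `CaptureCorrelation` class and `Parse` method are hypothetical names.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class CaptureCorrelation
{
    // The question's second pattern, with "(:?" changed to "(?:" (an assumption).
    // Groups: 1 = return type, 2 = name, 3 = whole argument,
    // 4 = argument type, 5 = optional "[]", 6 = argument name.
    private static readonly Regex Signature = new Regex(
        @"^(\w+)\s+(\w+)\s*\(\s*((\w+)\s*(\[\])?\s+(\w+)(?:,\s+)?)*\s*\)$");

    // Returns one entry per argument, or an empty list if the line does not match.
    public static List<(string Type, bool IsArray, string Name)> Parse(string line)
    {
        var result = new List<(string Type, bool IsArray, string Name)>();
        Match m = Signature.Match(line);
        if (!m.Success) return result;

        CaptureCollection args = m.Groups[3].Captures;
        for (int i = 0; i < args.Count; i++)
        {
            Capture arg = args[i];
            // Groups 4 and 6 capture on every iteration, so their indexes line up with group 3.
            string type = m.Groups[4].Captures[i].Value;
            string name = m.Groups[6].Captures[i].Value;

            // Group 5 is optional, so find it by position rather than by index.
            bool isArray = false;
            foreach (Capture brackets in m.Groups[5].Captures)
            {
                bool inside = brackets.Index >= arg.Index
                    && brackets.Index + brackets.Length <= arg.Index + arg.Length;
                if (inside)
                {
                    isArray = true;
                    break;
                }
            }
            result.Add((type, isArray, name));
        }
        return result;
    }
}
```

For the line `type Name(type arg1, type[] arg2, type arg3)`, only the second entry has `IsArray` set. The positional check is what makes this work; the capture indexes of group 5 alone carry no information about the iteration.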
EDIT:
To clarify, I'm not building a language parser. I'm converting old textual API documentation for a scripting language, which looks like this:
--- foo object ---
void bar(int baz)
* This does something.
* Remember blah blah blah.
int getFrob()
* Gets the frob
Into a new format that I can export to HTML, etc.
Edit mkII: For others' benefit, here's the new revised code:
// Phase 1: match "type Name(...)" and capture the raw argument list in group 3.
m = Regex.Match(line, @"^(\w+)\s+(\w+)\s*\((.*?)\)$");
if (m.Success) {
    // Store the previous member before starting a new one.
    if (curMember != null) {
        curType.Add(curMember);
    }
    curMember = new XElement("method");
    curMember.Add(new XAttribute("type", m.Groups[1].Value));
    curMember.Add(new XAttribute("name", m.Groups[2].Value));
    if (m.Groups[3].Success) {
        XElement args = new XElement("arguments");
        // Phase 2: match each "type name" or "type[] name" pair in the argument list.
        MatchCollection matches = Regex.Matches(m.Groups[3].Value, @"(\w+)(\[\])?\s+(\w+)");
        foreach (Match m2 in matches) {
            XElement arg = new XElement("arg");
            arg.Add(new XAttribute("type", m2.Groups[1].Value));
            // Group 2 only succeeds when the type is followed by "[]".
            if (m2.Groups[2].Success) {
                arg.Add(new XAttribute("array", "array"));
            }
            arg.Value = m2.Groups[3].Value;
            args.Add(arg);
        }
        curMember.Add(args);
    }
}
First, it matches the type Name(*) part, and when it gets that, it matches type Name repeatedly on the parameter part.
I would do this as a two-phase parser.
First, make sure you know what you have. In that phase, you don't care about the individual matching groups.
The second phase actually tries to make sense of it all. From the first phase it is easy to get everything within the parentheses, but parsing the arguments is harder. So take the text within the parentheses, split it on the , character, and parse the arguments one by one.
If that's too hard, for example because multi-dimensional arrays are allowed ([,]), create a regular expression that eats the first argument from the argument list. You then know how long that argument is, so you can remove that part from the list and repeat on what is left.
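The simpler, split-based variant might look roughly like the following. It is a sketch under assumptions: the `SplitParser` class, the `ParseArguments` method, and both patterns are illustrative, and it only works when a comma can never appear inside a single argument.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class SplitParser
{
    // Phase 1: recognise the line and grab everything between the parentheses.
    private static readonly Regex Line = new Regex(@"^(\w+)\s+(\w+)\s*\((.*)\)$");

    // Phase 2: make sense of one argument at a time ("type name" or "type[] name").
    private static readonly Regex Argument = new Regex(@"^(\w+)\s*(\[\])?\s+(\w+)$");

    public static List<(string Type, bool IsArray, string Name)> ParseArguments(string line)
    {
        var result = new List<(string Type, bool IsArray, string Name)>();
        Match m = Line.Match(line);
        if (!m.Success) return result;

        // Split the raw argument list on commas and parse each piece on its own.
        foreach (string piece in m.Groups[3].Value.Split(','))
        {
            Match a = Argument.Match(piece.Trim());
            if (a.Success)
            {
                result.Add((a.Groups[1].Value, a.Groups[2].Success, a.Groups[3].Value));
            }
        }
        return result;
    }
}
```

Because each argument is matched separately, the optional brackets always belong to the argument being parsed, so the correlation problem from the question does not arise.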
1. Match the entire line and produce the part within the parentheses:
   "type Name(type arg1, type[] arg2, type arg3)" => "type arg1, type[] arg2, type arg3"
2. Parse the arguments:
   a. Eat the first argument of the list of arguments:
      "type arg1, type[] arg2, type arg3" => "type", "arg1"
   b. Remove the length of the parsed argument from the list of arguments:
      "type arg1, type[] arg2, type arg3" => ", type[] arg2, type arg3"
      ", type[] arg2, type arg3".TrimStart(new char[]{ ',', ' ' }) => "type[] arg2, type arg3"
   c. If the string is not empty: lather, rinse, repeat.
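The eat-and-trim loop from these steps could be sketched as follows. The `EatingParser` class, the `ParseArguments` method, and the pattern are assumptions made for illustration; the array suffix is allowed to contain commas so that `[,]` is handled, which is the case where plain splitting breaks down.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class EatingParser
{
    // Matches exactly one argument at the start of the remaining text.
    // The optional suffix accepts "[]", "[,]", "[,,]" and so on.
    private static readonly Regex FirstArgument = new Regex(@"^(\w+)\s*(\[,*\])?\s+(\w+)");

    // Takes the text found between the parentheses and returns one entry per argument.
    public static List<(string Type, string ArraySuffix, string Name)> ParseArguments(string arguments)
    {
        var result = new List<(string Type, string ArraySuffix, string Name)>();
        string rest = arguments.TrimStart(',', ' ');
        while (rest.Length > 0)
        {
            // Step a: eat the first argument.
            Match m = FirstArgument.Match(rest);
            if (!m.Success) break; // unrecognised input: stop instead of looping forever

            // ArraySuffix is an empty string when the argument is not an array.
            result.Add((m.Groups[1].Value, m.Groups[2].Value, m.Groups[3].Value));

            // Step b: drop the parsed argument and the separator that follows it.
            rest = rest.Substring(m.Length).TrimStart(',', ' ');
            // Step c: the loop condition repeats while anything is left.
        }
        return result;
    }
}
```

Since the pattern is anchored at the start of the string, `m.Length` is exactly the number of characters consumed by the argument, which is what step b needs.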