Regex to find commas that aren't inside "( and )"

I need some help building a regular expression. I think it'll be easier with an example. I need a regular expression that matches a comma, but only if it's not inside the structure "( ... )", like this:

,a,b,c,d,"("x","y",z)",e,f,g,

Here the first five and the last four commas should match the expression; the two between x, y and z, inside the ( ) section, shouldn't.

I tried a lot of combinations, but regular expressions are still a little foggy for me.

I want to use it with the split method in Java. The example is short, but the input can be much longer and have more than one section between "( and )". The split method receives an expression, and any text that matches it (in this case the comma) is treated as a separator.

So, I want to do something like this:

String keys[] = row.split(expr);
System.out.println(keys[0]); // print a
System.out.println(keys[1]); // print b
System.out.println(keys[2]); // print c
System.out.println(keys[3]); // print d
System.out.println(keys[4]); // print "("x","y",z)"
System.out.println(keys[5]); // print e
System.out.println(keys[6]); // print f
System.out.println(keys[7]); // print g

Thanks!


You can do this with a negative lookahead. Here's a slightly simplified problem to illustrate the idea:

String text = "a;b;c;d;<x;y;z>;e;f;g;<p;q;r;s>;h;i;j";

String[] parts = text.split(";(?![^<>]*>)");

System.out.println(java.util.Arrays.toString(parts));
//  _  _  _  _  _______  _  _  _  _________  _  _  _
// [a, b, c, d, <x;y;z>, e, f, g, <p;q;r;s>, h, i, j]

Note that instead of ,, the delimiter is now ;, and instead of "( and )", the parentheses are simply < and >, but the idea still works.


On the pattern

The […] is a character class. Something like [aeiou] matches any one of the lowercase vowels. [^…] is a negated character class. [^aeiou] matches any one character except the lowercase vowels.

The * repetition specifier matches the preceding pattern zero or more times.

The (?!…) is a negative lookahead; it can be used to assert that a certain pattern DOES NOT match, looking ahead (i.e. to the right) of the current position.

The pattern [^<>]*> matches a (possibly empty) sequence of characters other than the parentheses < and >, followed by a closing parenthesis >.

Putting all of the above together, we get ;(?![^<>]*>), which matches a ;, but only if the first parenthesis to its right is not a closing one, because seeing a closing parenthesis first would mean that the ; is "inside" the parentheses.
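To see which semicolons the lookahead accepts, the pattern can be run through a Matcher instead of split. This is only a small sketch: the class name LookaheadDemo, the helper delimiterPositions and the short sample string are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LookaheadDemo {
    // Returns the index of every ';' that the pattern accepts as a delimiter.
    static List<Integer> delimiterPositions(String text) {
        Pattern outside = Pattern.compile(";(?![^<>]*>)");
        Matcher m = outside.matcher(text);
        List<Integer> positions = new ArrayList<>();
        while (m.find()) {
            positions.add(m.start());
        }
        return positions;
    }

    public static void main(String[] args) {
        // Indices:     012345678
        String text = "a;<x;y>;b";
        // The ';' at index 4 is rejected: the first parenthesis to its right is '>'.
        System.out.println(delimiterPositions(text)); // [1, 7]
    }
}
```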

This technique, with some modifications, can be adapted to the original problem. Remember to escape the regex metacharacters ( and ) where necessary, and of course both " and \ must be escaped with a preceding \ inside a Java string literal.
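One possible adaptation to the original input is sketched below. It keys on the parentheses only and ignores the surrounding quotes, and it assumes the parentheses are balanced, never nested, and never appear outside a "( ... )" section. The class name SplitOutsideParens and the constant OUTSIDE_COMMA are made up for this example.

```java
import java.util.Arrays;

public class SplitOutsideParens {
    // A ',' that is NOT followed by "non-parenthesis characters, then ')'".
    // In the Java literal, \\) becomes the regex \) (a literal closing parenthesis).
    static final String OUTSIDE_COMMA = ",(?![^()]*\\))";

    static String[] splitOutsideParens(String row) {
        return row.split(OUTSIDE_COMMA);
    }

    public static void main(String[] args) {
        String row = ",a,b,c,d,\"(\"x\",\"y\",z)\",e,f,g,";
        String[] keys = splitOutsideParens(row);
        System.out.println(Arrays.toString(keys));
        // [, a, b, c, d, "("x","y",z)", e, f, g]
    }
}
```

Note that because the row starts with a comma, split returns an empty string as the first element, so a ends up at keys[1] rather than keys[0]. The empty string after the trailing comma is dropped by split.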

You can also make the * possessive to try to improve performance, i.e. ;(?![^<>]*+>).

References

  • regular-expressions.info/Character class, Repetition, Lookarounds, Possessive


Try this one:

(?![^(]*\)),

It worked for me in my testing and matched all commas not inside parentheses.

Edit: Gopi pointed out the need to escape the backslash in a Java string:

(?![^(]*\\)),

Edit: Alan Moore pointed out some unnecessary complexity. Fixed.


If the parens are paired correctly and cannot be nested, you can split the text first at parens, then process the chunks.

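// Needs java.util.ArrayList and java.util.List.
List<String> result = new ArrayList<String>();
String[] chunks = text.split("[()]");
for (int i = 0; i < chunks.length; i++) {
  if ((i % 2) == 0) {
    // Even-indexed chunks lie outside the parentheses: split them on commas.
    String[] atoms = chunks[i].split(",");
    for (int j = 0; j < atoms.length; j++)
      result.add(atoms[j]);
  }
  else
    // Odd-indexed chunks were inside parentheses: keep them whole (parens are dropped).
    result.add(chunks[i]);
}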
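A self-contained variant of the chunk approach might look like the sketch below. It makes two changes that are not part of the answer above: it skips the empty atoms produced by a comma sitting right next to a parenthesis, and it puts the parentheses back around the inner chunks. The class name ChunkSplit is made up, and the same assumptions apply (balanced, non-nested parentheses).

```java
import java.util.ArrayList;
import java.util.List;

public class ChunkSplit {
    static List<String> splitOutsideParens(String text) {
        List<String> result = new ArrayList<>();
        String[] chunks = text.split("[()]");
        for (int i = 0; i < chunks.length; i++) {
            if (i % 2 == 0) {
                // Even chunks are outside parentheses: split them on commas.
                for (String atom : chunks[i].split(",")) {
                    if (!atom.isEmpty()) { // drop empties left by a comma next to a paren
                        result.add(atom);
                    }
                }
            } else {
                // Odd chunks were inside parentheses: keep whole, restore the parens.
                result.add("(" + chunks[i] + ")");
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(splitOutsideParens("a,b,(x,y),c")); // [a, b, (x,y), c]
    }
}
```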


Well,

After some tests I found an answer that does what I need for now. At the moment, all items inside the "( ... )" block are themselves inside "", as in "("a", "b", "c")", so the regex ((?<!\"),)|(,(?!\")) works great for what I want!
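A quick way to check that regex is sketched below. The class name QuoteAdjacentSplit and the sample row are made up, and the sample deliberately has no spaces after the inner commas: a comma followed by a space is not followed by ", so the second alternative would match it and split there.

```java
import java.util.Arrays;

public class QuoteAdjacentSplit {
    // A comma is a delimiter unless it has a double quote on BOTH sides:
    // either it is not preceded by ", or it is not followed by ".
    static final String EXPR = "((?<!\"),)|(,(?!\"))";

    static String[] split(String row) {
        return row.split(EXPR);
    }

    public static void main(String[] args) {
        String row = "a,b,\"(\"x\",\"y\",\"z\")\",c";
        System.out.println(Arrays.toString(split(row)));
        // [a, b, "("x","y","z")", c]
    }
}
```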

But I'm still looking for one that can find the commas even if the inner terms are not wrapped in "".

Thanks for the help, guys.


This should do what you want:

(".*")|([a-z])

I didn't check it in Java, but if you test it with http://www.fileformat.info/tool/regex.htm, the groups $1 and $2 contain the right values, so you should get what you want. It gets a little trickier if the values between the commas are more complex than a-z.

If I understand split correctly, don't use it; just fill your array from the backreference $0, which holds the values you are looking for. A match function may be the better way, because you work with the values themselves and end up with a really simple regex. The other solutions I've seen so far are very good, no question about that, but they are complicated enough that in two weeks you won't remember exactly what the regex did. Inverting the problem often makes it simpler.
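In Java, the matching approach could be sketched roughly as follows, using Matcher.find() and group() (the whole match, i.e. $0) in place of split. The class name MatchInsteadOfSplit and the helper tokens are made up for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MatchInsteadOfSplit {
    // Collects the values themselves instead of splitting on the separators.
    static List<String> tokens(String row) {
        Matcher m = Pattern.compile("(\".*\")|([a-z])").matcher(row);
        List<String> found = new ArrayList<>();
        while (m.find()) {
            found.add(m.group()); // group() is the whole match, i.e. $0
        }
        return found;
    }

    public static void main(String[] args) {
        String row = ",a,b,c,d,\"(\"x\",\"y\",z)\",e,f,g,";
        System.out.println(tokens(row));
        // [a, b, c, d, "("x","y",z)", e, f, g]
    }
}
```

One caveat: .* is greedy, so if a row contains more than one quoted section, the first alternative matches everything from the first " to the last " as a single value.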


I had the same issue. I chose Adam Schmideg's answer and improved it.

For example, I had to deal with these three strings:

  1. France (Grenoble, Lyon), Germany (Berlin, Munich)
  2. Italy, Suede, Belgium, Portugal
  3. France, Italy (Torino), Spain (Barcelona, Madrid), Austria

The idea was to have:

  1. France (Grenoble, Lyon) or Germany (Berlin, Munich)
  2. Italy, Suede, Belgium, Portugal
  3. France, Italy (Torino), Spain (Barcelona, Madrid), Austria

I chose not to rely on a regex because I wanted to be 100% sure of what I was doing and that it would work in every case.

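// Split at parentheses; odd-indexed chunks are the parenthesized parts.
String[] chunks = input.split("[()]");
for (int i = 0; i < chunks.length; i++) {
    if ((i % 2) != 0) {
        // Hide the inner commas behind ';' and restore the parentheses.
        chunks[i] = "(" + chunks[i].replace(",", ";") + ")";
    }
}
// Reassemble the string, then split on the commas that remain (the outer ones).
StringBuilder buffer = new StringBuilder();
for (String chunk : chunks) {
    buffer.append(chunk);
}
String s = buffer.toString();
String[] output = s.split(",");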