negated character class matching a sequence of characters

I have a bit of a problem, as I'm a bit of a Regex newbie. If it matters, I'm using .NET's Regex class for this.

First a description of the data I'm using the regex on:

Here is my expression:

(".*[^".]);(.*")

This matches these two lines (they're part of larger rows from a table in a Mainframe computer, but will essentially look something like this):

Example 1: 7906143476.;180.;903.;1.;1970.;8.;42.;1.;327.;"SST9001";"S;T GORANS SJH "

Example 2: 2.;"1;AVD INGENJOR ";"N";"J";" ";

It also matches this line, which I would like it not to match:

;"U";33.75;777.;" ";
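To see why the unwanted line matches, it can help to look at what the two groups capture. The following is a small Python sketch, not .NET code; the helper name `groups_for` is made up for illustration, and the pattern is the one quoted above, which uses only syntax that Python's `re` and .NET's `Regex` share.

```python
import re

# The expression from the question, unchanged.
ORIGINAL = re.compile(r'(".*[^".]);(.*")')

def groups_for(line):
    """Return the two captured groups, or None if the line does not match."""
    m = ORIGINAL.search(line)
    return m.groups() if m else None

# Print both groups for the line that should not match.
print(groups_for(';"U";33.75;777.;" ";'))
```

For the third line, group 1 ends in `33.75` and group 2 begins at `777.`, so the semicolon that gets matched sits between two unquoted fields. The negated class `[^".]` only rejects a quote or a dot immediately before the semicolon, and the `5` in `.75` passes that check.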

The purpose of the regex is to find all semi-colons (;) within quote signs (") and replace them with colons (:). There may not always be a semi-colon before and after the quote signs (as in example 1). The output I want is:

7906143476.;180.;903.;1.;1970.;8.;42.;1.;327.;"SST9001";"S:T GORANS SJH "

2.;"1:AVD INGENJOR ";"N";"J";" ";

The last line should remain untouched, because it has a dot (.) and any number of digits before the semi-colon.

I would like to be able to match all of these possible lines with a single regex. I already have a solution with multiple regexes, but I would like a better way of doing it. I'm not really familiar with negative/positive lookahead/lookbehind, but I have a feeling that the solution is somewhere in that area.

I was first thinking of something along the lines of a grouping inside the negated character class, so that I could negate the .75, which is in the first group $1 in the line I don't want matched at all. It could be any number instead of 75, though.

Any help would be great, as I'm no good at regular expressions at all.

Thank you!


So, in other words, you want semicolons to be replaced with colons only if they are part of a quoted string?

Assuming that quotes are correctly balanced, and that there are no quotes present within quoted strings (as in "2\" by 4\""), then you can do this:

resultString = Regex.Replace(subjectString, 
    @";            # Match a ;
    (?=            # if it's followed by an odd number of quotes -- namely:
     [^""\r\n]*    # 0+ non-quote, non-linebreak characters
     ""            # One quote
     (?:           # followed by...
      [^""\r\n]*"" # an even number of non-quote-quote sequences
      [^""\r\n]*""
     )*            # zero or more times
     [^""\r\n]*    # followed by zero or more non-quotes
     $             # until the end of the line.
    )              # End of lookahead", 
    ":", RegexOptions.Multiline | RegexOptions.IgnorePatternWhitespace);

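As a quick way to try the lookahead idea outside .NET, here is a Python port of the same pattern. It is only a sketch: the names `SEMICOLON_IN_QUOTES` and `replace_quoted_semicolons` are invented for this example, and the quotes are written singly because Python's raw strings do not need the doubled `""` of a C# verbatim string.

```python
import re

# A semicolon counts as "inside quotes" when an odd number of quotes
# follows it on the same line.
SEMICOLON_IN_QUOTES = re.compile(
    r"""
    ;                  # a semicolon
    (?=                # followed by an odd number of quotes, namely:
      [^"\r\n]*        #   zero or more non-quote, non-linebreak characters
      "                #   one quote
      (?:              #   then whole pairs of quotes:
        [^"\r\n]*"     #     text up to a quote
        [^"\r\n]*"     #     text up to the next quote
      )*               #   zero or more pairs
      [^"\r\n]*        #   trailing non-quote characters
      $                #   up to the end of the line
    )
    """,
    re.MULTILINE | re.VERBOSE,
)

def replace_quoted_semicolons(text: str) -> str:
    """Replace every semicolon inside a quoted field with a colon."""
    return SEMICOLON_IN_QUOTES.sub(":", text)

print(replace_quoted_semicolons('2.;"1;AVD INGENJOR ";"N";"J";" ";'))
```

The same assumptions apply as for the .NET version: quotes are balanced on each line and never appear escaped inside a quoted string.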

An alternative suggestion: split on ';', then go through the array. If the current string starts with " but does not end in ", join it with the next string with a ':' in between, and continue until the closing '"' is found or the end of the array is reached.

Then join all elements with ';' and print.

BTW, couldn't the '"' occur escaped? This would complicate matters a bit for all solutions.
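The split-and-join suggestion above could look roughly like this in Python. This is a sketch with one deliberate deviation: instead of testing the first and last character, it treats a piece as "still open" while it contains an odd number of quotes, which also covers a field that is nothing but `";"`. The function name `replace_by_splitting` is hypothetical, and escaped quotes are not handled.

```python
def replace_by_splitting(line: str) -> str:
    """Replace semicolons inside quoted fields by splitting and rejoining."""
    parts = line.split(";")
    out = []
    i = 0
    while i < len(parts):
        piece = parts[i]
        # An odd number of quotes means the split cut through a quoted field,
        # so glue the next piece back on with a colon.
        while piece.count('"') % 2 == 1 and i + 1 < len(parts):
            i += 1
            piece += ":" + parts[i]
        out.append(piece)
        i += 1
    # Semicolons that separated complete fields are restored unchanged.
    return ";".join(out)

print(replace_by_splitting('2.;"1;AVD INGENJOR ";"N";"J";" ";'))
```

A field with several semicolons is handled by the inner loop, which keeps joining until the quotes balance or the pieces run out.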


Check this regex:

(?<=("[^"]*";)|([^"];)+)"[^"]*[;][^"]*"

It matches anything between quotes that has at least one semicolon in it, but only if it is preceded by something else in quotes or by something that is not in quotes. This avoids your problem; I checked it with the strings you provided.


"[^";\n]*?(;)*?[^";\n]*?" works without any lookaround construct, so its performance should be better than that of the other suggested solutions. All you have to do is replace group 1 (\1) with a colon.

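Replacing only one group of a match generally needs a callback rather than a plain replacement string (in .NET, the `Regex.Replace` overload that takes a `MatchEvaluator`). A related lookaround-free variant, sketched here in Python, is a swapped-in technique rather than the exact pattern above: it matches each whole quoted field and rewrites the semicolons inside it. The names `QUOTED_FIELD` and `replace_in_quoted_fields` are made up for the example, and it assumes balanced, unescaped quotes.

```python
import re

# One complete quoted field: a quote, any non-quote characters, a quote.
QUOTED_FIELD = re.compile(r'"[^"\n]*"')

def replace_in_quoted_fields(text: str) -> str:
    """Rewrite semicolons to colons inside every quoted field."""
    return QUOTED_FIELD.sub(lambda m: m.group(0).replace(";", ":"), text)

print(replace_in_quoted_fields('2.;"1;AVD INGENJOR ";"N";"J";" ";'))
```

Because matches are consumed left to right, the text between two quoted fields is never mistaken for a quoted field itself, and a field with several semicolons is handled in one pass.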