
Regex Nth match or Last Match if < N matches

I'm trying to find the nth match, or the last match if there are fewer than n. n is determined within my program and the regex string is constructed with 'n' replaced by an integer.

Here is my best guess, but my repetition operator {1,n} always matches just once. I thought it would be greedy by default.

The basic regex would be:
distinctiveString[\s\S]*?value="([^"]*)"

So I modified it to this to try to get the nth one instead:
(?:distinctiveString[\s\S]*?){1,n}value="([^"]*)"

distinctiveString randomStuff value="val1"
moreRandomStuff
distinctiveString randomStuff value="val2"
moreRandomStuff
distinctiveString randomStuff value="val3"
moreRandomStuff
distinctiveString randomStuff value="val4"
moreRandomStuff
distinctiveString randomStuff value="val5"

So in this case what I want is with n = 2 I'd get 'val2', n = 5 I'd get 'val5', n = 8 I would also get 'val5'.

I'm passing my regular expression through an application layer, but I think it's being handed directly to Perl as is.


Try something like this:

(?:(?:[\s\S]*?distinctiveString){4}[\s\S]*?|(?:[\s\S]*distinctiveString)[\s\S]*?)value="([^"]*)"

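This puts "val4" in match group 1 when the input contains at least four distinctiveStrings, and falls back to the last value when it contains fewer, e.g. "val3" for the input: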

distinctiveString randomStuff value="val1"
moreRandomStuff
distinctiveString randomStuff value="val2"
moreRandomStuff
distinctiveString randomStuff value="val3"

A quick break down of the pattern:

(?:                                         #
  (?:[\s\S]*?distinctiveString){4}[\s\S]*?  # match 4 'distinctiveString's
  |                                         # OR
  (?:[\s\S]*distinctiveString)[\s\S]*?      # match the last 'distinctiveString'
)                                           #
value="([^"]*)"                             #

Looking at your profile, it seems you are most active in the Java tag, so here is a small Java demo:

import java.util.regex.*;

public class Main {

    private static String getNthMatch(int n, String text, String distinctive) {
        String regex = String.format(
                "(?xs)                 # enable comments and dot-all           \n" +
                "(?:                   # start non-capturing group 1           \n" +
                "  (?:.*?%s){%d}       #   match n 'distinctive' strings       \n" +
                "  |                   #   OR                                  \n" +
                "  (?:.*%s)            #   match the last 'distinctive' string \n" +
                ")                     # end non-capturing group 1             \n" +
                ".*?value=\"([^\"]*)\" # match the value                       \n",
                distinctive, n, distinctive
        );
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) throws Exception {
        String text = "distinctiveString randomStuff value=\"val1\" \n" +
                "moreRandomStuff                                    \n" +
                "distinctiveString randomStuff value=\"val2\"       \n" +
                "moreRandomStuff                                    \n" +
                "distinctiveString randomStuff value=\"val3\"       \n" +
                "moreRandomStuff                                    \n" +
                "distinctiveString randomStuff value=\"val4\"       \n" +
                "moreRandomStuff                                    \n" +
                "distinctiveString randomStuff value=\"val5\"         ";

        String distinctive = "distinctiveString";

        System.out.println(getNthMatch(4, text, distinctive));
        System.out.println(getNthMatch(5, text, distinctive));
        System.out.println(getNthMatch(6, text, distinctive));
        System.out.println(getNthMatch(7, text, distinctive));
    }
}

which will print the following to the console:

val4
val5
val5
val5

Note that . matches the same characters as [\s\S] when the dot-all option (?s) is enabled.
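A minimal sketch of that equivalence, in case it helps. The class name DotAllDemo and the helper matchesAll are made up for illustration; only java.util.regex.Pattern is assumed.

```java
import java.util.regex.Pattern;

class DotAllDemo {

    // Hypothetical helper: true if the regex matches the entire input.
    static boolean matchesAll(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    public static void main(String[] args) {
        String twoLines = "a\nb";

        // By default '.' does not match line terminators.
        System.out.println(matchesAll("a.b", twoLines));        // false

        // With (?s) enabled, '.' also matches '\n'.
        System.out.println(matchesAll("(?s)a.b", twoLines));    // true

        // [\s\S] matches any character regardless of flags.
        System.out.println(matchesAll("a[\\s\\S]b", twoLines)); // true
    }
}
```

So (?s) is only a convenience here: the [\s\S] form works even when you cannot set flags in the pattern.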

EDIT

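Yes, {1,n} is greedy. However, when [\s\S]*? comes after distinctiveString, as in (?:distinctiveString[\s\S]*?){1,3}, the group matches distinctiveString and then reluctantly zero or more characters. The reluctant part grows one character at a time, and at each step the engine tries both another repetition of the group (which needs distinctiveString at exactly that position) and the rest of the pattern. Since value=" appears before the next distinctiveString, the rest of the pattern succeeds first and the group ends up being matched only once. What you want to do is move [\s\S]*? before distinctiveString: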

import java.util.regex.*;

public class Main {

    private static String getNthMatch(int n, String text, String distinctive) {
        String regex = String.format(
                "(?:[\\s\\S]*?%s){1,%d}[\\s\\S]*?value=\"([^\"]*)\"",
                distinctive, n
        );
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) throws Exception {
        String text = "distinctiveString randomStuff value=\"val1\" \n" +
                "moreRandomStuff                                    \n" +
                "distinctiveString randomStuff value=\"val2\"       \n" +
                "moreRandomStuff                                    \n" +
                "distinctiveString randomStuff value=\"val3\"       \n" +
                "moreRandomStuff                                    \n" +
                "distinctiveString randomStuff value=\"val4\"       \n" +
                "moreRandomStuff                                    \n" +
                "distinctiveString randomStuff value=\"val5\"         ";

        String distinctive = "distinctiveString";

        System.out.println(getNthMatch(4, text, distinctive));
        System.out.println(getNthMatch(5, text, distinctive));
        System.out.println(getNthMatch(6, text, distinctive));
        System.out.println(getNthMatch(7, text, distinctive));
    }
}

which also prints:

val4
val5
val5
val5
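
To see the effect of moving [\s\S]*? side by side, here is a rough sketch that runs the pattern from the question and the pattern from the EDIT on the question's five-line input, using the n values the question asks about (2, 5 and 8). The class name QuantifierPlacementDemo and the helpers original, fixed and firstGroup are invented for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuantifierPlacementDemo {

    // Sample input from the question.
    static final String TEXT =
            "distinctiveString randomStuff value=\"val1\"\n" +
            "moreRandomStuff\n" +
            "distinctiveString randomStuff value=\"val2\"\n" +
            "moreRandomStuff\n" +
            "distinctiveString randomStuff value=\"val3\"\n" +
            "moreRandomStuff\n" +
            "distinctiveString randomStuff value=\"val4\"\n" +
            "moreRandomStuff\n" +
            "distinctiveString randomStuff value=\"val5\"";

    static final String VALUE = "value=\"([^\"]*)\"";

    // Pattern from the question: reluctant filler AFTER distinctiveString.
    static String original(int n) {
        return firstGroup("(?:distinctiveString[\\s\\S]*?){1," + n + "}" + VALUE);
    }

    // Pattern from the EDIT: reluctant filler BEFORE distinctiveString.
    static String fixed(int n) {
        return firstGroup("(?:[\\s\\S]*?distinctiveString){1," + n + "}[\\s\\S]*?" + VALUE);
    }

    // Returns group 1 of the first match in TEXT, or null if there is none.
    static String firstGroup(String regex) {
        Matcher m = Pattern.compile(regex).matcher(TEXT);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        for (int n : new int[] {2, 5, 8}) {
            System.out.println("n=" + n + " original=" + original(n) + " fixed=" + fixed(n));
        }
    }
}
```

With this input, the question's pattern should yield val1 for every n, while the moved version should yield val2, val5 and val5.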