Extract sub-string between two certain words using regex in java

I would like to extract the substring between two given words using Java.

For example:

This is an important example about regex for my work.

I would like to extract everything between "an" and "for".

What I did so far is:

String sentence = "This is an important example about regex for my work and for me";
Pattern pattern = Pattern.compile("(?<=an).*.(?=for)");
Matcher matcher = pattern.matcher(sentence);

boolean found = false;
while (matcher.find()) {
    System.out.println("I found the text: " + matcher.group().toString());
    found = true;
}
if (!found) {
    System.out.println("I didn't find the text");
}

It works well.

But I want to do two additional things:

  1. If the sentence is "This is an important example about regex for my work and for me", I want to extract only up to the first "for", i.e. "important example about regex".

  2. Sometimes I want to limit the number of words between the two patterns to 3, i.e. "important example about".

Any ideas please?


For your first question, make the quantifier lazy. Put a question mark after the quantifier and it will match as little as possible.

(?<=an).*?(?=for)

The additional . after .* in your pattern is unnecessary; it only forces the match to contain at least one character.
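To make the difference concrete, here is a minimal sketch that runs the greedy and the lazy pattern against the sentence from the question. The class name LazyMatchDemo and the helper firstMatch are made up for illustration; only the first match is returned.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LazyMatchDemo {

    // Hypothetical helper: returns the first match of the regex, or null if there is none.
    public static String firstMatch(String regex, String text) {
        Matcher matcher = Pattern.compile(regex).matcher(text);
        return matcher.find() ? matcher.group() : null;
    }

    public static void main(String[] args) {
        String sentence = "This is an important example about regex for my work and for me";

        // Greedy: .* runs to the end and backtracks to the LAST "for".
        System.out.println("[" + firstMatch("(?<=an).*(?=for)", sentence) + "]");
        // prints "[ important example about regex for my work and ]"

        // Lazy: .*? stops at the FIRST "for".
        System.out.println("[" + firstMatch("(?<=an).*?(?=for)", sentence) + "]");
        // prints "[ important example about regex ]"
    }
}
```

Note that the match keeps the spaces next to "an" and "for"; call trim() on the result if you do not want them.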

For your second question you have to define what a "word" is. Here I would say it is probably just a sequence of non-whitespace characters followed by a whitespace character. Something like this

\S+\s

and repeat this 3 times like this

(?<=an)\s(\S+\s){3}(?=for)

To ensure that the pattern matches only on whole words, use word boundaries

(?<=\ban\b)\s(\S+\s){1,5}(?=\bfor\b)

See it online here on Regexr

{3} matches exactly 3 repetitions. For a minimum of 1 and a maximum of 3, use {1,3}.
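A small sketch of how the word limit behaves in Java, using the example sentence from the question. The class WordLimitDemo and its helper are hypothetical names. The third pattern, without the lookahead, is my own assumption about what "limit to 3 words" may mean (take the first three words no matter what follows); it is not part of the patterns above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WordLimitDemo {

    // Hypothetical helper: first match with surrounding whitespace trimmed, or null.
    public static String firstMatch(String regex, String text) {
        Matcher matcher = Pattern.compile(regex).matcher(text);
        return matcher.find() ? matcher.group().trim() : null;
    }

    public static void main(String[] args) {
        String sentence = "This is an important example about regex for my work";

        // Four words sit between "an" and "for", so a limit of 3 finds nothing.
        System.out.println(firstMatch("(?<=\\ban\\b)\\s(\\S+\\s){1,3}(?=\\bfor\\b)", sentence));
        // prints "null"

        // A limit of 5 is large enough for the four words.
        System.out.println(firstMatch("(?<=\\ban\\b)\\s(\\S+\\s){1,5}(?=\\bfor\\b)", sentence));
        // prints "important example about regex"

        // Assumed variant: drop the lookahead to take at most three words after "an".
        System.out.println(firstMatch("(?<=\\ban\\b)\\s(\\S+\\s){1,3}", sentence));
        // prints "important example about"
    }
}
```

So with the lookahead in place, the quantifier is a condition on the whole span between the two words: if more words than allowed sit between "an" and "for", there is no match at all.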

Alternative:

As dma_k correctly stated, in your case it is not necessary to use lookbehind and lookahead. See the Matcher documentation about groups.

You can use capturing groups instead. Just put the part you want to extract in parentheses and it will be stored in a capturing group.

\ban\b(.*?)\bfor\b

See it online here on Regexr

You can then access this group like this

System.out.println("I found the text: " + matcher.group(1).toString());
                                                        ^

You have only one pair of parentheses, so it's simple: pass 1 to matcher.group(1) to access the first capturing group.
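Put together, the capturing-group approach could look roughly like this. The class CaptureGroupDemo and the method between are illustrative names, and trimming the captured text is an extra step I added so the result has no leading or trailing space.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CaptureGroupDemo {

    // Whole-word "an", then a lazy capture, then whole-word "for".
    private static final Pattern BETWEEN = Pattern.compile("\\ban\\b(.*?)\\bfor\\b");

    // Hypothetical helper: text captured by group 1 of the first match, trimmed; null if no match.
    public static String between(String text) {
        Matcher matcher = BETWEEN.matcher(text);
        return matcher.find() ? matcher.group(1).trim() : null;
    }

    public static void main(String[] args) {
        System.out.println(between("This is an important example about regex for my work and for me"));
        // prints "important example about regex"

        // "plan" contains "an", but the word boundaries reject it.
        System.out.println(between("a plan for you"));
        // prints "null"
    }
}
```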


Your regex should be "an\\s+(.*?)\\s+for". It extracts all characters between "an" and "for", leaving out the surrounding whitespace (\s+). The question mark makes the quantifier lazy (non-greedy). It is needed to prevent .* from consuming everything up to the last "for".
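As a rough sketch, this pattern can also collect every "an ... for" span in a text by calling find() in a loop. The class BetweenWordsDemo and the method allBetween are names invented for this example, and the second test sentence is made up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BetweenWordsDemo {

    private static final Pattern BETWEEN = Pattern.compile("an\\s+(.*?)\\s+for");

    // Hypothetical helper: collects group 1 of every match, in order of appearance.
    public static List<String> allBetween(String text) {
        List<String> results = new ArrayList<>();
        Matcher matcher = BETWEEN.matcher(text);
        while (matcher.find()) {
            results.add(matcher.group(1));
        }
        return results;
    }

    public static void main(String[] args) {
        System.out.println(allBetween("This is an important example about regex for my work and for me"));
        // prints "[important example about regex]"

        System.out.println(allBetween("an apple for you and an orange for me"));
        // prints "[apple, orange]"
    }
}
```

Because \s+ sits outside the group, the captured text needs no trimming. The pattern has no word boundaries, though, so "an" at the end of a longer word (for example "plan ") would also start a match.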


public class SubStringBetween {

    // Plain string search, no regex. Delimiters are matched as raw substrings,
    // not as whole words.
    // Returns the text between the first occurrence of 'before' and the first
    // occurrence of 'after' that follows it, or "" if either one is missing.
    public static String subStringBetween(String sentence, String before, String after) {
        int startSub = subStringStartIndex(sentence, before);
        if (startSub < 0) {
            return "";
        }
        int stopSub = subStringEndIndex(sentence, after, startSub);
        if (stopSub < 0) {
            return "";
        }
        return sentence.substring(startSub, stopSub);
    }

    // Index just past the first occurrence of the delimiter, or -1 if it is absent.
    public static int subStringStartIndex(String sentence, String delimiterBeforeWord) {
        int index = sentence.indexOf(delimiterBeforeWord);
        return index < 0 ? -1 : index + delimiterBeforeWord.length();
    }

    // Index at which the delimiter first occurs at or after fromIndex, or -1 if it is absent.
    public static int subStringEndIndex(String sentence, String delimiterAfterWord, int fromIndex) {
        return sentence.indexOf(delimiterAfterWord, fromIndex);
    }

}
