开发者

Split string of varying length using regex

I don't know if this is possible using regex. I'm just asking in case someone knows the answer.

I have a string ="hellohowareyou??". I need to split it like this

[h, el, loh, owar, eyou?, ?].

The splitting is done such that the first string will have length 1, second length 2 and so on. The last string will have the remaining characters. I can do it easily without regex using a function like this.

public Array开发者_运维知识库List<String> splitString(String s)
    {
        int cnt=0,i;
        ArrayList<String> sList=new ArrayList<String>();
        for(i=0;i+cnt<s.length();i=i+cnt)
        {
         cnt++;
         sList.add(s.substring(i,i+cnt));    
        }
        sList.add(s.substring(i,s.length()));
        return sList;
    }

I was just curious whether such a thing can be done using regex.


Solution

The following snippet generates the pattern that does the job (see it run on ideone.com):

// splits at indices that are triangular numbers
class TriangularSplitter {
 
  // asserts that the prefix of the string matches pattern
  static String assertPrefix(String pattern) {
    return "(?<=(?=^pattern).*)".replace("pattern", pattern);
  }
  // asserts that the entirety of the string matches pattern
  static String assertEntirety(String pattern) {
    return "(?<=(?=^pattern$).*)".replace("pattern", pattern);
  }
  // repeats an assertion as many times as there are dots behind current position
  static String forEachDotBehind(String assertion) {
    return "(?<=^(?:.assertion)*?)".replace("assertion", assertion);
  }
  
  public static void main(String[] args) {
    final String TRIANGULAR_SPLITTER =
      "(?x) (?<=^.) | measure (?=(.*)) check"
        .replace("measure", assertPrefix("(?: notGyet . +NBefore +1After)*"))
        .replace("notGyet", assertPrefix("(?! \\1 \\G)"))
        .replace("+NBefore", forEachDotBehind(assertPrefix("(\\1? .)")))
        .replace("+1After", assertPrefix(".* \\G (\\2?+ .)"))
        .replace("check", assertEntirety("\\1 \\G \\2 . \\3"))
        ;
    String text = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";
    System.out.println(
        java.util.Arrays.toString(text.split(TRIANGULAR_SPLITTER))
    );
    // [a, bc, def, ghij, klmno, pqrstu, vwxyzAB, CDEFGHIJ, KLMNOPQRS, TUVWXYZ]
  }
}

Note that this solution uses techniques already covered in my regex article series. The only new thing here is \G and forward references.

References

This is a brief description of the basic regex constructs used:

  • (?x) is the embedded flag modifier to enable the free-spacing mode, where unescaped whitespaces are ignored (and # can be used for comments).
  • ^ and $ are the beginning and end-of-the-line anchors. \G is the end-of-previous match anchor.
  • | denotes alternation (i.e. "or").
  • ? as a repetition specifier denotes optional (i.e. zero-or-one of). As a repetition quantifier in e.g. .*? it denotes that the * (i.e. zero-or-more of) repetition is reluctant/non-greedy.
  • (…) are used for grouping. (?:…) is a non-capturing group. A capturing group saves the string it matches; it allows, among other things, matching on back/forward/nested references (e.g. \1).
  • (?=…) is a positive lookahead; it looks to the right to assert that there's a match of the given pattern.(?<=…) is a positive lookbehind; it looks to the left.
  • (?!…) is a negative lookahead; it looks to the right to assert that there isn't a match of a pattern.

Related questions

  • Articles in the [nested-reference] series:
    • How does this regex find triangular numbers?
    • How can we match a^n b^n with Java regex?
    • How does this Java regex detect palindromes?
  • How does the regular expression (?<=#)[^#]+(?=#) work?

Explanation

The pattern matches on zero-width assertions. A rather complex algorithm is used to assert that the current position is a triangular number. There are 2 main alternatives:

  • (?<=^.), i.e. we can lookbehind and see the beginning of the string one dot away
    • This matches at index 1, and is a crucial starting point to the rest of the process
  • Otherwise, we measure to reconstruct how the last match was made (using \G as reference point), storing the result of the measurement in "before" \G and "after" \G capturing groups. We then check if the current position is the one prescribed by the measurement to find where the next match should be made.

Thus the first alternative is the trivial "base case", and the second alternative sets up how to make all subsequent matches after that. Java doesn't have custom-named groups, but here are the semantics for the 3 capturing groups:

  • \1 captures the string "before" \G
  • \2 captures some string "after" \G
  • If the length of \1 is e.g. 1+2+3+...+k, then the length of \2 needs to be k.
    • Hence \2 . has length k+1 and should be the next part in our split!
  • \3 captures the string to the right of our current position
    • Hence when we can assertEntirety on \1 \G \2 . \3, we match and set the new \G

You can use mathematical induction to rigorously prove the correctness of this algorithm.

To help illustrate how this works, let's work through an example. Let's take abcdefghijklm as input, and say that we've already partially splitted off [a, bc, def].

          \G     we now need to match here!
           ↓       ↓
a b c d e f g h i j k l m n
\____1____/ \_2_/ . \__3__/   <--- \1 G \2 . \3
  L=1+2+3    L=3           

Remember that \G marks the end of the last match, and it occurs at triangular number indices. If \G occured at 1+2+3+...+k, then the next match needs to be k+1 positions after \G to be a triangular number index.

Thus in our example, given where \G is where we just splitted off def, we measured that k=3, and the next match will split off ghij as expected.

To have \1 and \2 be built according to the above specification, we basically do a while "loop": for as long as it's notGyet, we count up to k as follows:

  • +NBefore, i.e. we extend \1 by one forEachDotBehind
  • +1After, i.e. we extend \2 by just one

Note that notGyet contains a forward reference to group 1 which is defined later in the pattern. Essentially we do the loop until \1 "hits" \G.


Conclusion

Needless to say, this particular solution has a terrible performance. The regex engine only remembers WHERE the last match was made (with \G), and forgets HOW (i.e. all capturing groups are reset when the next attempt to match is made). Our pattern must then reconstruct the HOW (an unnecessary step in traditional solutions, where variables aren't so "forgetful"), by painstakingly building strings by appending one character at a time (which is O(N^2)). Each simple measurement is linear instead of constant time (since it's done as a string matching where length is a factor), and on top of that we make many measurements which are redundant (i.e. to extend by one, we need to first re-match what we already have).

There are probably many "better" regex solutions than this one. Nonetheless, the complexity and inefficiency of this particular solution should rightfully suggest that regex is not the designed for this kind of pattern matching.

That said, for learning purposes, this is an absolutely wonderful problem, for there is a wealth of knowledge in researching and formulating its solutions. Hopefully this particular solution and its explanation has been instructive.


Regex purpose is to recognize patterns. Here you doesn't search for patterns but for a length split. So regex are not appropriate.

It is propably possible, but not with a single regex : to find the first n characters using a regex, you use: "^(.{n}).*"

So, you can search with that regex for the 1st character. Then, you make a substring, and you search for the 2 next characters. Etc.

Like @splash said, it will make the code more complicated, and unefficient, since you use regex for something outside of their purpose.


String a = "hellohowareyou??";
int i = 1;

    while(true) {

        if(i >= a.length()) {
            System.out.println(a);
            break;
        }

        else {
            String b = a.substring(i++);
            String[] out = a.split(Pattern.quote(b) + "$");
            System.out.println(out[0]);
            a = b;
            if(b.isEmpty())
                break;
        }

    }
0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜