How to create a Pattern matching given set of chars?

I get a set of chars, e.g. as a String containing all of them, and need a charclass Pattern matching any of them. For example

  • for "abcde" I want "[a-e]"
  • for "[]^-" I want "[-^\\[\\]]"

How can I create a compact solution, and how should I handle border cases like the empty set and the set of all chars?

What chars need to be escaped?

Clarification

I want to create a charclass Pattern, i.e. something like "[...]", with no quantifiers or anything like that. It must work for any input, which is why I'm interested in the corner cases, too.


Here's a start:

import java.util.*;

public class RegexUtils {

    private static String encode(char c) {
        switch (c) {
            case '[':
            case ']':
            case '\\':
            case '-':
            case '^':
                return "\\" + c;
            default:
                return String.valueOf(c);
        }
    }

    public static String createCharClass(char[] chars) {

        if (chars.length == 0) {
            return "[^\\u0000-\\uFFFF]";
        }

        StringBuilder builder = new StringBuilder();

        boolean includeCaret = false;
        boolean includeMinus = false;

        List<Character> set = new ArrayList<Character>(new TreeSet<Character>(toCharList(chars)));

        if (set.size() == 1<<16) {
            return "[\\w\\W]";
        }

        for (int i = 0; i < set.size(); i++) {

            int rangeLength = discoverRange(i, set);

            if (rangeLength > 2) {
                builder.append(encode(set.get(i))).append('-').append(encode(set.get(i + rangeLength)));
                i += rangeLength;
            } else {
                switch (set.get(i)) {
                    case '[':
                    case ']':
                    case '\\':
                        builder.append('\\').append(set.get(i));
                        break;
                    case '-':
                        includeMinus = true;
                        break;
                    case '^':
                        includeCaret = true;
                        break;
                    default:
                        builder.append(set.get(i));
                        break;
                }
            }
        }

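        // A lone '^' would produce the invalid class "[^]", so escape it when it would come first.
        if (includeCaret) {
            builder.append(builder.length() == 0 && !includeMinus ? "\\^" : "^");
        }
        builder.insert(0, includeMinus ? "-" : "");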

        return "[" + builder + "]";
    }

    private static List<Character> toCharList(char[] chars) {
        List<Character> list = new ArrayList<Character>();
        for (char c : chars) {
            list.add(c);
        }
        return list;
    }

    private static int discoverRange(int index, List<Character> chars) {
        int range = 0;
        for (int i = index + 1; i < chars.size(); i++) {
            if (chars.get(i) - chars.get(i - 1) != 1) break;
            range++;
        }
        return range;
    }

    public static void main(String[] args) {
        System.out.println(createCharClass("daecb".toCharArray()));
        System.out.println(createCharClass("[]^-".toCharArray()));
        System.out.println(createCharClass("".toCharArray()));
        System.out.println(createCharClass("d1a3e5c55543b2000".toCharArray()));
        System.out.println(createCharClass("!-./0".toCharArray()));
    }
}

As you can see, the input:

"daecb".toCharArray()
"[]^-".toCharArray()
"".toCharArray()
"d1a3e5c55543b2000".toCharArray()
"!-./0".toCharArray()

prints:

[a-e]
[-\[\]^]
[^\u0000-\uFFFF]
[0-5a-e]
[!\--0]

The characters that always need a backslash escape inside a character class are:

  • \
  • [
  • ]

The character ^ doesn't need an escape as long as it's not placed at the start of the character class, and - doesn't need an escape when it's placed at the start or the end of the character class (hence the boolean flags in my code).
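
To make the position rules concrete, here is a small check one could run. The class name EscapePositionDemo and its helper are made up for illustration; only java.util.regex.Pattern is real API.

```java
import java.util.regex.Pattern;

class EscapePositionDemo {
    /** True if the whole string s is matched by the given character class. */
    static boolean matches(String charClass, String s) {
        return Pattern.matches(charClass, s);
    }

    public static void main(String[] args) {
        System.out.println(matches("[a^]", "^"));           // '^' not first: a literal caret
        System.out.println(matches("[-a]", "-"));           // '-' first: a literal minus
        System.out.println(matches("[a-]", "-"));           // '-' last: a literal minus
        System.out.println(matches("[\\[\\]\\\\]", "\\"));  // \ [ ] escaped with a backslash
        System.out.println(matches("[^a]", "a"));           // '^' first: negation, so 'a' is rejected
    }
}
```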


The empty set is [^\u0000-\uFFFF], and the set of all characters is [\u0000-\uFFFF]. I'm not sure what you need the former for, as it won't match any char. I'd throw an IllegalArgumentException on an empty string instead.
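
A minimal sketch of both ideas, assuming the input arrives as a String. The class CornerCaseDemo and the method requireNonEmpty are hypothetical names, and the two patterns only cover the char range \u0000-\uFFFF.

```java
import java.util.regex.Pattern;

class CornerCaseDemo {
    // Negated full char range: rejects every char from \u0000 to \uFFFF.
    static final Pattern NONE = Pattern.compile("[^\\u0000-\\uFFFF]");
    // Full char range: accepts every char from \u0000 to \uFFFF.
    static final Pattern ALL = Pattern.compile("[\\u0000-\\uFFFF]");

    /** Hypothetical guard: reject an empty set instead of building a class for it. */
    static String requireNonEmpty(String chars) {
        if (chars.isEmpty()) {
            throw new IllegalArgumentException("empty character set");
        }
        return chars;
    }

    public static void main(String[] args) {
        System.out.println(ALL.matcher("a").matches());   // any single char is accepted
        System.out.println(NONE.matcher("a").matches());  // no single char is accepted
        try {
            requireNonEmpty("");
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```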

What chars need to be escaped?

The characters -, ^, \, [ and ]: that's all of them, I've actually tested it. Unlike some other regex implementations, Java treats [ as a meta character inside a character class, possibly because inner character classes can be combined with operators.

The rest of the task sounds easy, but rather tedious. First you need to select the unique characters. Then loop through them, appending to a StringBuilder and escaping where necessary. If you want character ranges, you need to sort the characters first and select contiguous ranges while looping. If you want the - to sit unescaped at the beginning of the class, set a flag when you see it but don't append it. After the loop, if the flag is set, prepend - to the result before wrapping it in [].
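
The steps above can be sketched roughly as follows. This is one possible reading, not a definitive implementation: the names CharClassSketch, escape and build are made up here, runs of three or more consecutive chars become ranges, and ^ is always escaped for simplicity instead of being moved to the end.

```java
import java.util.TreeSet;
import java.util.regex.Pattern;

class CharClassSketch {
    /** Backslash-escapes the characters that are special inside a character class. */
    static String escape(char c) {
        return "\\[]^-".indexOf(c) >= 0 ? "\\" + c : String.valueOf(c);
    }

    static String build(String chars) {
        if (chars.isEmpty()) {
            throw new IllegalArgumentException("empty character set");
        }
        // Unique characters in sorted order.
        TreeSet<Character> unique = new TreeSet<>();
        for (char c : chars.toCharArray()) unique.add(c);
        char[] sorted = new char[unique.size()];
        int n = 0;
        for (char c : unique) sorted[n++] = c;

        StringBuilder sb = new StringBuilder();
        boolean dash = false;  // set when a stand-alone '-' was seen
        int i = 0;
        while (i < sorted.length) {
            // Extend j to the end of the contiguous run starting at i.
            int j = i;
            while (j + 1 < sorted.length && sorted[j + 1] == sorted[j] + 1) j++;
            if (j - i >= 2) {
                // Three or more consecutive chars: emit first-last.
                sb.append(escape(sorted[i])).append('-').append(escape(sorted[j]));
            } else {
                for (int k = i; k <= j; k++) {
                    if (sorted[k] == '-') dash = true;  // flag it, don't append
                    else sb.append(escape(sorted[k]));
                }
            }
            i = j + 1;
        }
        if (dash) sb.insert(0, '-');  // unescaped '-' goes first
        return "[" + sb + "]";
    }

    public static void main(String[] args) {
        System.out.println(build("daecb"));
        System.out.println(build("[]^-"));
        System.out.println(Pattern.matches(build("!-./0"), "."));
    }
}
```

Escaping ^ unconditionally costs one extra character in the output but avoids having to reason about whether it ends up first in the class.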


Match all characters: ".*" (zero or more repetitions, *, of any character, .). Note that by default . does not match line terminators.

Match a blank line: "^$" (the start of a line, ^, immediately followed by the end of a line, $, with nothing to match in between).

I'm not sure the last pattern is exactly what you wanted, as there are different interpretations of "match nothing".
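
For reference, a short sketch of how these two patterns behave with java.util.regex; the class and method names are made up for illustration.

```java
import java.util.regex.Pattern;

class MatchAllDemo {
    /** ".*" with default flags: '.' stops at line terminators. */
    static boolean matchesAll(String s) {
        return Pattern.matches(".*", s);
    }

    /** ".*" with DOTALL: '.' also matches line terminators. */
    static boolean matchesAllDotall(String s) {
        return Pattern.compile(".*", Pattern.DOTALL).matcher(s).matches();
    }

    /** "^$": only the empty string matches as a whole. */
    static boolean isEmptyLine(String s) {
        return Pattern.matches("^$", s);
    }

    public static void main(String[] args) {
        System.out.println(matchesAll("abc"));
        System.out.println(matchesAll("a\nb"));
        System.out.println(matchesAllDotall("a\nb"));
        System.out.println(isEmptyLine(""));
        System.out.println(isEmptyLine("x"));
    }
}
```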


A quick, dirty, and almost-not-pseudo-code answer:

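// needs: import java.util.*; import java.util.regex.Pattern;
StringBuilder sb = new StringBuilder("[");
// assumption: the character-class meta characters are [ ] \ ^ -
Set<Character> metaChars = new HashSet<>(Arrays.asList('[', ']', '\\', '^', '-'));
while (sourceString.length() != 0) {
    char c = sourceString.charAt(0);
    sb.append(metaChars.contains(c) ? "\\" + c : String.valueOf(c));
    // Strings are immutable, so reassign; this drops every occurrence of c
    sourceString = sourceString.replace(String.valueOf(c), "");
}
sb.append("]");
//...can check here for the appropriate sb.length cases, before compiling ("[]" does not compile)
// e.g, 2 = empty, all chars equals the count of whatever set qualifies as all chars, etc
Pattern p = Pattern.compile(sb.toString());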

This gives you the string of unique chars you want to match, with meta-characters escaped. It will not convert things into ranges (which I think is fine; doing so smells like premature optimization to me). You can add some post-checks for simple set cases, like comparing sb against digits, non-digits, etc., but unless you know that's going to buy you a lot of performance (or the simplification is the point of this program), I wouldn't bother.

If you really want to do ranges, you could instead call sourceString.toCharArray(), sort the result, and iterate over it, skipping repetitions, checking for ranges, and escaping meta characters as you add the contents to the StringBuilder.

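EDIT: I actually kind of liked the toCharArray version, so I sketched it out as well:

//...check for empty here, if not...
char[] sourceC = sourceString.toCharArray();
Arrays.sort(sourceC);
char firstC = sourceC[0];  // first char of the current run
char lastC = sourceC[0];   // last char of the current run
StringBuilder sb = new StringBuilder("[");
for (int i = 1; i <= sourceC.length; i++) {
  if (i < sourceC.length) {
    if (lastC == sourceC[i]) continue;                              // skip repetitions
    if (sourceC[i] == lastC + 1) { lastC = sourceC[i]; continue; }  // next char in sequence: extend the run
  }
  // the run is complete: append it to sb as a single item, a pair, or a range
  sb.append(escape(firstC));
  if (lastC - firstC > 1) sb.append('-');
  if (lastC != firstC) sb.append(escape(lastC));
  if (i < sourceC.length) firstC = lastC = sourceC[i];
}
sb.append("]");

// helper (assumed): backslash-escape the meta characters [ ] \ ^ -
static String escape(char c) { return "[]\\^-".indexOf(c) >= 0 ? "\\" + c : String.valueOf(c); }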