How to create a Pattern matching a given set of chars?

2023-02-14 09:54

I get a set of chars, e.g. as a String containing all of them, and need a charclass Pattern matching any of them. For example

for "abcde" I want "[a-e]"
for "[]^-" I want "[-^\\[\\]]"

How can I build this compactly, and how should I handle border cases such as the empty set and the set of all chars?

What chars need to be escaped?

Clarification

I want to create a charclass Pattern, i.e. something like "[...]", no repetitions and no such stuff. It must work for any input, that's why I'm interested in the corner cases, too.

Here's a start:

import java.util.*;

public class RegexUtils {

    private static String encode(char c) {
        switch (c) {
            case '[':
            case ']':
            case '\\':
            case '-':
            case '^':
                return "\\" + c;
            default:
                return String.valueOf(c);
        }
    }

    public static String createCharClass(char[] chars) {

        if (chars.length == 0) {
            return "[^\\u0000-\\uFFFF]";
        }

        StringBuilder builder = new StringBuilder();

        boolean includeCaret = false;
        boolean includeMinus = false;

        List<Character> set = new ArrayList<Character>(new TreeSet<Character>(toCharList(chars)));

        if (set.size() == 1<<16) {
            return "[\\w\\W]";
        }

        for (int i = 0; i < set.size(); i++) {

            int rangeLength = discoverRange(i, set);

            if (rangeLength > 2) {
                builder.append(encode(set.get(i))).append('-').append(encode(set.get(i + rangeLength)));
                i += rangeLength;
            } else {
                switch (set.get(i)) {
                    case '[':
                    case ']':
                    case '\\':
                        builder.append('\\').append(set.get(i));
                        break;
                    case '-':
                        includeMinus = true;
                        break;
                    case '^':
                        includeCaret = true;
                        break;
                    default:
                        builder.append(set.get(i));
                        break;
                }
            }
        }

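        // '^' goes last and '-' goes first so that neither needs escaping;
        // a lone '^' would produce the invalid class "[^]", so escape it in that case
        if (includeCaret) {
            builder.append(builder.length() == 0 && !includeMinus ? "\\^" : "^");
        }
        if (includeMinus) {
            builder.insert(0, '-');
        }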

        return "[" + builder + "]";
    }

    private static List<Character> toCharList(char[] chars) {
        List<Character> list = new ArrayList<Character>();
        for (char c : chars) {
            list.add(c);
        }
        return list;
    }

    private static int discoverRange(int index, List<Character> chars) {
        int range = 0;
        for (int i = index + 1; i < chars.size(); i++) {
            if (chars.get(i) - chars.get(i - 1) != 1) break;
            range++;
        }
        return range;
    }

    public static void main(String[] args) {
        System.out.println(createCharClass("daecb".toCharArray()));
        System.out.println(createCharClass("[]^-".toCharArray()));
        System.out.println(createCharClass("".toCharArray()));
        System.out.println(createCharClass("d1a3e5c55543b2000".toCharArray()));
        System.out.println(createCharClass("!-./0".toCharArray()));
    }
}

As you can see, the input:

"daecb".toCharArray()
"[]^-".toCharArray()
"".toCharArray()
"d1a3e5c55543b2000".toCharArray()
"!-./0".toCharArray()
prints:

[a-e]
[-\[\]^]
[^\u0000-\uFFFF]
[0-5a-e]
[!\--0]

The corner cases in a character class are \, [ and ], each of which needs a \ to be escaped. The character ^ doesn't need an escape if it's not placed at the start of a character class, and the - does not need to be escaped when it's placed at the start or end of the character class (hence the boolean flags in my code).
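The position rules for ^ and - can be checked directly against java.util.regex. This is only a small sketch; the class name PositionRulesDemo and the helper matchesChar are made up for illustration.

```java
import java.util.regex.Pattern;

class PositionRulesDemo {

    // Hypothetical helper: does the character class match this single character?
    static boolean matchesChar(String charClass, char c) {
        return Pattern.matches(charClass, String.valueOf(c));
    }

    public static void main(String[] args) {
        // '^' is a literal when it is not the first character of the class
        System.out.println(matchesChar("[a^]", '^'));   // true
        // '^' in first position negates the class instead
        System.out.println(matchesChar("[^a]", 'a'));   // false
        System.out.println(matchesChar("[^a]", 'b'));   // true
        // '-' is a literal at the start or at the end of the class
        System.out.println(matchesChar("[-a]", '-'));   // true
        System.out.println(matchesChar("[a-]", '-'));   // true
        // between two characters it forms a range
        System.out.println(matchesChar("[a-c]", 'b'));  // true
        System.out.println(matchesChar("[a-c]", '-'));  // false
    }
}
```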

The empty set is [^\u0000-\uFFFF], and the set of all the characters is [\u0000-\uFFFF]. I'm not sure what you need the former for, as it won't match any char. I'd throw an IllegalArgumentException on an empty input instead.
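To see how these two classes behave, one can count how many single chars each of them matches. The sketch below is an illustration only: EmptyAndFullClassDemo, countMatches and requireNonEmpty are hypothetical names, and surrogate values are skipped on purpose because a lone surrogate is not a complete character.

```java
import java.util.regex.Pattern;

class EmptyAndFullClassDemo {

    static final Pattern NOTHING = Pattern.compile("[^\\u0000-\\uFFFF]");
    static final Pattern ANYTHING = Pattern.compile("[\\u0000-\\uFFFF]");

    // Counts the non-surrogate char values that the pattern matches as a one-char string.
    static int countMatches(Pattern p) {
        int count = 0;
        for (int c = 0; c <= 0xFFFF; c++) {
            if (Character.isSurrogate((char) c)) {
                continue;
            }
            if (p.matcher(String.valueOf((char) c)).matches()) {
                count++;
            }
        }
        return count;
    }

    // Hypothetical guard for the alternative of rejecting empty input outright.
    static char[] requireNonEmpty(char[] chars) {
        if (chars.length == 0) {
            throw new IllegalArgumentException("empty character set");
        }
        return chars;
    }

    public static void main(String[] args) {
        System.out.println(countMatches(NOTHING));   // 0
        System.out.println(countMatches(ANYTHING));  // 63488 (65536 values minus 2048 surrogates)
    }
}
```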

What chars need to be escaped?

The five characters - ^ \ [ ] are all of them; I've actually tested it. Unlike in some other regex implementations, [ is considered a metacharacter inside a character class here, possibly because inner character classes can be combined with operators.
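A check along these lines is easy to repeat. The following sketch (EscapeCheck, escape and roundTrips are hypothetical names, and it only covers printable ASCII) escapes exactly those five characters and verifies that every one-character class matches its own character and no other.

```java
import java.util.regex.Pattern;

class EscapeCheck {

    // Escape only the five characters named above; everything else is passed through.
    static String escape(char c) {
        return "-^\\[]".indexOf(c) >= 0 ? "\\" + c : String.valueOf(c);
    }

    // True if, for every printable ASCII char c, the class "[c]" (escaped as needed)
    // compiles and matches c but no other printable ASCII char.
    static boolean roundTrips() {
        for (char c = 0x20; c <= 0x7E; c++) {
            Pattern p = Pattern.compile("[" + escape(c) + "]");
            for (char d = 0x20; d <= 0x7E; d++) {
                boolean expected = (c == d);
                if (p.matcher(String.valueOf(d)).matches() != expected) {
                    return false;
                }
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(roundTrips());
    }
}
```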

The rest of the task sounds easy, but rather tedious. First you need to select the unique characters. Then loop through them, appending each to a StringBuilder and escaping where necessary. If you want character ranges, you need to sort the characters first and pick out contiguous runs while looping. If you want the - to sit unescaped at the beginning of the class, set a flag when you encounter it, but don't append it. After the loop, if the flag is set, prepend - to the result before wrapping it in [].
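The steps above could be sketched roughly as follows. This is one possible reading, not a definitive implementation: CharClassBuilder, build and escape are hypothetical names, ^ is always escaped for simplicity, runs of two characters are written out individually, and empty input is rejected.

```java
import java.util.TreeSet;

class CharClassBuilder {

    // '-' is handled separately through the flag, so it is not listed here.
    static String escape(char c) {
        return "^\\[]".indexOf(c) >= 0 ? "\\" + c : String.valueOf(c);
    }

    static String build(String chars) {
        if (chars.isEmpty()) {
            throw new IllegalArgumentException("empty character set");
        }
        // A TreeSet gives unique characters in sorted order.
        TreeSet<Character> unique = new TreeSet<>();
        for (char c : chars.toCharArray()) {
            unique.add(c);
        }
        // Hold '-' back: remember it in a flag instead of appending it.
        boolean hasMinus = unique.remove('-');
        Character[] sorted = unique.toArray(new Character[0]);

        StringBuilder sb = new StringBuilder();
        int i = 0;
        while (i < sorted.length) {
            // Find the end j of the contiguous run starting at i.
            int j = i;
            while (j + 1 < sorted.length && sorted[j + 1] - sorted[j] == 1) {
                j++;
            }
            sb.append(escape(sorted[i]));
            if (j - i >= 2) {
                sb.append('-');              // three or more chars: write first-last
            }
            if (j > i) {
                sb.append(escape(sorted[j]));
            }
            i = j + 1;
        }
        if (hasMinus) {
            sb.insert(0, '-');               // unescaped '-' is safe in first position
        }
        return "[" + sb + "]";
    }

    public static void main(String[] args) {
        System.out.println(build("abcde"));              // [a-e]
        System.out.println(build("[]^-"));               // [-\[\]\^]
        System.out.println(build("d1a3e5c55543b2000"));  // [0-5a-e]
        System.out.println(build("!-./0"));              // [-!.-0]
    }
}
```

Removing - from the set before looking for runs means a run such as , - . is split around it, which keeps the flag logic simple at the cost of a slightly longer class in that rare case.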

Match all characters: ".*" (zero or more repetitions, *, of any character, .).

Match a blank line: "^$" (the start of a line, ^, immediately followed by the end of a line, $; note that there is nothing to match in between).

Not sure if the last pattern is exactly what you wanted, as there are different interpretations of "match nothing".
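For comparison, here is how those two patterns behave when matched against whole strings. Note that neither is a character class of the form "[...]". The sketch uses only java.util.regex; WholeStringPatternsDemo and matchesWhole are hypothetical names.

```java
import java.util.regex.Pattern;

class WholeStringPatternsDemo {

    // Hypothetical helper: does the regex, compiled with the given flags, match the entire input?
    static boolean matchesWhole(String regex, String input, int flags) {
        return Pattern.compile(regex, flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(matchesWhole(".*", "anything at all", 0));        // true
        System.out.println(matchesWhole(".*", "", 0));                       // true
        System.out.println(matchesWhole("^$", "", 0));                       // true
        System.out.println(matchesWhole("^$", "x", 0));                      // false
        // By default '.' does not match line terminators, so ".*" is not quite "everything"
        System.out.println(matchesWhole(".*", "two\nlines", 0));             // false
        System.out.println(matchesWhole(".*", "two\nlines", Pattern.DOTALL)); // true
    }
}
```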

A quick, dirty, and almost-not-pseudo-code answer:

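// assumes java.util.* and java.util.regex.Pattern are imported and sourceString is a String
StringBuilder sb = new StringBuilder("[");
Set<Character> metaChars = new HashSet<>(Arrays.asList('[', ']', '\\', '-', '^'));
while (sourceString.length() != 0) {
 char c = sourceString.charAt(0);
 sb.append(metaChars.contains(c) ? "\\" + c : String.valueOf(c));
 // Strings are immutable, so reassign; this drops every occurrence of c
 sourceString = sourceString.replace(String.valueOf(c), "");
}
sb.append("]");
//...check here for the appropriate sb.length() cases, before compiling
// e.g. 2 = empty ("[]" does not compile), all chars equals the count of whatever set qualifies as all chars, etc
Pattern p = Pattern.compile(sb.toString());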

This gives you a class containing each char you want to match exactly once, with metacharacters escaped. It will not convert things into ranges (which I think is fine; doing so smells like premature optimization to me). You can do some post-tests for simple set cases, like matching sb against digits, non-digits, etc., but unless you know that's going to buy you a lot of performance (or the simplification is the point of this program), I wouldn't bother.

If you really want to do ranges, you could instead call sourceString.toCharArray(), sort the result, and iterate over it, skipping repetitions, doing some sort of range check, and escaping metacharacters as you add the contents to the StringBuilder.

EDIT: I actually kind of liked the toCharArray version, so pseudo-coded it out as well:

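//...check for empty here, if not...
char[] sourceC = sourceString.toCharArray();
Arrays.sort(sourceC);
String meta = "[]\\-^"; // characters that get a backslash inside the class
StringBuilder sb = new StringBuilder("[");
int i = 0;
while (i < sourceC.length) {
  char first = sourceC[i];
  char last = first;
  // extend the run over repetitions (difference 0) and the next char in sequence (difference 1)
  while (i + 1 < sourceC.length && sourceC[i + 1] - last <= 1) {
    last = sourceC[++i];
  }
  i++;
  // check range size, append accordingly to sb as a single item, a pair or a range
  sb.append(meta.indexOf(first) >= 0 ? "\\" + first : String.valueOf(first));
  if (last - first >= 2) sb.append('-');
  if (last != first) sb.append(meta.indexOf(last) >= 0 ? "\\" + last : String.valueOf(last));
}
sb.append("]");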
