开发者

Generating the shortest regex to match an arbitrary word list

I'm hoping someone might know of a script that can take an arbitrary word list and generated the shortest regex that could match that list exactly (and nothing else).

For example, suppose my list is

1231
1233
1234
1236
1开发者_JS百科238
1247
1256
1258
1259

Then the output should be:

12(3[13468]|47|5[589])


This is an old post, but for the benefit of those finding it through web searches as I did, there is a Perl module that does this, called Regexp::Optimizer, here: http://search.cpan.org/~dankogai/Regexp-Optimizer-0.23/lib/Regexp/Optimizer.pm

It takes a regular expression as input, which can consist just of the list of input strings separated with |, and outputs an optimal regular expression.

For example, this Perl command-line:

perl -mRegexp::Optimizer -e "print Regexp::Optimizer->new->optimize(qr/1231|1233|1234|1236|1238|1247|1256|1258|1259/)"

generates this output:

(?^:(?^:12(?:3[13468]|5[689]|47)))

(assuming you have installed Regex::Optimizer), which matches the OP's expectation quite well.

Here's another example:

perl -mRegexp::Optimizer -e "print Regexp::Optimizer->new->optimize(qr/314|324|334|3574|384/)"

And the output:

(?^:(?^:3(?:[1238]|57)4))

For comparison, an optimal trie-based version would output 3(14|24|34|574|84). In the above output, you can also search and replace (?: and (?^: with just ( and eliminate redundant parentheses, to obtain this:

3([1238]|57)4


You are probably better off saving the entire list, or if you want to get fancy, create a Trie:

1231
1234
1247

    1
    |
    2 
   / \
  3   4
 / \   \
1   4   7

Now when you take a string check if it reaches a leaf node. It does, it's valid.

If you have variable length overlapping strings (eg: 123 and 1234) you'll need to mark some nodes as possibly terminal.


You can also use the trie to generate the regex if you really like the regex idea:

  1. Nodes from the root to the first branching are fixed (eg: 12)

  2. Branches create |: (eg: 12(3|4)

  3. Leaf nodes generate a character class (or single character) that follows the parent node: (eg 12(3[14]|47))

This might not generate the shortest regex, to do that you'll might some extra work:

  1. "Compact" ranges if you find them (eg [12345] becomes [1-4])

  2. Add quantifiers for repeated elements (eg: [1234][1234] becomes [1234]{2}

  3. ???

I really don't think it's worth it to generate the regex.


This project generates a regexp from a given list of words: https://github.com/bwagner/wordhierarchy

It almost does the same as the above JavaScript solution, but avoids certain superfluous parentheses. It only uses "|", non-capturing group "(?:)" and option "?". There's room for improvement when there's a row of single characters: Instead of e.g. (?:3|8|1|6|4) it could generate [38164].

The generated regexp could easily be adapted to other regexp dialects.

Sample usage:

java -jar dist/wordhierarchy.jar 1231 1233 1234 1236 1238 1247 1256 1258 1259
-> 12(?:5(?:6|9|8)|47|3(?:3|8|1|6|4))


Here's what I came up with (JavaScript). It turned a list of 20,000 6-digit numbers into a 60,000-character regular expression. Compared to a naive (word1|word2|...) construction, that's almost 60% "compression" by character count.

I'm leaving the question open, as there's still a lot of room for improvement and I'm holding out hope that there might be a better tool out there.

var list = new listChar("");

function listChar(s, p) {
    this.char = s;
    this.depth = 0;
    this.parent = p;
    this.add = function(n) {
        if (!this.subList) {
            this.subList = {};
            this.increaseDepth();
        }
        if (!this.subList[n]) {
            this.subList[n] = new listChar(n, this);
        }
        return this.subList[n];
    }
    this.toString = function() {
        var ret = "";
        var subVals = [];
        if (this.depth >=1) {
            for (var i in this.subList) {
                subVals[subVals.length] = this.subList[i].toString();
            }
        }
        if (this.depth === 1 && subVals.length > 1) {
            ret = "[" + subVals.join("") + "]";
        } else if (this.depth === 1 && subVals.length === 1) {
            ret = subVals[0];
        } else if (this.depth > 1) {
            ret = "(" + subVals.join("|") + ")";
        }
        return this.char + ret;
    }
    this.increaseDepth = function() {
        this.depth++;
        if (this.parent) {
            this.parent.increaseDepth();
        }
    }
}

function wordList(input) {
    var listStep = list;
    while (input.length > 0) {
        var c = input.charAt(0);
        listStep = listStep.add(c);
        input = input.substring(1);
    }
}
words = [/* WORDS GO HERE*/];
for (var i = 0; i < words.length; i++) {
    wordList(words[i]);
}

document.write(list.toString());

Using

words = ["1231","1233","1234","1236","1238","1247","1256","1258","1259"];

Here's the output:

(1(2(3[13468]|47|5[689])))
0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜