开发者

Is it a solvable problem to generate a regular expression that matches some input set?

I provide some input set which contains known separated number of text blocks.

I want to make a program that automatically generate 1 or more regular expressions each of which matches every text block in the input set.

I see some relatively easy ways to implement a brute-force search. 开发者_开发技巧But I'm not an expert in compilers theory. That's why I'm curious:

1) is this problem solvable? or there are some principle impossibility to make such algorithm?

2) is it possible to achieve polynomial complexity for this algorithm and avoid brute forcing?


".*" is one-or-more regexp that will match every text block in the input set. ;-)


The problem is, there are a huge number of regular expressions (actually, an infinite number) that will match a given set of inputs. They range from very "greedy" expressions that will match everything

.*

To very non "greedy" expression that will match exactly the input set

InputA OR inputB OR inputC etc

In between those two you can vary the expression in a variety of ways to make it more and less greedy (eg, replace specific digits with an expression which matches any digit, etc).

You'll have to tell us a little more about the problem for us to know where in that range of possible answers is the correct one ;)


Ok, in a comment you clarified that you want to match all strings which are either equal to one of the strings in the input set, or only differ from one of them within a given level of 'variation'. Since you didn't define 'variation' exactly, I'm going to use the Levenshtein distance.

Given a string s and an integer n, you can build the regex, which matches exactly all strings that have a Levenshtein distance of n or less to that string, like this:

First we write a function that given s and n returns a list of simple regexen which when together match all strings with a distance of n or less to s. Here "simple regex" means a regex which only contains literal characters and wildcards.

For n=0 that function just returns [s]. Otherwise it computes the list for n-1 and then goes through each item in it. For each item r and each position i where 0 <= i < length(r), it adds the following regexen to the list:

  • The regex where . has been added to r at position i.
  • The regex where the ith character of r has been deleted.
  • The regex where the ith character of r has been replaced by a ..

Now to calculate the regex for a given set of strings and a given value for n, we calculate the list for each string and then or all the regexen together into one big regex.

Note that this will lead to very big regexen very soon.


http://txt2re.com/ might be what you want.

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜