Is it a solvable problem to generate a regular expression that matches some input set?

2023-02-01 02:01 问答作者：

I provide some input set which contains known separated number of text blocks.

I want to make a program that automatically generate 1 or more regular expressions each of which matches every text block in the input set.

I see some relatively easy ways to implement a brute-force search. 开发者_开发技巧But I'm not an expert in compilers theory. That's why I'm curious:

1) is this problem solvable? or there are some principle impossibility to make such algorithm?

2) is it possible to achieve polynomial complexity for this algorithm and avoid brute forcing?

".*" is one-or-more regexp that will match every text block in the input set. ;-)

The problem is, there are a huge number of regular expressions (actually, an infinite number) that will match a given set of inputs. They range from very "greedy" expressions that will match everything

.*

To very non "greedy" expression that will match exactly the input set

InputA OR inputB OR inputC etc

In between those two you can vary the expression in a variety of ways to make it more and less greedy (eg, replace specific digits with an expression which matches any digit, etc).

You'll have to tell us a little more about the problem for us to know where in that range of possible answers is the correct one ;)

Ok, in a comment you clarified that you want to match all strings which are either equal to one of the strings in the input set, or only differ from one of them within a given level of 'variation'. Since you didn't define 'variation' exactly, I'm going to use the Levenshtein distance.

Given a string s and an integer n, you can build the regex, which matches exactly all strings that have a Levenshtein distance of n or less to that string, like this:

First we write a function that given s and n returns a list of simple regexen which when together match all strings with a distance of n or less to s. Here "simple regex" means a regex which only contains literal characters and wildcards.

For n=0 that function just returns [s]. Otherwise it computes the list for n-1 and then goes through each item in it. For each item r and each position i where 0 <= i < length(r), it adds the following regexen to the list:

The regex where . has been added to r at position i.
The regex where the ith character of r has been deleted.
The regex where the ith character of r has been replaced by a ..

Now to calculate the regex for a given set of strings and a given value for n, we calculate the list for each string and then or all the regexen together into one big regex.

Note that this will lead to very big regexen very soon.

http://txt2re.com/ might be what you want.

继续阅读：algorithm compiler-theory regex

Is it a solvable problem to generate a regular expression that matches some input set?

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？