there's a way to sort a regular expressions list by specificity?
I'm looking for something that allows me to sort a list of regular expression, or some documentation and re开发者_StackOverflowsearch,
according to their specificity/strictness
/[a-z]+/ // most strict
/[a-z0-9]+/
/[a-z0-9èòà]+/ // less strict
/.*/
but how about
/[a-z]+ABC/
/[a-z0-9]+/
which one is less specific than the other?
thank you in advance
One can equate a regular expression to the set of strings it matches (called a 'regular language'.) If our regular expression is named E
, let's call its matching strings L(E)
.
Strictness in the sense you are alluding to above then becomes the subset relation: define RE A
to be stricter than RE B
if L(A)
is a proper subset of L(B)
. This puts to rest ambiguities like synonyms for the "same" RE: they are the same precisely because they have the same regular language.
As @yi_H points out, the subset relation over RE languages (over some common alphabet) forms a partial ordering. You sound like you want a total ordering. If so, you can stipulate that an acceptable total ordering should embed the partial ordering represented by the subset relation.
I don't have a clear answer for how to build that total ordering, but two approaches come to mind.
The first is to exploit the pumping lemma. It turns out that for any RE, if it matches a sufficiently long string, then it must also match a longer string constructible from the first by repeating some subsection. You could ask what is the length of the longest matching string that does not have any such repeating segments, and make that your metric. Maybe that respects (embeds) the partial ordering, maybe it doesn't.
The other is to consider graph transformations on the RE's state machine. I suspect (but I don't have any reference) that if RE A
is properly stricter than RE B
, then B
's automaton will be calculable from A
's by collapsing states or some similar simplifying action. You could define your metric to be the number of states in the RE's smallest automaton.
As your second example shows you cannot have a total ordering of regular expressions, only a partial order is possible.
To make things even worse, there are dozens of ways you can write the same regular expression: [ab]b
vs (ab|bb)
, aa*
vs a+
. So even deciding whether two regexpes are equivalent is not a simple task.
Assuming you're talking about pure regular expressions, rather than the crazy perl stuff, you can define a partial order on regular expressions that matches your question, based on the set of strings they accept (i.e., view the regular expression as a regular language).
Given that the difference, intersection, and emptiness of regular languages are decidable problems, that means there are algorithms that will tell you if one of your expressions accepts all the strings another one does.
精彩评论