开发者

Regular expression listing all possibilities

Given a regular expression, how开发者_Go百科 can I list all possible matches? For example: AB[CD]1234, I want it to return a list like: ABC1234 ABD1234

I searched the web, but couldn't find anything.


Exrex can do this:

$ python exrex.py 'AB[CD]1234'
ABC1234
ABD1234


The reason you haven't found anything is probably because this is a problem of serious complexity given the amount of combinations certain expressions would allow. Some regular expressions could even allow infite matches:

Consider following expressions:

AB[A-Z0-9]{1,10}1234

AB.*1234

I think your best bet would be to create an algorithm yourself based on a small subset of allowed patterns. In your specific case, I would suggest to use a more naive approach than a regular expression.


For some simple regular expressions like the one you provided (AB[CD]1234), there is a limited set of matches. But for other expressions (AB[CD]*1234) the number of possible matches are not limited.

One method for locating all the posibilities, is to detect where in the regular expression there are choices. For each possible choice generate a new regular expression based on the original regular expression and the current choice. This new regular expression is now a bit simpler than the original one.

For an expression like "A[BC][DE]F", the method will proceed as follows

getAllMatches("A[BC][DE]F")
= getAllMatches("AB[DE]F") + getAllMatches("AC[DE]F")
= getAllMatches("ABDF") + getAllMatches("ABEF") 
   + getAllMatches("ACDF")+ getAllMatches("ACEF")
= "ABDF" + "ABEF" + "ACDF" + "ACEF"


It's possible to write an algorithm to do this but it will only work for regular expressions that have a finite set of possible matches. Your regexes would be limited to using:

  • Optional: ?
  • Characters: . \d \D
  • Sets: like [1a-c]
  • Negated sets: [^2-9d-z]
  • Alternations: |
  • Positive lookarounds

So your regexes could NOT use:

  • Repeaters: * +
  • Word patterns: \w \W
  • Negative lookarounds
  • Some zero-width assertions: ^ $

And there are some others (word boundaries, lazy & greedy quantifiers) I'm not sure about yet.

As for the algorithm itself, another user posted a link to this answer which describes how to create it.


Well you could convert the regular expression into an equivalent finite state machine (is relatively simple and can be done algorithmly) and then recursively folow every possible path through that fsm, outputting the followed paths through the machine. It's neither very hard nor computer intensive per output (you will normally get a HUGE amount of output however). You should however take care to disallow potentielly infinite passes (like .*). This can be done by having a maximum allowed path length, after which the tracing is aborted


A regular expression is intended to do nothing more than match to a pattern, that being said, the regular expression will never 'list' anything, only match. If you want to get a list of all matches I believe you will need to do it on your own.


Impossible.

Really.

Consider look ahead assertions. And what about .*, how will you generate all possible strings that match that regex?


It may be possible to find some code to list all possible matches for something as simple as you are doing. But most regular expressions you would not even want to attempt listing all possible matches.

For example AB.*1234 would be AB followed by absolutely anything and then 1234.


I'm not entirely sure this is even possible, but if it were, it would be so cpu/time intensive for many situations that it would not be useful.

For instance, try to make a list of all matches for A.*Z

There are sites that help with building a good regular expression though:

  • http://www.fileformat.info/tool/regex.htm
  • http://www.regular-expressions.info/javascriptexample.html
  • http://www.regextester.com/
0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜