
What's this convoluted regex doing?

I am having a tough time figuring this out:

( $dwg, $rev, $rest ) = ($file =~ /^(\d{3}[_-][\w\d]{3}[_-]\d{3,4}(?:[_-]\d{3,4})?)(?:[_ -]\w)?[_ ]{0,5}[rR](?:[eE][vV])?(?:\.)? ?([\w\d-]?) *(.*)/);


YAPE::Regex::Explain is a module that accepts as input any regular expression, and as output offers an explanation of what the regex does. Here's an example:

use Modern::Perl;
use YAPE::Regex::Explain;

my $re = qr/^(\d{3}[_-][\w\d]{3}[_-]\d{3,4}(?:[_-]\d{3,4})?)(?:[_ -]\w)?[_ ]{0,5}[rR](?:[eE][vV])?(?:\.)? ?([\w\d-]?) *(.*)/;

say YAPE::Regex::Explain->new($re)->explain();

And here's the output:

The regular expression:

(?-imsx:^(\d{3}[_-][\w\d]{3}[_-]\d{3,4}(?:[_-]\d{3,4})?)(?:[_ -]\w)?[_ ]{0,5}[rR](?:[eE][vV])?(?:\.)? ?([\w\d-]?) *(.*))

matches as follows:

NODE                     EXPLANATION
----------------------------------------------------------------------
(?-imsx:                 group, but do not capture (case-sensitive)
                         (with ^ and $ matching normally) (with . not
                         matching \n) (matching whitespace and #
                         normally):
----------------------------------------------------------------------
  ^                        the beginning of the string
----------------------------------------------------------------------
  (                        group and capture to \1:
----------------------------------------------------------------------
    \d{3}                    digits (0-9) (3 times)
----------------------------------------------------------------------
    [_-]                     any character of: '_', '-'
----------------------------------------------------------------------
    [\w\d]{3}                any character of: word characters (a-z,
                             A-Z, 0-9, _), digits (0-9) (3 times)
----------------------------------------------------------------------
    [_-]                     any character of: '_', '-'
----------------------------------------------------------------------
    \d{3,4}                  digits (0-9) (between 3 and 4 times
                             (matching the most amount possible))
----------------------------------------------------------------------
    (?:                      group, but do not capture (optional
                             (matching the most amount possible)):
----------------------------------------------------------------------
      [_-]                     any character of: '_', '-'
----------------------------------------------------------------------
      \d{3,4}                  digits (0-9) (between 3 and 4 times
                               (matching the most amount possible))
----------------------------------------------------------------------
    )?                       end of grouping
----------------------------------------------------------------------
  )                        end of \1
----------------------------------------------------------------------
  (?:                      group, but do not capture (optional
                           (matching the most amount possible)):
----------------------------------------------------------------------
    [_ -]                    any character of: '_', ' ', '-'
----------------------------------------------------------------------
    \w                       word characters (a-z, A-Z, 0-9, _)
----------------------------------------------------------------------
  )?                       end of grouping
----------------------------------------------------------------------
  [_ ]{0,5}                any character of: '_', ' ' (between 0 and
                           5 times (matching the most amount
                           possible))
----------------------------------------------------------------------
  [rR]                     any character of: 'r', 'R'
----------------------------------------------------------------------
  (?:                      group, but do not capture (optional
                           (matching the most amount possible)):
----------------------------------------------------------------------
    [eE]                     any character of: 'e', 'E'
----------------------------------------------------------------------
    [vV]                     any character of: 'v', 'V'
----------------------------------------------------------------------
  )?                       end of grouping
----------------------------------------------------------------------
  (?:                      group, but do not capture (optional
                           (matching the most amount possible)):
----------------------------------------------------------------------
    \.                       '.'
----------------------------------------------------------------------
  )?                       end of grouping
----------------------------------------------------------------------
   ?                       ' ' (optional (matching the most amount
                           possible))
----------------------------------------------------------------------
  (                        group and capture to \2:
----------------------------------------------------------------------
    [\w\d-]?                 any character of: word characters (a-z,
                             A-Z, 0-9, _), digits (0-9), '-'
                             (optional (matching the most amount
                             possible))
----------------------------------------------------------------------
  )                        end of \2
----------------------------------------------------------------------
   *                       ' ' (0 or more times (matching the most
                           amount possible))
----------------------------------------------------------------------
  (                        group and capture to \3:
----------------------------------------------------------------------
    .*                       any character except \n (0 or more times
                             (matching the most amount possible))
----------------------------------------------------------------------
  )                        end of \3
----------------------------------------------------------------------
)                        end of grouping
----------------------------------------------------------------------

One thing that often makes it easier to decipher a regular expression without resorting to external tools is to put a /x modifier at the end of the regular expression (thus allowing mostly free-form white space within the regex). The /x modifier will allow you to begin inserting whitespace including newlines and tabs into the regex without altering the expression's function. This facilitates grouping portions of the regex together. Of course this isn't going to work out very well if the RE has significant whitespace embedded within it to begin with. In that unusual case you would end up altering the expression's meaning. But for any normal regexp the /x modifier is a first step toward breaking it down into clusters of meaning.

For example, I might get started on your regex like this:

m/^
    (
        \d{3} [_-] [\w\d]{3} [_-] \d{3,4}
        (?:
            [_-] \d{3,4}
        )?
    )
    # ......and so on.
/x

For me, doing this helps me to better visualize what's going on. You can read up on regular expressions in the following PODs: perlrequick (a quickstart guide), perlretut (a more in-depth tutorial), perlre (the definitive source), and perlop. But nothing is so helpful as Jeffrey Friedl's masterpiece book, "Mastering Regular Expressions" (O'Reilly, currently in its 3rd edition).

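Note: I noticed this RE does have literal spaces outside of character classes near the end (the " ?" and the " *"), and /x would silently drop them. Each would be more visible expressed as \x20, and changing them in that way would make it safe to use the /x modifier. The spaces inside bracketed classes such as [_ -] are not affected by /x.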
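To make that concrete, here is a sketch of the whole pattern laid out under /x, with the two bare spaces written as \x20. The filename is made up for illustration, and the comparison against the one-line original is only a sanity check on that single input, not proof that the two patterns are equivalent everywhere.

```perl
use strict;
use warnings;

# The pattern exactly as posted in the question.
my $original = qr/^(\d{3}[_-][\w\d]{3}[_-]\d{3,4}(?:[_-]\d{3,4})?)(?:[_ -]\w)?[_ ]{0,5}[rR](?:[eE][vV])?(?:\.)? ?([\w\d-]?) *(.*)/;

# The same pattern spread out with the x modifier. Spaces inside [...]
# stay as they are; the two bare spaces become \x20.
my $expanded = qr/
    ^
    (                               # capture 1 (dwg)
        \d{3} [_-] [\w\d]{3} [_-] \d{3,4}
        (?: [_-] \d{3,4} )?
    )
    (?: [_ -] \w )?
    [_ ]{0,5}
    [rR] (?: [eE] [vV] )?
    (?: \. )?
    \x20?                           # was a bare space before the ?
    ( [\w\d-]? )                    # capture 2 (rev)
    \x20*                           # was a bare space before the *
    ( .* )                          # capture 3 (rest)
/x;

my $file = '123-ABC-4567_rev. B final.dwg';    # hypothetical filename

my @from_original = $file =~ $original;
my @from_expanded = $file =~ $expanded;

print join( '|', @from_original ), "\n";
print join( '|', @from_expanded ), "\n";
```

Both lines should print 123-ABC-4567|B|final.dwg. One thing to watch when commenting a regex this way: the comments are still part of the pattern's source, so they must not contain the / delimiter or anything that looks like a variable.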


Here is an explanation:

^                   : beginning of string
(                   : start group 1; it populates $dwg
    \d{3}           : 3 digits
    [_-]            : _ or - character
    [\w\d]{3}       : 3 word characters (alphanumeric or _); could be abbreviated as \w{3}
    [_-]            : _ or - character
    \d{3,4}         : 3 or 4 digits
    (?:             : start non-capturing group
        [_-]        : _ or - character
        \d{3,4}     : 3 or 4 digits
    )?              : end of non-capturing group; the group is optional
)                   : end of group 1
(?:                 : start non-capturing group
    [_ -]           : _ or space or - character
    \w              : 1 word character
)?                  : end of non-capturing group; the group is optional
[_ ]{0,5}           : 0 to 5 _ or space characters
[rR]                : r or R
(?:                 : start non-capturing group
    [eE]            : e or E
    [vV]            : v or V
)?                  : end of non-capturing group; the group is optional
(?:\.)?             : an optional dot, not captured
 ?                  : optional space
([\w\d-]?)          : group 2, 0 or 1 word character or -; could be [\w-]?; populates $rev
 *                  : 0 or more spaces
(.*)                : any number of any char but linefeed; populates $rest


It looks like it's extracting a date $dwg, a revision $rev, and a suffix $rest from a filename. Broadly, the date has three or four sections separated by underscores or hyphens; the revision is a single word character or hyphen (possibly empty) that follows "r" or "rev" (in upper or lowercase), an optional dot, and an optional space; and the suffix is everything after the revision and any spaces that follow it. It's fairly messy, and it looks like it's trying to account for many subtly different cases at once.

^                  # After the start of the string,
(                  # $dwg gets
    \d{3}          # three digits,
    [_-]           # a separator,
    [\w\d]{3}      # three word characters,
    [_-]           # another separator,
    \d{3,4}        # three or four digits,
    (?:            # and
        [_-]       # a separator and
        \d{3,4}    # three or four more digits
    )?             # which are optional.
)
(?:                # Next,
    [_ -]          # another separator,
    \w             # followed by a word character,
)?                 # also optional;
[_ ]{0,5}          # a separator up to five characters long,
[rR]               # then "R" or "r",
(?:
    [eE]           # or "rev" in any mix of case,
    [vV]
)?                 # optionally;
(?:
    \.             # a dot,
)?                 # which too is optional;
 ?                 # and an optional space.
(                  # $rev gets
    [\w\d-]?       # an optional word character or dash.
)
 *                 # Any number of spaces later,
(.*)               # $rest gets the rest.
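A quick way to check a reading like this is to run the pattern over a few inputs. The sketch below does that; the filenames and the parse_name helper are invented for illustration and are not from the original code.

```perl
use strict;
use warnings;

# The pattern exactly as posted in the question.
my $re = qr/^(\d{3}[_-][\w\d]{3}[_-]\d{3,4}(?:[_-]\d{3,4})?)(?:[_ -]\w)?[_ ]{0,5}[rR](?:[eE][vV])?(?:\.)? ?([\w\d-]?) *(.*)/;

# Hypothetical helper: returns ($dwg, $rev, $rest), or an empty list
# when the filename does not match.
sub parse_name {
    my ($file) = @_;
    return $file =~ $re;
}

# Made-up filenames, chosen only to exercise different branches.
my @files = (
    '123-ABC-4567-A_Rev.C',          # "-A" is skipped by the uncaptured group
    '100_XYZ_2001 R12 notes.txt',    # only one character lands in $rev
    'drawing.dwg',                   # does not start with three digits
);

for my $file (@files) {
    if ( my ( $dwg, $rev, $rest ) = parse_name($file) ) {
        print "$file => dwg=[$dwg] rev=[$rev] rest=[$rest]\n";
    }
    else {
        print "$file => no match\n";
    }
}
```

The second filename is the interesting one: because $rev holds at most one character, "R12" yields rev=[1] and the "2" ends up at the front of $rest.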


It's just a complex regular expression, which puts the three captured groups from $file into $dwg, $rev and $rest.

While the regular expression is complex, it doesn't use very complex rules, except maybe (?:something), which is a non-capturing group.
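If the (?:...) form is new to you, a small side-by-side may help. This is only a sketch on a cut-down fragment of the pattern, with a made-up input string:

```perl
use strict;
use warnings;

my $text = 'Rev.B';    # made-up input

# Capturing group: the optional "ev" part takes a slot in the result list.
my @with_capture = $text =~ /^[rR]([eE][vV])?\.?(\w)/;

# Non-capturing group: still groups "ev" for the ? quantifier,
# but contributes nothing to the result list.
my @without_capture = $text =~ /^[rR](?:[eE][vV])?\.?(\w)/;

print scalar(@with_capture),    "\n";    # 2 captures: "ev" and "B"
print scalar(@without_capture), "\n";    # 1 capture:  "B"
```

That is why the original can use several groups for optional pieces and still hand back exactly three values.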

See this, for example, as an introduction to perl regular expressions.
