Extract words with a regular expression

I have a string 1/temperatoA,2/CelcieusB!23/33/44,55/66/77 and I would like to extract the words temperatoA and CelcieusB.

I have this regular expression (\d+/(\w+),?)*! but I only get the match 1/temperatoA,2/CelcieusB!

Why?


Your whole match evaluates to '1/temperatoA,2/CelcieusB!' because that is what the following expression matches:

qr{ (       # begin group
      \d+   # at least one digit
      /     # followed by a slash
     (\w+)  # followed by at least one word character
      ,?    # maybe a comma
    )*      # ANY number of repetitions of this pattern
    !       # followed by a literal exclamation mark
}x;

'1/temperatoA,' fulfills capture #1 first, but since you are asking the engine to match as many of those as it can, it carries on and finds that the pattern is repeated in '2/CelcieusB' (the comma being optional). So the whole match is what you said it is, but what you probably weren't expecting is that '2/CelcieusB' replaces '1/temperatoA,' as $1, so $1 reads '2/CelcieusB'. A capture group inside a repetition keeps only what its last iteration matched.
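
To see the overwriting happen, here is a minimal sketch that runs the pattern from the question and reports the whole match together with $1 and $2. The helper name last_iteration_captures is made up for this illustration and is not part of the original question.

```perl
use strict;
use warnings;

# Hypothetical helper: applies the question's pattern once and returns
# the whole match, capture #1 (outer group) and capture #2 (inner group).
sub last_iteration_captures {
    my ($str) = @_;
    if ( $str =~ m{(\d+/(\w+),?)*!} ) {
        return ( $&, $1, $2 );
    }
    return;
}

my ( $whole, $outer, $inner ) =
    last_iteration_captures('1/temperatoA,2/CelcieusB!23/33/44,55/66/77');

print "whole match: $whole\n";    # 1/temperatoA,2/CelcieusB!
print "\$1: $outer\n";            # 2/CelcieusB
print "\$2: $inner\n";            # CelcieusB
```

Only the second repetition survives in $1 and $2; 'temperatoA' was captured on the first pass and then overwritten.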

Any time you want to capture everything that fits a certain pattern in a string, it is usually best to use the global flag and assign the captures to an array. Since an array is not a single scalar like $1, it can hold all the values that were captured for capture #1.

When I do this:

use strict;
use warnings;
use Data::Dumper;

my $str   = '1/temperatoA,2/CelcieusB!23/33/44,55/66/77';
my $regex = qr{(\d+/(\w+))};
if ( my @matches = $str =~ /$regex/g ) {
    print Dumper( \@matches );
}

I get this:

$VAR1 = [
          '1/temperatoA',
          'temperatoA',
          '2/CelcieusB',
          'CelcieusB',
          '23/33',
          '33',
          '55/66',
          '66'
        ];

Now, I figure that's probably not what you expected. But '3' and '6' are word characters, and so--coming after a slash--they comply with the expression.

So, if this is an issue, you can tighten your regex to qr{(\d+/(\p{Alpha}\w*))}, specifying that the first character must be alphabetic, followed by any number of word characters. Then the dump looks like this:

$VAR1 = [
          '1/temperatoA',
          'temperatoA',
          '2/CelcieusB',
          'CelcieusB'
        ];

And if you only want 'temperatoA' or 'CelcieusB', then you're capturing more than you need to and you'll want your regex to be qr{\d+/(\p{Alpha}\w*)}.
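
As a rough sketch of that last variant, the following wraps the single-capture regex in a small sub and collects every word with a global match in list context. The name words_after_slash is invented for this example.

```perl
use strict;
use warnings;

# Hypothetical helper: returns every word that directly follows "<digits>/".
# With exactly one capture group, a /g match in list context returns
# the list of everything that group captured.
sub words_after_slash {
    my ($str) = @_;
    my @words = $str =~ m{\d+/(\p{Alpha}\w*)}g;
    return @words;
}

my @found = words_after_slash('1/temperatoA,2/CelcieusB!23/33/44,55/66/77');
print join( ', ', @found ), "\n";    # temperatoA, CelcieusB
```

Nothing after the '!' is picked up here only because every field there starts with a digit; this regex does not look at the '!' itself.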

However, the secret to capturing more than one chunk with a capture expression is to assign the match to an array; you can then sort through the array to see if it contains the data you want.


The question here is: why are you using a regular expression that’s so obviously wrong? How did you get it?

The expression you want is simply as follows:

(\w+)


With a Perl-compatible regex engine you can search for

(?<=\d/)\w+(?=.*!)

(?<=\d/) asserts that there is a digit and a slash before the start of the match

\w+ matches the identifier. This allows for letters, digits and underscore. If you only want to allow letters, use [A-Za-z]+ instead.

(?=.*!) asserts that there is a ! ahead in the string, i.e. the regex will fail once we have passed the !.

Depending on the language you're using, you might need to escape some of the characters in the regex.

For example, for use in C (with the PCRE library), you need to escape the backslashes:

myregexp = pcre_compile("(?<=\\d/)\\w+(?=.*!)", 0, &error, &erroroffset, NULL);
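
For comparison, here is the same lookaround pattern tried out in Perl, where no extra escaping is needed. This is only a sketch; the sub name words_before_bang is an assumption made for the example.

```perl
use strict;
use warnings;

# Hypothetical helper: returns each identifier that is preceded by
# "<digit>/" and still has a '!' somewhere ahead of it in the string.
sub words_before_bang {
    my ($str) = @_;
    # With no capture groups, a /g match in list context returns
    # every whole match.
    my @words = $str =~ m{(?<=\d/)\w+(?=.*!)}g;
    return @words;
}

my @found = words_before_bang('1/temperatoA,2/CelcieusB!23/33/44,55/66/77');
print join( ', ', @found ), "\n";    # temperatoA, CelcieusB
```

The fields after the '!' (33, 44 and so on) are preceded by a digit and a slash, but the lookahead fails for them because no '!' remains ahead.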


Will this work?

/([[:alpha:]]\w+)\b(?=.*!)

I made the following assumptions...

  1. A word begins with an alphabetic character.
  2. A word always immediately follows a slash. No intervening spaces, no words in the middle.
  3. Words after the exclamation point are ignored.
  4. You have some sort of loop to capture more than one word. I'm not familiar enough with the C library to give an example.

[[:alpha:]] matches any alphabetic character.

The \b matches a word boundary.

And the (?=.*!) came from Tim Pietzcker's post.
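
The loop mentioned in assumption 4 could look roughly like this in Perl (not the C library). It is a sketch only, and collect_words is a name made up for the example.

```perl
use strict;
use warnings;

# Hypothetical helper: walks the string with a global match in scalar
# context, so each pass of the loop resumes where the previous match ended.
sub collect_words {
    my ($str) = @_;
    my @words;
    while ( $str =~ m{/([[:alpha:]]\w+)\b(?=.*!)}g ) {
        push @words, $1;    # $1 is the word after the slash
    }
    return @words;
}

my @found = collect_words('1/temperatoA,2/CelcieusB!23/33/44,55/66/77');
print "$_\n" for @found;    # temperatoA, then CelcieusB
```

Note that [[:alpha:]]\w+ needs at least two characters, so a single-letter word would be skipped; [[:alpha:]]\w* would accept it.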
