Extract words with a regular expression
I have a string 1/temperatoA,2/CelcieusB!23/33/44,55/66/77
and I would like to extract the words temperatoA
and CelcieusB
.
I have this regular expression (\d+/(\w+),?)*!
but I only get the match 1/temperatoA,2/CelcieusB!
Why?
Your whole match evaluates to '1/temperatoA,2/CelcieusB'
because that matches the following expression:
qr{ ( # begin group
\d+ # at least one digit
/ # followed by a slash
(\w+) # followed by at least one word character
,? # maybe a comma
)* # ANY number of repetitions of this pattern.
}x;
'1/temperatoA,' satisfies capture #1 first, but because the group is followed by *, the engine keeps going and finds that the pattern repeats in '2/CelcieusB' (the comma being optional). So the whole match is what you said it is, but what you probably weren't expecting is that '2/CelcieusB' replaces '1/temperatoA,' as $1. A repeated group keeps only its last iteration, so $1 reads '2/CelcieusB'.
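A minimal sketch of that overwriting behaviour, using the string and pattern from the question. The variable names ($whole, $group, $word) are only illustrative:

```perl
use strict;
use warnings;

my $str = '1/temperatoA,2/CelcieusB!23/33/44,55/66/77';
my ($whole, $group, $word);

# A single match: the starred group runs twice, and each pass overwrites $1 and $2.
if ( $str =~ m{(\d+/(\w+),?)*!} ) {
    ($whole, $group, $word) = ($&, $1, $2);
    print "whole match: $whole\n";   # 1/temperatoA,2/CelcieusB!
    print "\$1: $group\n";           # 2/CelcieusB
    print "\$2: $word\n";            # CelcieusB
}
```

Only the values from the last repetition survive, which is why 'temperatoA' never shows up in a capture variable.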
Whenever you want to capture every substring that fits a pattern, it is usually best to use the global flag (/g) and assign the match to an array. Unlike a single scalar such as $1, an array can hold all the values that were captured for capture #1.
When I do this:
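use strict;
use warnings;
use Data::Dumper;
my $str = '1/temperatoA,2/CelcieusB!23/33/44,55/66/77';
my $regex = qr{(\d+/(\w+))};
# /g in list context returns every capture from every match
if ( my @matches = $str =~ /$regex/g ) {
    print Dumper( \@matches );
}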
I get this:
$VAR1 = [
'1/temperatoA',
'temperatoA',
'2/CelcieusB',
'CelcieusB',
'23/33',
'33',
'55/66',
'66'
];
Now, I figure that's probably not what you expected. But '3'
and '6'
are word characters, and so--coming after a slash--they comply with the expression.
So, if this is an issue, you can tighten your regex to: qr{(\d+/(\p{Alpha}\w*))}
, specifying that the first character must be an alpha followed by any number of word characters. Then the dump looks like this:
$VAR1 = [
'1/temperatoA',
'temperatoA',
'2/CelcieusB',
'CelcieusB'
];
And if you only want 'temperatoA'
or 'CelcieusB'
, then you're capturing more than you need to and you'll want your regex to be qr{\d+/(\p{Alpha}\w*)}
.
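As a short sketch, here is the narrower pattern applied to the question's string. The array name @words is an assumption made for illustration:

```perl
use strict;
use warnings;

my $str = '1/temperatoA,2/CelcieusB!23/33/44,55/66/77';

# With a single capture group, /g in list context returns one item per match.
my @words = $str =~ m{\d+/(\p{Alpha}\w*)}g;

print join(', ', @words), "\n";   # temperatoA, CelcieusB
```

The numeric fields after the ! are skipped because \p{Alpha} rejects a leading digit.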
However, the secret to capturing more than one chunk with a single capture expression is to assign the match to an array. You can then sort through the array to see if it contains the data you want.
The question here is: why are you using a regular expression that’s so obviously wrong? How did you get it?
The expression you want is simply as follows:
(\w+)
With a Perl-compatible regex engine you can search for
(?<=\d/)\w+(?=.*!)
(?<=\d/)
asserts that there is a digit and a slash before the start of the match
\w+
matches the identifier. This allows for letters, digits and underscore. If you only want to allow letters, use [A-Za-z]+
instead.
(?=.*!)
asserts that there is a !
ahead in the string, i.e. the regex will fail once we have passed the !
.
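For comparison, a sketch of the same lookaround pattern run from Perl rather than C, assuming the string from the question. The name @words is illustrative only:

```perl
use strict;
use warnings;

my $str = '1/temperatoA,2/CelcieusB!23/33/44,55/66/77';

# No capture groups here, so /g in list context returns each whole match.
# The lookbehind anchors the word after "digit + slash"; the lookahead
# requires a "!" somewhere later in the string.
my @words = $str =~ m{(?<=\d/)\w+(?=.*!)}g;

print join(', ', @words), "\n";   # temperatoA, CelcieusB
```

Because the lookarounds consume nothing, the matches are the bare words, with no capture group needed.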
Depending on the language you're using, you might need to escape some of the characters in the regex.
E.g., for use in C (with the PCRE library), you need to escape the backslashes:
myregexp = pcre_compile("(?<=\\d/)\\w+(?=.*!)", 0, &error, &erroroffset, NULL);
Will this work?
/([[:alpha:]]\w+)\b(?=.*!)
I made the following assumptions...
- A word begins with an alphabetic character.
- A word always immediately follows a slash. No intervening spaces, no words in the middle.
- Words after the exclamation point are ignored.
- You have some sort of loop to capture more than one word. I'm not familiar enough with the C library to give an example.
[[:alpha:]]
matches any alphabetic character.
The \b
matches a word boundary.
And the (?=.*!)
came from Tim Pietzcker's post.
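The loop mentioned in the assumptions could look like the following in Perl. This is only a sketch under those same assumptions, with @words as an illustrative name. Note that [[:alpha:]]\w+ needs at least two characters, so a one-letter word would be skipped:

```perl
use strict;
use warnings;

my $str = '1/temperatoA,2/CelcieusB!23/33/44,55/66/77';
my @words;

# In scalar context, /g resumes from where the previous match ended,
# so each pass through the loop picks up the next word.
while ( $str =~ m{/([[:alpha:]]\w+)\b(?=.*!)}g ) {
    push @words, $1;
}

print join(', ', @words), "\n";   # temperatoA, CelcieusB
```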