extract word with regular expression

2023-01-02 09:08

I have a string 1/temperatoA,2/CelcieusB!23/33/44,55/66/77 and I would like to extract the words temperatoA and CelcieusB.

I have this regular expression (\d+/(\w+),?)*! but I only get the match 1/temperatoA,2/CelcieusB!

Why?

Your whole match evaluates to '1/temperatoA,2/CelcieusB!' because everything before the '!' matches the following group:

qr{ (       # begin group
      \d+   # at least one digit
      /     # followed by a slash
     (\w+)  # followed by at least one word character
      ,?    # maybe a comma
    )*      # ANY number of repetitions of this group
}x;

'1/temperatoA,' fills capture #1 first, but because the group is repeated with *, the engine keeps going and matches the same pattern again on '2/CelcieusB' (the comma being optional). So the whole match is what you said it is, but what you probably weren't expecting is that each repetition overwrites the previous capture: '2/CelcieusB' replaces '1/temperatoA,' as $1, so $1 reads '2/CelcieusB' (and $2 reads 'CelcieusB').
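The overwriting is not specific to Perl. As a minimal sketch, here is the same pattern run through Python's re module, used only as a stand-in engine; the variable names are made up for illustration:

```python
import re

text = "1/temperatoA,2/CelcieusB!23/33/44,55/66/77"

# The asker's pattern: a repeated group followed by a literal '!'.
m = re.match(r"(\d+/(\w+),?)*!", text)

whole = m.group(0)  # everything up to and including the '!'
outer = m.group(1)  # only the LAST repetition of the outer group survives
inner = m.group(2)  # the same holds for the nested group

print(whole)  # 1/temperatoA,2/CelcieusB!
print(outer)  # 2/CelcieusB
print(inner)  # CelcieusB
```

The first repetition ('1/temperatoA,') was matched, but its capture was overwritten by the second.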

Whenever you want to capture every occurrence of a pattern in a string, it is best to use the global flag (/g) and assign the match to an array. Since an array is not a single scalar like $1, it can hold all the values that were captured for capture #1.

When I do this:

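use strict;
use warnings;
use Data::Dumper;    # provides Dumper()

my $str   = '1/temperatoA,2/CelcieusB!23/33/44,55/66/77';
my $regex = qr{(\d+/(\w+))};
# In list context, /g returns every capture from every match.
if ( my @matches = $str =~ /$regex/g ) {
    print Dumper( \@matches );
}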

I get this:

$VAR1 = [
          '1/temperatoA',
          'temperatoA',
          '2/CelcieusB',
          'CelcieusB',
          '23/33',
          '33',
          '55/66',
          '66'
        ];

Now, I figure that's probably not what you expected. But '3' and '6' are word characters, and so, coming after a slash, they satisfy \w+.

So, if this is an issue, you can tighten your regex to qr{(\d+/(\p{Alpha}\w*))}, specifying that the first character must be alphabetic, followed by any number of word characters. Then the dump looks like this:

$VAR1 = [
          '1/temperatoA',
          'temperatoA',
          '2/CelcieusB',
          'CelcieusB'
        ];

And if you only want 'temperatoA' and 'CelcieusB', then you're capturing more than you need to, and you'll want your regex to be qr{\d+/(\p{Alpha}\w*)}.
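For comparison, a hedged sketch of the loose and the tightened pattern in Python. Python's re module has no \p{Alpha}, so [A-Za-z] is used here as an ASCII-only stand-in; that substitution is an assumption of this example, not part of the answer above.

```python
import re

text = "1/temperatoA,2/CelcieusB!23/33/44,55/66/77"

# \w+ also accepts digits, so numeric fields after the '!' match too.
loose = re.findall(r"\d+/(\w+)", text)

# Requiring a leading ASCII letter (stand-in for \p{Alpha}) drops them.
strict = re.findall(r"\d+/([A-Za-z]\w*)", text)

print(loose)   # ['temperatoA', 'CelcieusB', '33', '66']
print(strict)  # ['temperatoA', 'CelcieusB']
```

With a single capture group, findall returns a flat list of the captured words, which plays the same role as the Perl array.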

However, the secret to capturing more than one chunk with a capture expression is to assign the match to an array. You can then go through the array to see if it contains the data you want.

The question here is: why are you using a regular expression that’s so obviously wrong? How did you get it?

The expression you want is simply as follows:

(\w+)

With a Perl-compatible regex engine you can search for

(?<=\d/)\w+(?=.*!)

(?<=\d/) asserts that there is a digit and a slash before the start of the match

\w+ matches the identifier. This allows for letters, digits and underscore. If you only want to allow letters, use [A-Za-z]+ instead.

(?=.*!) asserts that there is a ! ahead in the string, i.e. the regex will fail once we have passed the !.
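The lookaround pattern can be tried without the C library. Below is a minimal sketch in Python, whose re module accepts the same fixed-width lookbehind and lookahead syntax; the variable names are illustrative only.

```python
import re

text = "1/temperatoA,2/CelcieusB!23/33/44,55/66/77"

# Lookbehind: a digit and a slash must precede the match.
# Lookahead: a '!' must still appear somewhere later in the string.
pattern = re.compile(r"(?<=\d/)\w+(?=.*!)")

words = pattern.findall(text)
print(words)  # ['temperatoA', 'CelcieusB']
```

Because the lookarounds consume nothing, each match is just the word itself and no capture group is needed.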

Depending on the language you're using, you might need to escape some of the characters in the regex.

For example, for use in C (with the PCRE library), you need to escape the backslashes:

myregexp = pcre_compile("(?<=\\d/)\\w+(?=.*!)", 0, &error, &erroroffset, NULL);

Will this work?

/([[:alpha:]]\w+)\b(?=.*!)

I made the following assumptions...

A word begins with an alphabetic character.
A word always immediately follows a slash. No intervening spaces, no words in the middle.
Words after the exclamation point are ignored.
You have some sort of loop to capture more than one word. I'm not familiar enough with the C library to give an example.

[[:alpha:]] matches any alphabetic character.

The \b matches a word boundary.

And the (?=.*!) came from Tim Pietzcker's post.
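The loop mentioned in the assumptions might look like the following sketch. It is written in Python rather than C, the function name words_before_bang is hypothetical, and [A-Za-z] replaces [[:alpha:]] because Python's re module does not support POSIX classes.

```python
import re

def words_before_bang(text):
    """Collect the words that follow a slash and precede the '!'."""
    # ASCII-only stand-in for: /([[:alpha:]]\w+)\b(?=.*!)
    pattern = re.compile(r"/([A-Za-z]\w+)\b(?=.*!)")
    found = []
    for match in pattern.finditer(text):
        found.append(match.group(1))  # group 1 is the word without the slash
    return found

print(words_before_bang("1/temperatoA,2/CelcieusB!23/33/44,55/66/77"))
# ['temperatoA', 'CelcieusB']
```

Note that a letter followed by \w+ needs at least two characters, so a one-letter word would be skipped; using \w* instead would allow it.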
