RegEx a RegEx match

I'm having trouble building the correct regex for my string. What I want to do is get all entities from my string; they start and end with '. An entity is a # followed by one or more digits. However, entities that don't start or end with ' (in this case a phone number starting with #) should not be matched at all.

I hope someone can help me, or at least tell me that what I want to do isn't possible in one regex. Thanks :)

String:

'Blaa lablalbl balbla balb lbal '#39'blaaaaaaaa'#39' ('#39#226#8218#172#39') blaaaaaaaa #7478347878347834 blaaaa blaaaa'

RegEx:

'[#[0-9]+]*'

Wanted matches:

  • '#39'
  • '#39'
  • '#39'
  • '#226'
  • '#8218'
  • '#172'
  • '#39'

Found matches:

  • '#39'
  • '#39'
  • '#39#226#8218#172#39' <- Needs to be split (if possible in the same RegEx)

Another RegEx:

#[0-9]+

Found matches:

  • '#39'
  • '#39'
  • '#39'
  • '#226'
  • '#8218'
  • '#172'
  • '#39'
  • '#7478347878347834' <- Should not be here :(

Language: C# .NET (4.0)


This is awkward to do in one regex; the simplest approach uses two:

First take all matches that are between single quotes:

'[\d#]+'

Then over all those matches, do this:

#\d+

So you'll end up with something like (in C#):

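// Needs: using System.Collections.Generic; using System.Text.RegularExpressions;
static IEnumerable<string> GetEntities(string inputString)
{
    // Step 1: quoted runs made up only of '#' and digits, e.g. '#39#226'.
    // MatchCollection is non-generic in .NET 4.0, so the loop variable is typed as Match.
    foreach (Match m in Regex.Matches(inputString, @"'[\d#]+'"))
    {
        // Step 2: split each run into its individual #NNN entities.
        foreach (Match m2 in Regex.Matches(m.Value, @"#\d+"))
        {
            yield return m2.Value;
        }
    }
}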
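
If a single pattern is preferred, one possible variation on the two-step idea relies on a .NET-specific feature: a capturing group inside a repetition keeps every repetition in Group.Captures, not just the last one. The sketch below assumes each quoted run consists only of #NNN entities; the class name EntityCaptures and the method name Extract are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the answer above.
static class EntityCaptures
{
    public static List<string> Extract(string input)
    {
        var result = new List<string>();
        // '(#\d+)+' matches a quoted run such as '#39#226';
        // group 1 is repeated once per entity inside the run.
        foreach (Match m in Regex.Matches(input, @"'(#\d+)+'"))
        {
            // .NET records every repetition of group 1 in Captures.
            foreach (Capture c in m.Groups[1].Captures)
            {
                result.Add(c.Value);
            }
        }
        return result;
    }
}
```

Calling EntityCaptures.Extract on the string from the question should give the seven wanted entities in order and skip the unquoted phone number. Unlike '[\d#]+', this pattern rejects quoted runs such as '39' or '##' that contain no well-formed entity.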


Assuming you can use lookbehinds/lookaheads and that your regex engine supports variable-length lookbehinds (for example JGSoft or .NET):

(?<='[#0-9]*)#\d+(?=[#0-9]*')

Should work... Tested it using this site and got these results:

   1. #39
   2. #39
   3. #39
   4. #226
   5. #8218
   6. #172
   7. #39

Breaking it down is pretty simple:

(?<=        # Start positive lookbehind group - assure that the text before the cursor
            # matches the following pattern: 
  '         # Match the literal '
  [#0-9]*   # Matches #, 0-9, zero or more times
)           # End lookbehind...
#\d+        # Match literal #, followed by one or more digits
(?=         # Start lookahead -- Ensures text after cursor matches (without advancing)
  [#0-9]*   # Allow #, 0-9, zero or more times
  '         # Match a literal '
)

So, this pattern will match #\d+ only if the text immediately before it matches '[#0-9]* and the text immediately after it matches [#0-9]*'.
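
Since the question targets C#, here is a minimal sketch of how the lookaround pattern could be applied there. The class name EntityLookaround and the method name Extract are hypothetical; only Regex, Match and Matches come from System.Text.RegularExpressions.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper wrapping the lookaround pattern.
static class EntityLookaround
{
    // The * inside (?<=...) makes the lookbehind variable-length,
    // which the .NET regex engine accepts.
    private static readonly Regex Pattern =
        new Regex(@"(?<='[#0-9]*)#\d+(?=[#0-9]*')");

    public static List<string> Extract(string input)
    {
        var result = new List<string>();
        foreach (Match m in Pattern.Matches(input))
        {
            // Each match is a single #NNN entity; the quotes are not consumed.
            result.Add(m.Value);
        }
        return result;
    }
}
```

Because the lookarounds do not consume the quotes or the neighbouring entities, every #NNN inside a run like '#39#226#8218#172#39' is returned as its own match, and no second pass is needed.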


As you don't specify a language, here is a solution in Perl:

#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;

my $s = qq!Blaa lablalbl balbla balb lbal '#39'blaaaaaaaa'#39' ('#39#226#8218#172#39') blaaaaaaaa #7478347878347834 blaaaa blaaaa!;

my @n = $s =~ /(?<=['#\d])(#\d+)(?=[#'\d])/g;

print Dumper(\@n);

Output:

$VAR1 = [
          '#39',
          '#39',
          '#39',
          '#226',
          '#8218',
          '#172',
          '#39'
        ];
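
The same pattern can be carried over to C# unchanged, since its lookbehind inspects exactly one character. The sketch below is only an illustration; the class name EntityFixedWidth and the method name Extract are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper: the Perl pattern above, run through the .NET engine.
static class EntityFixedWidth
{
    public static List<string> Extract(string input)
    {
        var result = new List<string>();
        // The lookbehind and lookahead each check a single neighbouring
        // character, so no variable-length lookbehind support is needed.
        foreach (Match m in Regex.Matches(input, @"(?<=['#\d])(#\d+)(?=[#'\d])"))
        {
            result.Add(m.Groups[1].Value);
        }
        return result;
    }
}
```

One thing to keep in mind: because only the adjacent characters are checked, this pattern does not actually require enclosing quotes. On an input such as x#12#34 y it reports #3, so it fits best when unquoted entities never appear back to back.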