RegEx a RegEx match
I'm having trouble building the correct regex for my string. What I want to do is get all entities from my string; they start and end with '. The entities are identifiable by a run of digits with a # in front. However, entities (in this case a phone number starting with #) that don't start or end with ' should not be matched at all.
I hope someone can help me, or at least tell me that what I want to do isn't possible in one regex. Thanks :)
String:
'Blaa lablalbl balbla balb lbal '#39'blaaaaaaaa'#39' ('#39#226#8218#172#39') blaaaaaaaa #7478347878347834 blaaaa blaaaa'
RegEx:
'[#[0-9]+]*'
Wanted matches:
'#39'
'#39'
'#39'
'#226'
'#8218'
'#172'
'#39'
Found matches:
'#39'
'#39'
'#39#226#8218#172#39' <- Needs to be split (if possible in the same RegEx)
Another RegEx:
#[0-9]+
Found matches:
'#39'
'#39'
'#39'
'#226'
'#8218'
'#172'
'#39'
'#7478347878347834' <- Should not be here :(
Language: C# .NET (4.0)
This is awkward to do in one regex; the straightforward way is to use two:
First take all matches that are between single quotes:
'[\d#]+'
Then, within each of those matches, pick out the individual entities with this:
#\d+
So you'll end up with something like (in C#):
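// First pass: quoted runs made up only of '#' and digits.
// (Match is named explicitly; on .NET 4.0, var would type m as object.)
foreach (Match m in Regex.Matches(inputString, @"'[\d#]+'"))
{
    // Second pass: each individual #number inside that run.
    foreach (Match m2 in Regex.Matches(m.Value, @"#\d+"))
    {
        yield return m2.Value;
    }
}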
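For anyone who wants to try the two-pass idea outside an iterator method, here is a minimal console sketch run against the string from the question. The class name `TwoPassEntities` and the method name `Extract` are made up for illustration and are not part of the original answer; it collects into a list instead of using `yield return`.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class TwoPassEntities
{
    // Hypothetical helper: returns every #number that sits inside a
    // quoted run consisting only of '#' and digits.
    public static List<string> Extract(string input)
    {
        var result = new List<string>();

        // Pass 1: quoted runs such as '#39' or '#39#226#8218#172#39'.
        foreach (Match quoted in Regex.Matches(input, @"'[\d#]+'"))
        {
            // Pass 2: split each run into its individual #number entities.
            foreach (Match entity in Regex.Matches(quoted.Value, @"#\d+"))
            {
                result.Add(entity.Value);
            }
        }

        return result;
    }

    public static void Main()
    {
        // The sample string from the question, outer quotes included.
        string input = "'Blaa lablalbl balbla balb lbal '#39'blaaaaaaaa'#39' ('#39#226#8218#172#39') blaaaaaaaa #7478347878347834 blaaaa blaaaa'";

        // The unquoted #7478347878347834 never survives pass 1,
        // so it cannot show up in the result.
        Console.WriteLine(string.Join(", ", Extract(input)));
    }
}
```

Note that pass 1 also accepts quoted runs like '39' or '##'; pass 2 simply finds nothing (or fewer entities) in those, so no extra filtering should be needed for input shaped like the question's.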
Assuming you can use lookbehind/lookahead and that your regex engine supports variable-length lookbehind (JGSoft / .NET only):
(?<='[#0-9]*)#\d+(?=[#0-9]*')
This should work. I tested it and got these results:
1. #39
2. #39
3. #39
4. #226
5. #8218
6. #172
7. #39
Breaking it down is pretty simple:
(?<=          # Start positive lookbehind: ensure that the text before the cursor
              # matches the following pattern:
  '           #   Match a literal '
  [#0-9]*     #   Match # or 0-9, zero or more times
)             # End lookbehind
#\d+          # Match a literal #, followed by one or more digits
(?=           # Start positive lookahead: ensure the text after the cursor matches (without advancing)
  [#0-9]*     #   Match # or 0-9, zero or more times
  '           #   Match a literal '
)             # End lookahead
So this pattern will match #\d+ only if the text before it is '[#0-9]* and the text after it is [#0-9]*'.
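Since the question targets C#, the lookaround pattern above can be tried with a small console sketch like the following. The class name `LookaroundEntities` and the method name `Extract` are hypothetical, chosen only for this example; the pattern itself is the one from this answer, unchanged.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LookaroundEntities
{
    // Single-pass pattern: a #number preceded by ' plus any run of
    // '#'/digits, and followed by any run of '#'/digits plus '.
    // This relies on variable-length lookbehind, which .NET supports.
    private static readonly Regex EntityPattern =
        new Regex(@"(?<='[#0-9]*)#\d+(?=[#0-9]*')");

    // Hypothetical helper: returns the matched entities without quotes.
    public static List<string> Extract(string input)
    {
        var result = new List<string>();
        foreach (Match m in EntityPattern.Matches(input))
        {
            result.Add(m.Value);
        }
        return result;
    }

    public static void Main()
    {
        // The sample string from the question, outer quotes included.
        string input = "'Blaa lablalbl balbla balb lbal '#39'blaaaaaaaa'#39' ('#39#226#8218#172#39') blaaaaaaaa #7478347878347834 blaaaa blaaaa'";

        // The phone number is preceded by a space, so the lookbehind
        // fails there and it is skipped.
        Console.WriteLine(string.Join(", ", Extract(input)));
    }
}
```

Because the lookarounds do not consume the quotes, each match value is just the #number, and neighbouring entities in the same quoted run can share the same opening and closing quote.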
As you don't specify a language, here is a solution in Perl:
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;
my $s = qq!Blaa lablalbl balbla balb lbal '#39'blaaaaaaaa'#39' ('#39#226#8218#172#39') blaaaaaaaa #7478347878347834 blaaaa blaaaa!;
my @n = $s =~ /(?<=['#\d])(#\d+)(?=[#'\d])/g;
print Dumper(\@n);
Output:
$VAR1 = [
          '#39',
          '#39',
          '#39',
          '#226',
          '#8218',
          '#172',
          '#39'
        ];