Regex and PHP question: need a non-greedy search!
I am having trouble trying to write a non-greedy regex statement.
Here is my string:
<strong>name</strong><strong>address</strong>mailto:blabla@email.com
Here is my regex query:
<strong>(.*?)</strong>.*?([A-Z0-9._%-]+@[A-Z0-9.-]+\.[A-Z]{2,4})
The problem is that I need the address, not the name, from the string. So I need the regex query to be non-greedy and take the closest <strong></strong> instead of the farthest away.
There are also multiple instances of this in my search string, so the regex has to match every instance; just adding a greedy .* in front of it is not enough.
So it would have to match all the instances of this, and pull the addresses, not names:
<strong>name</strong><strong>address1</strong>mailto:blabla@email.com
<strong>name</strong><strong>address2</strong>mailto:blabla@email.com
<strong>name</strong><strong>address3</strong>mailto:blabla@email.com
<strong>name</strong><strong>address4</strong>mailto:blabla@email.com
Thanks in advance!
First, regular expressions are a suboptimal tool for matching HTML (this question being a good example of why). You'll be happier with a parser if you know how to use one (maybe one of the PHP gurus can recommend one).
Having said that, a better way with regexes would probably be to match (and discard) the first <strong> tag explicitly:
<strong>.*?</strong><strong>(.*?)</strong>.*?([A-Z0-9._%-]+@[A-Z0-9.-]+\.[A-Z]{2,4})
This is by no means a good, reliable, bulletproof solution, but at least it works for your sample data.
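To see how the pattern above behaves on the sample data, here is a minimal sketch. It assumes the pattern is run with the case-insensitive flag (the [A-Z] classes would otherwise reject lowercase addresses); the variable names $html and $pattern are illustrative, not from the question.

```php
<?php
// Sample input copied from the question (two of the four lines).
$html = "<strong>name</strong><strong>address1</strong>mailto:blabla@email.com\n"
      . "<strong>name</strong><strong>address2</strong>mailto:blabla@email.com\n";

// Assumption: the /i flag is added so [A-Z] also matches lowercase letters.
// The first <strong>...</strong> pair is matched but not captured.
$pattern = '~<strong>.*?</strong><strong>(.*?)</strong>.*?'
         . '([A-Z0-9._%-]+@[A-Z0-9.-]+\.[A-Z]{2,4})~i';

// PREG_SET_ORDER groups the captures per match instead of per group.
preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);

foreach ($matches as $m) {
    // $m[1] is the address, $m[2] is the email.
    echo $m[1], ' => ', $m[2], "\n";
}
```

Because . does not match a newline by default, each match stays on its own line of the input.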
Or, if you can be more specific about what's allowed between/after the relevant tag, how about this:
<strong>([^<>]*)</strong>(?:mailto:)?([A-Z0-9._%-]+@[A-Z0-9.-]+\.[A-Z]{2,4})
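A rough sketch of how this stricter pattern might be used, again assuming the case-insensitive flag. The helper name extractAddressEmailPairs and its return shape are hypothetical. The second input line is an invented variation that puts extra markup between the tag and the email, to show that the stricter pattern skips it rather than guessing.

```php
<?php
// Hypothetical helper; the name and return shape are illustrative only.
function extractAddressEmailPairs(string $html): array
{
    // [^<>]* cannot cross a tag boundary, so only a <strong> element that is
    // immediately followed by the (optionally mailto:-prefixed) email matches.
    $pattern = '~<strong>([^<>]*)</strong>(?:mailto:)?'
             . '([A-Z0-9._%-]+@[A-Z0-9.-]+\.[A-Z]{2,4})~i';
    preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);

    $pairs = [];
    foreach ($matches as $m) {
        $pairs[] = ['address' => $m[1], 'email' => $m[2]];
    }
    return $pairs;
}

// First line is from the question; the second is an invented counterexample.
$html = "<strong>name</strong><strong>address1</strong>mailto:blabla@email.com\n"
      . "<strong>name</strong><strong>address2</strong> <em>note</em> mailto:blabla@email.com\n";

$pairs = extractAddressEmailPairs($html);
print_r($pairs);
```

Only the first line yields a pair here. That is the trade-off of being more specific: fewer false matches, but any deviation from the expected layout is silently dropped.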
Looking at your test data, here are the rules I infer: If...
- Name and Address are both wrapped in STRONG elements and the email follows immediately, AND
- The STRONG elements' attributes, the name and the addresses all have no angle brackets, AND
- The email address component always begins with mailto:, AND
- There are no other HTML elements within the two STRONG elements,
Then this tested code should do the trick:
$re = '%
    # Capture name and address in <strong> element then email.
    <strong[^>]*>\s*([^<>]+)</strong\s*>\s*  # $1: Name.
    <strong[^>]*>\s*([^<>]+)</strong\s*>\s*  # $2: Address.
    (mailto:\S+)                             # $3: Email.
    %ix';
$count = preg_match_all($re, $text, $matches);
if ($count) {
    printf("%d matches found:\n", $count);
    print_r($matches);
    for ($i = 0; $i < $count; ++$i) {
        printf("Match %d: Name: \"%s\", Address: \"%s\", Email: \"%s\":\n",
            $i + 1, $matches[1][$i], $matches[2][$i], $matches[3][$i]);
    }
} else {
    printf("No matches found.\n");
}
Don't use regular expressions for parsing HTML.
See http://htmlparsing.com/php.html
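As one possible illustration of the parser route, here is a minimal sketch using PHP's bundled DOM extension (DOMDocument and DOMXPath). It assumes that extension is available, and it wraps the sample fragment in a <div>, which is not in the original data, so the text nodes have a well-defined parent. It is a sketch under those assumptions, not necessarily the approach the linked page recommends.

```php
<?php
// Sample lines from the question, wrapped in a <div> (an assumption made here
// so the fragment has a single container element).
$html = '<div>'
      . "<strong>name</strong><strong>address1</strong>mailto:blabla@email.com\n"
      . "<strong>name</strong><strong>address2</strong>mailto:blabla@email.com\n"
      . '</div>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$results = [];
foreach ($xpath->query('//strong') as $strong) {
    $next = $strong->nextSibling;
    // Keep only <strong> elements directly followed by a "mailto:" text node;
    // the "name" elements are followed by another <strong>, so they drop out.
    if ($next instanceof DOMText
        && strpos(ltrim($next->nodeValue), 'mailto:') === 0) {
        $email = trim(substr(ltrim($next->nodeValue), strlen('mailto:')));
        $results[] = ['address' => $strong->textContent, 'email' => $email];
    }
}

print_r($results);
```

The selection logic depends on document structure (which node follows which) rather than on character patterns, so attributes or whitespace inside the tags do not affect it.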