
Regex and php question, need non-greedy search!

I am having trouble trying to write a non-greedy regex statement.

Here is my string:

<strong>name</strong><strong>address</strong>mailto:blabla@email.com

Here is my regex query:

<strong>(.*?)</strong>.*?([A-Z0-9._%-]+@[A-Z0-9.-]+\.[A-Z]{2,4})

The problem is that I need the address, not the name, from the string. So I need the regex to be non-greedy and capture the <strong></strong> closest to the email instead of the one farthest away.

There are also multiple instances of this in my search string, so the pattern has to match every instance; just putting a greedy .* in front of it won't work.

So it would have to match all the instances of this, and pull the addresses, not names:

   <strong>name</strong><strong>address1</strong>mailto:blabla@email.com
   <strong>name</strong><strong>address2</strong>mailto:blabla@email.com
   <strong>name</strong><strong>address3</strong>mailto:blabla@email.com
   <strong>name</strong><strong>address4</strong>mailto:blabla@email.com

Thanks in advance!


First, regular expressions are a poor tool for matching HTML, and this question is a good example of why. You'll be happier with a parser if you know how to use one (maybe one of the PHP gurus can recommend one).

Having said that, a better way with regexes would probably be to match (and discard) the first <strong> tag explicitly:

<strong>.*?</strong><strong>(.*?)</strong>.*?([A-Z0-9._%-]+@[A-Z0-9.-]+\.[A-Z]{2,4})

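This is by no means a good, reliable, bulletproof solution, but at least it works for your sample data, provided you match case-insensitively (the i modifier), since the character classes only list uppercase letters.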
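
To see the difference, here is a minimal sketch that runs the question's pattern and the one above against two of the sample lines. The ~ delimiters, the i modifier and the variable names are my own choices, not something from the question.

```php
<?php
// Sample input copied from the question, one record per line.
$text = <<<HTML
<strong>name</strong><strong>address1</strong>mailto:blabla@email.com
<strong>name</strong><strong>address2</strong>mailto:blabla@email.com
HTML;

// Assumption: the i modifier is added because the character classes
// only list A-Z and the sample email is lowercase.
$original = '~<strong>(.*?)</strong>.*?([A-Z0-9._%-]+@[A-Z0-9.-]+\.[A-Z]{2,4})~i';
$fixed    = '~<strong>.*?</strong><strong>(.*?)</strong>.*?([A-Z0-9._%-]+@[A-Z0-9.-]+\.[A-Z]{2,4})~i';

preg_match_all($original, $text, $before);
preg_match_all($fixed, $text, $after);

print_r($before[1]); // name, name
print_r($after[1]);  // address1, address2
```

The lazy quantifier does not change where a match starts: the engine still begins at the leftmost <strong>, so the first group ends up holding the name unless that first tag is consumed explicitly.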

Or, if you can be more specific about what's allowed between/after the relevant tag, how about this:

<strong>([^<>]*)</strong>(?:mailto:)?([A-Z0-9._%-]+@[A-Z0-9.-]+\.[A-Z]{2,4})
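
A quick sketch of how that second pattern could be used with preg_match_all, assuming the same sample data. The delimiters, the i modifier, PREG_SET_ORDER and the variable names are just illustrative choices.

```php
<?php
// Two sample records from the question, joined with a newline.
$text = '<strong>name</strong><strong>address1</strong>mailto:blabla@email.com' . "\n"
      . '<strong>name</strong><strong>address2</strong>mailto:blabla@email.com';

// [^<>]* cannot cross a tag boundary, so only the <strong> element that
// sits directly before the email can match. The i modifier is an assumption.
$pattern = '~<strong>([^<>]*)</strong>(?:mailto:)?([A-Z0-9._%-]+@[A-Z0-9.-]+\.[A-Z]{2,4})~i';

// PREG_SET_ORDER groups the captures per match: $row[1] address, $row[2] email.
preg_match_all($pattern, $text, $rows, PREG_SET_ORDER);

foreach ($rows as $row) {
    printf("%s => %s\n", $row[1], $row[2]);
}
// address1 => blabla@email.com
// address2 => blabla@email.com
```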


Looking at your test data, here are the rules I infer: If...

  1. Name and Address are both wrapped in STRONG elements and the email follows immediately, AND
  2. The STRONG elements' attributes, the name and the addresses all have no angle brackets, AND
  3. The email address component always begins with mailto:, AND
  4. There are no other HTML elements within the two STRONG elements,

Then this tested code should do the trick:

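// $text holds the HTML to search; one sample record from the question is used here.
$text = '<strong>name</strong><strong>address1</strong>mailto:blabla@email.com';
$re = '%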
    # Capture name and address in <strong> element then email.
    <strong[^>]*>\s*([^<>]+)</strong\s*>\s*  # $1: Name.
    <strong[^>]*>\s*([^<>]+)</strong\s*>\s*  # $2: Address.
    (mailto:\S+)                             # $3: Email.
    %ix';
$count = preg_match_all($re, $text, $matches);
if ($count) {
    printf("%d matches found:\n", $count);
    print_r($matches);
    for ($i = 0; $i < $count; ++$i) {
        printf("Match %d: Name: \"%s\", Address: \"%s\", Email: \"%s\"\n",
            $i + 1, $matches[1][$i], $matches[2][$i], $matches[3][$i]);
    }
} else {
    printf("No matches found.\n");
}


Don't use regular expressions for parsing HTML.

See http://htmlparsing.com/php.html
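
For what it's worth, here is a rough sketch of the parser route using PHP's DOM extension (DOMDocument and DOMXPath). It assumes the markup looks like the sample in the question; the wrapping <div>, the XPath expression and the variable names are my own choices for illustration.

```php
<?php
// Two sample records from the question, concatenated.
$html = '<strong>name</strong><strong>address1</strong>mailto:blabla@email.com'
      . '<strong>name</strong><strong>address2</strong>mailto:blabla@email.com';

// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
// Assumption: the fragment is wrapped in a <div> so it has a single root.
$doc->loadHTML('<div>' . $html . '</div>');

$xpath = new DOMXPath($doc);
$addresses = [];

// Find every text node starting with "mailto:" and take the <strong>
// element that sits immediately before it.
foreach ($xpath->query('//text()[starts-with(normalize-space(.), "mailto:")]') as $node) {
    $prev = $node->previousSibling;
    if ($prev instanceof DOMElement && strtolower($prev->nodeName) === 'strong') {
        $addresses[] = [$prev->textContent, trim($node->nodeValue)];
    }
}

print_r($addresses);
```

Each entry pairs the address text with the mailto: string that follows it, without depending on how the tags are written.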
