why would a functional regex fail using PHP's preg_match_all()?

2023-03-27 13:57

I have the following regex in a PHP script

$total_matches = preg_match_all('{

        <a\shref="
        (?<link>[^"]+)
        "(?:(?!src=).)+src="
        (?<image>[^"]+)
        (?:(?!designer-name">).)+designer-name">
        (?<brand>[^<]+)
        (?:(?!title=).)+title="
        (?<title>((?!">).)+)
        (?:(?!"price">).)+"price">\$
        (?<price>[\d.,]+)

}xsi',$output,$all_matches,PREG_SET_ORDER);

This regex seems to work fine when parsing the following, either via PHP or using the parser at regexr.com (with the same options set: case-insensitive, extended, and dotall):

<a href="http://www.mytheresa.com/us_en/dordogne-120-sandals.html" title=
  "DORDOGNE 120 PLATEAU SANDALEN" class="product-image">
  <img class="image1st" src= "http://mytheresaimages.s3.amazonaws.com/catalog/product/cache/common/product_114114/small_  image/230x260/9df78eab33525d08d6e5fb8d27136e95/P/0/P00027794-DORDOGNE-120-PLATEAU-SANDALEN-STANDARD.jpg"
   width="230" height="260" 
   alt=   "Christian Louboutin - DORDOGNE 120 SANDALS - mytheresa.com GmbH" 
   title= "Christian Louboutin - DORDOGNE 120 SANDALS - mytheresa.com GmbH" /> 
 <img class="image2nd" src=  "http://mytheresaimages.s3.amazonaws.com/catalog/product/cache/common/product_114114/image/230x260/9df78eab33525d08d6e5fb8d27136e95/P/0/P00027794-DORDOGNE-120-PLATEAU-SANDALEN-DETAIL_2.jpg"
width="230" height="260" alt=
"Christian Louboutin - DORDOGNE 120 SANDALS - mytheresa.com GmbH" title=
"Christian Louboutin - DORDOGNE 120 SANDALS - mytheresa.com GmbH" /> <span class=
"availability"><strong>available sizes</strong><br /></span></a>

<div style="margin-left: 2em" class="available-sizes">
<h2 class="designer-name">Christian Louboutin</h2>

<div class="product-buttons">
  <div class="product-button">
    NEW ARRIVAL
  </div>

  <div class="clearer"></div>
</div>

<h3 class="product-name"><a href=
"http://www.mytheresa.com/us_en/dordogne-120-sandals.html" title=
"DORDOGNE 120 SANDALS">DORDOGNE 120 SANDALS</a></h3>

<div class="price-box">
  <span class="regular-price" id="product-price-114114"><span class=
  "price">$805.00</span></span>
</div>

Parsing several such matches in a row also works fine. However, when I try parsing the full page these matches come from (I have permission to parse it):

http://www.mytheresa.com/us_en/new-arrivals/what-s-new-this-week-1.html?limit=12

the regex fails (I actually get a 500 error). I've tried increasing the backtrack and recursion limits using:

ini_set('pcre.backtrack_limit',100000000);
ini_set('pcre.recursion_limit',100000000);

but this does not solve the problem. What am I doing wrong that makes the regex fail via PHP, when it seems to be valid and matches the markup on the relevant page? Fiddling with it suggests the negative lookaheads (in conjunction with the page length) are causing problems, but I'm not sure how I screwed them up. I am running PHP 5.2.17.

You have made one of the classic blunders! Don't use regex to parse HTML! It breaks regex! (This is right after "Never get involved in a land war in Asia" and "Never go in against a Sicilian when death is on the line.").

You should be using SimpleXML or DomDocument to parse this:

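// real-world HTML is rarely well-formed; collect parser warnings quietly
libxml_use_internal_errors( true );

$dom = new DOMDocument();
// loadHTMLFile() reads from a path or URL; loadHTML() expects the markup itself
$dom->loadHTMLFile( 'http://www.mytheresa.com/us_en/new-arrivals/'.
                    'what-s-new-this-week-1.html?limit=12' );

$path = new DOMXPath( $dom );
// this query is based on the link you provided, not your regex
// (attributes need the @ prefix in XPath)
$nodes = $path->query( '//ul[@class="products-grid first odd"]/li' );
foreach( $nodes as $node )
{
    // the first anchor inside the item is the one you're looking for initially.
    $anchor = $path->query( './/a', $node )->item( 0 );
    if ( $anchor !== null )
    {
        echo $anchor->getAttribute( "href" ), "\n";
    }
    // query the other elements the same way
}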
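The same idea can be tried without touching the network. Below is a minimal sketch, assuming a trimmed-down, hypothetical copy of the product markup (the example.com image URL and the exact nesting are made up; the class names come from the sample in the question). It pulls out the same five fields the regex was after, using relative XPath expressions per list item.

```php
<?php
// Hypothetical, shortened markup modeled on the sample in the question.
$html = '
<ul class="products-grid first odd">
  <li>
    <a href="http://www.mytheresa.com/us_en/dordogne-120-sandals.html"
       title="DORDOGNE 120 PLATEAU SANDALEN" class="product-image">
      <img class="image1st" src="http://example.com/small.jpg" />
    </a>
    <h2 class="designer-name">Christian Louboutin</h2>
    <h3 class="product-name"><a href="http://www.mytheresa.com/us_en/dordogne-120-sandals.html"
        title="DORDOGNE 120 SANDALS">DORDOGNE 120 SANDALS</a></h3>
    <span class="price">$805.00</span>
  </li>
</ul>';

libxml_use_internal_errors(true);   // keep parser warnings out of the output
$dom = new DOMDocument();
$dom->loadHTML($html);              // loadHTML() takes markup, not a URL
$xpath = new DOMXPath($dom);

$products = array();
foreach ($xpath->query('//ul[contains(@class, "products-grid")]/li') as $item) {
    // string(...) yields '' instead of an error when a node is missing.
    $products[] = array(
        'link'  => $xpath->evaluate('string(.//a[@class="product-image"]/@href)', $item),
        'image' => $xpath->evaluate('string(.//img/@src)', $item),
        'brand' => trim($xpath->evaluate('string(.//*[@class="designer-name"])', $item)),
        'title' => $xpath->evaluate('string(.//*[@class="product-name"]/a/@title)', $item),
        'price' => ltrim(trim($xpath->evaluate('string(.//span[@class="price"])', $item)), '$'),
    );
}

print_r($products);
```

Because each field has its own expression, a missing element only leaves one field empty instead of making the whole match fail.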

Those negative lookaheads are clever, but then... slightly too clever.

And I concur: you used too many of them to avoid side effects.

I can't see which one is running wild right now, but repeating a . like that is bound to give you greediness problems.

This one, for example, is certainly unnecessary:

title="
(?<title>((?!">).)+)

as you could have written it

title="(?<title>.*?)">

...there are more like it. I'd change them all.
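To make "change them all" concrete, here is a sketch of the question's pattern with every (?:(?!X).)+ swapped for a lazy .*? and the negated character classes left alone. The $output string is a hypothetical, shortened stand-in for the real page, so treat this as something to adapt rather than a drop-in fix; lazy quantifiers still backtrack when a later part of the pattern fails.

```php
<?php
// Hypothetical, shortened stand-in for the fetched page.
$output = '<a href="http://example.com/p1.html" title="P1" class="product-image">'
        . '<img class="image1st" src="http://example.com/p1.jpg" /></a>'
        . '<h2 class="designer-name">Christian Louboutin</h2>'
        . '<h3 class="product-name"><a href="http://example.com/p1.html" '
        . 'title="DORDOGNE 120 SANDALS">DORDOGNE 120 SANDALS</a></h3>'
        . '<span class="price">$805.00</span>';

// Same structure as the original; only the tempered lookaheads are replaced.
$pattern = '{
    <a\s+href="
    (?<link>[^"]+)
    ".*?src="
    (?<image>[^"]+)
    .*?designer-name">
    (?<brand>[^<]+)
    .*?title="
    (?<title>.*?)">
    .*?"price">\$
    (?<price>[\d.,]+)
}xsi';

$total_matches = preg_match_all($pattern, $output, $all_matches, PREG_SET_ORDER);

foreach ($all_matches as $match) {
    echo $match['brand'], ' | ', $match['title'], ' | ', $match['price'], "\n";
}
```

With PREG_SET_ORDER, each entry of $all_matches is one product, and the named groups are available as $match['link'], $match['image'] and so on.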

In general, debugging a regex means rephrasing it again and again, using different constructs until you find the right balance between functionality and maintainability.
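While rephrasing, it helps to know whether the engine gave up or simply found nothing. preg_last_error() (available since PHP 5.2.0) reports that. The sketch below uses a deliberately tiny backtrack limit and a pattern that is known to backtrack heavily; both are assumptions chosen only to make the failure reproducible, and JIT is switched off because it counts work differently. Note also that the PHP manual warns that a very high pcre.recursion_limit can exhaust the process stack and crash PHP, which would look like a 500 error rather than a failed match.

```php
<?php
// Assumptions for the demo: JIT off and an artificially low limit.
ini_set('pcre.jit', '0');
ini_set('pcre.backtrack_limit', '1000');

// Nested repetition with no "!" or "?" in the subject: lots of backtracking.
$result = preg_match('/(?:\D+|<\d+>)*[!?]/', 'foobar foobar foobar');
$error  = preg_last_error();

if ($result === false && $error === PREG_BACKTRACK_LIMIT_ERROR) {
    echo "Backtrack limit exhausted: the engine gave up, this is not a plain non-match\n";
} elseif ($result === 0) {
    echo "Pattern ran to completion and found no match\n";
} else {
    echo "Matched\n";
}
```

Checking the return value with === false matters here, because 0 (no match) and false (error) are otherwise easy to confuse.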

Another thing: I would use <a\s+ instead of <a\s, just slightly more flexible.
Stay slightly flexible, it pays.

Also: title= may appear with whitespace around the equals sign, so match it as title\s*=\s*
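A small sketch of both suggestions together, using two hypothetical snippets that differ only in spacing (the second one mimics the line-wrapped attributes in the sample from the question):

```php
<?php
// Hypothetical snippets: the same tag written with different spacing.
$samples = array(
    '<a href="x.html" title="A">',
    "<a  \n href = \"x.html\" title=\n  \"A\">",
);

// \s+ after the tag name, \s*=\s* around every equals sign.
$pattern = '{<a\s+href\s*=\s*"(?<link>[^"]+)"\s+title\s*=\s*"(?<title>[^"]*)"}si';

$links = array();
foreach ($samples as $sample) {
    if (preg_match($pattern, $sample, $m)) {
        $links[] = $m['link'] . '|' . $m['title'];
    }
}

print_r($links);
```

A pattern written with a single \s and a bare = would only match the first snippet.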
