Why would a working regex fail using PHP's preg_match_all()?
I have the following regex in a PHP script:
$total_matches = preg_match_all('{
<a\shref="
(?<link>[^"]+)
"(?:(?!src=).)+src="
(?<image>[^"]+)
(?:(?!designer-name">).)+designer-name">
(?<brand>[^<]+)
(?:(?!title=).)+title="
(?<title>((?!">).)+)
(?:(?!"price">).)+"price">\$
(?<price>[\d.,]+)
}xsi',$output,$all_matches,PREG_SET_ORDER);
This regex seems to work fine when parsing the following, either via PHP or in the tester at regexr.com with the same options set (case-insensitive, extended, dot matches line breaks):
<a href="http://www.mytheresa.com/us_en/dordogne-120-sandals.html" title=
"DORDOGNE 120 PLATEAU SANDALEN" class="product-image">
<img class="image1st" src= "http://mytheresaimages.s3.amazonaws.com/catalog/product/cache/common/product_114114/small_ image/230x260/9df78eab33525d08d6e5fb8d27136e95/P/0/P00027794-DORDOGNE-120-PLATEAU-SANDALEN-STANDARD.jpg"
width="230" height="260"
alt= "Christian Louboutin - DORDOGNE 120 SANDALS - mytheresa.com GmbH"
title= "Christian Louboutin - DORDOGNE 120 SANDALS - mytheresa.com GmbH" />
<img class="image2nd" src= "http://mytheresaimages.s3.amazonaws.com/catalog/product/cache/common/product_114114/image/230x260/9df78eab33525d08d6e5fb8d27136e95/P/0/P00027794-DORDOGNE-120-PLATEAU-SANDALEN-DETAIL_2.jpg"
width="230" height="260" alt=
"Christian Louboutin - DORDOGNE 120 SANDALS - mytheresa.com GmbH" title=
"Christian Louboutin - DORDOGNE 120 SANDALS - mytheresa.com GmbH" /> <span class=
"availability"><strong>available sizes</strong><br /></span></a>
<div style="margin-left: 2em" class="available-sizes">
<h2 class="designer-name">Christian Louboutin</h2>
<div class="product-buttons">
<div class="product-button">
NEW ARRIVAL
</div>
<div class="clearer"></div>
</div>
<h3 class="product-name"><a href=
"http://www.mytheresa.com/us_en/dordogne-120-sandals.html" title=
"DORDOGNE 120 SANDALS">DORDOGNE 120 SANDALS</a></h3>
<div class="price-box">
<span class="regular-price" id="product-price-114114"><span class=
"price">$805.00</span></span>
</div>
If I try to parse several matches in a row, it also works fine. However, when I try parsing the full page these matches come from (I have permission to parse it):
http://www.mytheresa.com/us_en/new-arrivals/what-s-new-this-week-1.html?limit=12
the regex fails (I actually get a 500 error). I've tried increasing the backtrack limit using
ini_set('pcre.backtrack_limit',100000000);
ini_set('pcre.recursion_limit',100000000);
but this does not solve the problem. What am I doing wrong that makes the regex fail in PHP when it appears to be valid and matches code on the relevant page? Fiddling with it suggests that the negative lookaheads, in conjunction with the page length, are causing the problem, but I'm not sure how I got them wrong. I am running PHP 5.2.17.
You have made one of the classic blunders! Don't use regex to parse HTML! It breaks regex! (This is right after "Never get involved in a land war in Asia" and "Never go in against a Sicilian when death is on the line.").
You should be using SimpleXML or DOMDocument to parse this:
$dom = new DOMDocument();
// loadHTMLFile() fetches and parses a file or URL; loadHTML() expects the markup itself.
// The @ suppresses the warnings libxml raises for invalid HTML.
@$dom->loadHTMLFile( 'http://www.mytheresa.com/us_en/new-arrivals/'.
'what-s-new-this-week-1.html?limit=12' );
$path = new DOMXPath( $dom );
// this query is based on the link you provided, not your regex
$nodes = $path->query( '//ul[@class="products-grid first odd"]/li' );
foreach( $nodes as $node )
{
// the first anchor tag in each item is the one you're looking for initially.
$anchor = $node->getElementsByTagName( 'a' )->item( 0 );
if( $anchor ) echo $anchor->getAttribute( 'href' ), "\n";
// pull the other fields out of $node the same way
}
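To make the "other fields" part concrete, here is a minimal sketch that pulls all five values out of one product with XPath. The inline fragment is a shortened, hypothetical stand-in for the real page (the image URL and the ul/li wrapper are assumptions), so the queries will likely need adjusting against the live markup.

```php
<?php
// Hypothetical fragment modeled on the markup in the question (shortened).
$html = '<ul><li>
<a href="http://www.mytheresa.com/us_en/dordogne-120-sandals.html" class="product-image">
<img class="image1st" src="http://example.com/P00027794.jpg" alt="x" /></a>
<h2 class="designer-name">Christian Louboutin</h2>
<h3 class="product-name"><a href="#" title="DORDOGNE 120 SANDALS">DORDOGNE 120 SANDALS</a></h3>
<span class="price">$805.00</span>
</li></ul>';

$dom = new DOMDocument();
@$dom->loadHTML($html);            // @ hides warnings about sloppy markup
$xpath = new DOMXPath($dom);

$products = array();
foreach ($xpath->query('//li') as $li) {
    // string(...) returns the text of the first matching node, relative to $li
    $products[] = array(
        'link'  => $xpath->evaluate('string(.//a/@href)', $li),
        'image' => $xpath->evaluate('string(.//img/@src)', $li),
        'brand' => $xpath->evaluate('string(.//h2[@class="designer-name"])', $li),
        'title' => $xpath->evaluate('string(.//h3[@class="product-name"]/a/@title)', $li),
        'price' => ltrim($xpath->evaluate('string(.//span[@class="price"])', $li), '$'),
    );
}
print_r($products);
```

Because the parser does the tag matching, the page length stops mattering and there is no backtracking to tune.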
Those negative lookaheads are clever, but slightly too clever.
And I concur: you used too many of them not to get side effects.
I can't see which one is running wild right now, but repeating a .
behind a lookahead like that is bound to give you greediness and backtracking problems on a long page.
This one, for example, is certainly unnecessary:
title="
(?<title>((?!">).)+)
as you could have written it as
title="(?<title>.*?)">
...and there are more like it. I'd change them all.
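As a rough sketch of what "change them all" could look like, here is the whole pattern with every lookahead group swapped for a lazy .*? and a preg_last_error() check added. The subject string is a shortened, hypothetical fragment, not the real page, so treat this as a starting point to test against the full HTML rather than a drop-in fix. For what it's worth, the PHP manual warns that a very high pcre.recursion_limit can exhaust the process stack and crash PHP, which would be consistent with the 500 error.

```php
<?php
// Hypothetical, shortened fragment standing in for the page source.
$output = '<a href="http://www.mytheresa.com/us_en/dordogne-120-sandals.html" class="product-image">'
        . '<img class="image1st" src="http://example.com/P00027794.jpg" width="230" /></a>' . "\n"
        . '<h2 class="designer-name">Christian Louboutin</h2>' . "\n"
        . '<h3 class="product-name"><a href="#" title="DORDOGNE 120 SANDALS">DORDOGNE 120 SANDALS</a></h3>' . "\n"
        . '<span class="price">$805.00</span>';

// Same named groups as the question; lazy quantifiers replace the lookahead groups.
$pattern = '{
    <a\s+href="(?<link>[^"]+)"
    .*?src="(?<image>[^"]+)
    .*?designer-name">(?<brand>[^<]+)
    .*?title="(?<title>.*?)">
    .*?"price">\$(?<price>[\d.,]+)
}xsi';

$total_matches = preg_match_all($pattern, $output, $all_matches, PREG_SET_ORDER);

// preg_match_all() returns false on failure; preg_last_error() says why.
if ($total_matches === false || preg_last_error() !== PREG_NO_ERROR) {
    echo 'PCRE error code: ', preg_last_error(), "\n";
} else {
    foreach ($all_matches as $m) {
        echo $m['brand'], ' | ', $m['title'], ' | ', $m['price'], "\n";
    }
}
```

Checking preg_last_error() with the ini limits left at their defaults should at least turn a silent failure into a readable error code.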
In general, regex debugging means rephrasing the pattern again and again, using different constructs, until you find the right balance between functionality and maintainability.
Another thing: I would use <a\s+ instead of <a\s, just slightly more flexible.
Stay slightly flexible, it pays.
Also: title= could show up with whitespace around the equals sign, so match it as title\s*=\s*
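A small sketch of why the extra flexibility pays, using a made-up fragment that has the same kind of spacing as the sample in the question (a line break after title= and a space after src=). The two patterns here are simplified illustrations, not the full regex from the question.

```php
<?php
// Made-up fragment: two spaces after <a, a line break after title=, a space after src=.
$html = "<a  href=\"/item.html\" title=\n\"DORDOGNE 120 SANDALS\"><img src= \"/img/a.jpg\" /></a>";

// Strict: exactly one whitespace character after <a, no whitespace around =.
$strict   = '{<a\shref="(?<link>[^"]+)".*?title="(?<title>[^"]+)"}si';
// Flexible: any run of whitespace after <a, optional whitespace around =.
$flexible = '{<a\s+href\s*=\s*"(?<link>[^"]+)".*?title\s*=\s*"(?<title>[^"]+)"}si';

$strict_hits   = preg_match($strict, $html);
$flexible_hits = preg_match($flexible, $html, $m);

echo "strict: $strict_hits, flexible: $flexible_hits\n";
if ($flexible_hits === 1) {
    echo $m['link'], ' | ', $m['title'], "\n";
}
```

The strict pattern misses this fragment entirely, while the flexible one still captures both values.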