
Using preg_match_all to get the name of an image

After fetching an external page with cURL, I have its full source code, which contains something like this (the part I'm interested in):

   (page...)<td valign='top' class='rdBot' align='center'><img src="/images/buy_tickets.gif" border="0" alt="T"></td> (page...)

I'm using preg_match_all, and I want to get only "buy_tickets.gif":

$pattern_before = "<td valign='top' class='rdBot' align='center'>";
$pattern_after = "</td>";
$pattern = '#'.$pattern_before.'(.*?)'.$pattern_after.'#si';

preg_match_all($pattern, $buffer, $matches, PREG_SET_ORDER);

Everything works fine so far, but the problem is that the external page sometimes changes and the image I'm looking for ends up inside a link:

(page...)<td valign='top' class='rdBot' align='center'><a href="blaa" title="ble"><img src="/images/buy_tickets.gif" border="0" alt="T"></a></td> (page...)

I don't know how to make my code work in both cases (not just when the image has no link).

I hope that makes sense.

Thanks in advance.


Don't use regex to parse HTML; use PHP's DOM extension instead. Try this:

$doc = new DOMDocument;

@$doc->loadHTMLFile( 'http://ventas.entradasmonumental.com/eventperformances.asp?evt=18' ); // Using the @ operator to hide parse errors

$xpath  = new DOMXPath( $doc );

$img = $xpath->query( '//td[@class="BrdBot"][@align="center"][1]//img[1]')->item( 0 ); // Xpath->query returns a 'DOMNodeList', get the first item which is a 'DOMElement' (or null)

$imgSrc = $img->getAttribute( 'src' );

$imgSrcInfo = pathInfo( $imgSrc );

$imgFilename = $imgSrcInfo['basename']; // All you need
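To try this without requesting the remote page, here is a minimal sketch that runs the same XPath idea against the two snippets from the question. The helper name firstCellImageName is made up for illustration, and the class name rdBot is taken from the question's snippet, so adjust it if the live page uses something else.

```php
<?php
// Hypothetical helper (not part of the original answer): returns the file name
// of the first <img> inside a <td class="rdBot">, or null when none is found.
function firstCellImageName(string $html): ?string
{
    $doc = new DOMDocument();

    // Collect parse warnings internally instead of silencing them with @.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);

    // "//img" matches at any depth below the cell, so an <a> wrapper is irrelevant.
    $img = $xpath->query('//td[@class="rdBot"]//img')->item(0);
    if (!$img instanceof DOMElement) {
        return null;
    }

    return basename($img->getAttribute('src'));
}

// The two variants described in the question, wrapped in a minimal table.
$plain  = "<table><tr><td valign='top' class='rdBot' align='center'>"
        . "<img src=\"/images/buy_tickets.gif\" border=\"0\" alt=\"T\"></td></tr></table>";
$linked = "<table><tr><td valign='top' class='rdBot' align='center'>"
        . "<a href=\"blaa\" title=\"ble\"><img src=\"/images/buy_tickets.gif\" border=\"0\" alt=\"T\"></a>"
        . "</td></tr></table>";

echo firstCellImageName($plain), "\n";
echo firstCellImageName($linked), "\n";
```

Both calls print buy_tickets.gif, because the descendant axis in the XPath expression does not care whether the image sits directly in the cell or inside a link.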


You're going to get lots of advice not to use regex for pulling stuff out of HTML code.

There are times when it's appropriate to use regex for this kind of thing, and I don't always agree with the somewhat rigid advice given on the subject here (and elsewhere). However in this case, I would say that regex is not the appropriate solution for you.

The problem with using regex for searching for things in HTML code is exactly the problem you've encountered -- HTML code can vary wildly, making any regex virtually impossible to get right.

It is just about possible to write a regex for your situation, but it will be insanely complex and very brittle -- i.e. prone to failing if the HTML code is even slightly outside the parameters you expect.

Contrast this with the recommended solution, which is to use a DOM parser. Load the HTML code into a DOM parser, and you will immediately have an object structure which you can query for individual elements and attributes.

The details you've given make it almost a no-brainer to go with this rather than a regex.

PHP has a built-in DOM parser, which you can call as follows:

$mydom = new DOMDocument;
$mydom->loadHTMLFile("http://....");

You can then use XPath to search the DOM for your specific element or attribute that you want:

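$myxpath = new DOMXPath($mydom);
// XPath positions start at 1, and the class value is case-sensitive.
// The query returns a DOMNodeList of matching src attribute nodes.
$myattr = $myxpath->query('//td[@class="rdBot"]//img[1]/@src');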

Hope that helps.


function GetFilename($file) {
    $filename = substr($file, strrpos($file,'/')+1,strlen($file)-strrpos($file,'/'));
    return $filename;
}
echo GetFilename('/images/buy_tickets.gif');

This will output buy_tickets.gif


Do you only need images inside of the "td" tags?

$regex='/<img src="\/images\/([^"]*)"[^>]*>/im';

edit:

to grab the specific image this should work:

$regex='/<td valign=\'top\' class=\'rdBot\' align=\'center\'>.*?src="\/images\/([^"]*)".*?<\/td>/is';


Parsing HTML with Regex is not recommended, as has been mentioned by several posters.

However, if the path of your images always follows the pattern src="/images/name.gif", you can easily extract it in Regex:

$pattern = <<<EOD
#src\s*=\s*['"]/images/(.*?)["']#
EOD;

If you are sure that the images always follow the path "/images/name.ext" and you don't care where the image is located in the page, this will do the job. If you have more detailed requirements (such as matching only within a specific class), forget regex; it's not the right tool for the job.
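As a rough illustration of how that pattern behaves with preg_match_all, the sketch below runs it over a small made-up buffer. The sample markup and the sold_out.gif file name are invented for the example, the i modifier is added here, and the pattern is written as a nowdoc so PHP leaves the quotes and backslashes untouched.

```php
<?php
// Same pattern as above, stored in a nowdoc so no escaping is interpreted by PHP.
$pattern = <<<'EOD'
#src\s*=\s*['"]/images/(.*?)["']#i
EOD;

// Made-up sample: one bare image and one image wrapped in a link.
$buffer = '<td><img src="/images/buy_tickets.gif" alt="T"></td>'
        . '<td><a href="blaa"><img src="/images/sold_out.gif" alt="S"></a></td>';

preg_match_all($pattern, $buffer, $matches);

// $matches[1] holds the first capture group of every match, i.e. the file names.
$names = $matches[1];
print_r($names);
```

This collects buy_tickets.gif and sold_out.gif, which also shows the limitation: every /images/ source on the page is returned, whichever cell it is in.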


I just read in your comments that you need to match within a specific tag. Use a parser, it will save you untold headaches.

If you still want to go through regex, try this:

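(?<=<td .*?class\s*=\s*['"]rdBot['"][^<>]*?>.*?)(?<!</td>.*)<img [^<>]*src\s*=\s*["']/images/(.*?)["']

Apply it with the case-insensitive option. It works in C#, but it relies on variable-length lookbehind, which PHP's PCRE engine does not support, so it would need restructuring before it can be used with preg_match_all.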
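For PHP specifically, one possible workaround is to split the job into two simpler patterns rather than one lookbehind: first isolate each rdBot cell, then search for the image inside it. This is only a sketch of a swapped-in technique, not the pattern above. The function name imageNamesInRdBotCells and the logo.gif sample are hypothetical, and it is still brittle (nested tables would confuse it), so a DOM parser remains the safer choice.

```php
<?php
// Hypothetical helper: two plain patterns instead of a variable-length lookbehind.
function imageNamesInRdBotCells(string $html): array
{
    $names = [];

    // Step 1: capture the contents of every <td> whose class attribute is rdBot.
    preg_match_all(
        '#<td\b[^>]*class\s*=\s*[\'"]rdBot[\'"][^>]*>(.*?)</td>#is',
        $html,
        $cells
    );

    foreach ($cells[1] as $cell) {
        // Step 2: look for an /images/ source inside that cell only.
        if (preg_match('#<img\b[^>]*src\s*=\s*[\'"]/images/([^\'"]*)[\'"]#i', $cell, $m)) {
            $names[] = $m[1];
        }
    }

    return $names;
}

// Made-up sample: the linked image from the question plus an unrelated cell.
$sample = "<td valign='top' class='rdBot' align='center'>"
        . "<a href=\"blaa\" title=\"ble\"><img src=\"/images/buy_tickets.gif\" border=\"0\" alt=\"T\"></a></td>"
        . "<td class='other'><img src=\"/images/logo.gif\"></td>";

print_r(imageNamesInRdBotCells($sample));
```

Only buy_tickets.gif is returned for this sample, since the second cell does not carry the rdBot class.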
