Image tag scraper regular expression
I'm really REALLY bad at regular expressions. It just hasn't clicked yet. I'm trying to make a small application that extracts the src, width, and height attributes of every image tag on a page. This is what I have so far:
<?php
function print_links ($url)
{
$fp = fopen($url, "r") or die("Could not contact $url");
$page_contents = "";
while ($new_text = fread($fp, 100)) {
$page_contents .= $new_text;
}
$match_result =
preg_match_all( '/<img.*src=[\"\'](.*)[\"\'].*width=(\d+).*height=(\d+).*/>/i',
$page_contents,
$match_array,
PREG_SET_ORDER);
echo "number matched is: $match_result<br><br> ";
print_r($match_array);
foreach ($match_array as $entry) {
$tag = $entry[0];
$src = $entry[1];
$width = $entry[2];
$height = $entry[3];
print (" <b>src</b>: $src;
<b>width</b>: $width<br />
<b>height</b>: $height<br />
<b>tag</b>: $tag<br />"
);
}
}
print_links ("http://www.drudgereport.com/");
?>
but I get this little error:
Warning: preg_match_all(): Unknown modifier '>' in C:\Apache2.2\htdocs\it302\regex\regex.php on line 17 number matched is:
I'm not sure where I went wrong in my regexp. I've tried multiple things but have ended up just as confused.
Any suggestions?
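In your regex the trailing .*/> is wrong. The / is also your pattern delimiter, so an unescaped / ends the pattern and PHP reads the > after it as a modifier, hence the warning. Either drop the / there: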
/<img.*src=[\"\'](.*)[\"\'].*width=(\d+).*height=(\d+).*>/i
or escape it and make it optional with \/? so that both <img ...> and <img ... /> match:
/<img.*src=[\"\'](.*)[\"\'].*width=(\d+).*height=(\d+).*\/?>/i
But this regex only works if src, width, and height appear in exactly this order inside the img tag. Real pages also commonly quote width and height and may use percentages, e.g. width="50%", which width=(\d+) does not match. The greedy .* can also run across several tags on the same line.
These are all reasons (among many more) why you should not use regex to parse HTML.
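If you stay with a regex anyway, a minimal sketch of the delimiter fix might look like the following. It uses # as the delimiter so / needs no escaping, and [^>] with lazy quantifiers so a match cannot spill into the next tag. The sample markup is made up for illustration, and the pattern still assumes unquoted numeric width/height in src, width, height order.

```php
<?php
// Sample markup, invented for this example.
$html = '<p><img class="a" src="one.png" width=10 height=20 /></p>'
      . '<p><img src=\'two.png\' width=30 height=40></p>';

// '#' is the delimiter, so the "/" of a self-closing tag is an ordinary character.
// [^>]*? keeps every part of the match inside a single tag.
$pattern = '#<img\b[^>]*?src=["\']([^"\']*)["\'][^>]*?width=(\d+)[^>]*?height=(\d+)[^>]*>#i';

$count = preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);

foreach ($matches as $entry) {
    // $entry[0] is the whole tag; [1], [2], [3] are src, width, height.
    echo "src: {$entry[1]}, width: {$entry[2]}, height: {$entry[3]}\n";
}
```

Any character that does not occur in the pattern works as a delimiter; # and ~ are common choices for HTML-related patterns.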
Do not use regex for this. Especially if you are REALLY bad :)
http://simplehtmldom.sourceforge.net/
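// Assumes simple_html_dom.php from the link above sits next to this script.
require_once 'simple_html_dom.php';
$html = file_get_html("http://www.drudgereport.com/");
foreach ($html->find('img') as $element) {
    $src = $element->src;
    $width = $element->width;
    $height = $element->height;
    $tag = $element->outertext; // the full <img ...> markup
    print (" <b>src</b>: $src;
    <b>width</b>: $width<br />
    <b>height</b>: $height<br />
    <b>tag</b>: $tag<br />"
    );
}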
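If you would rather not add a third-party library, the same idea can be sketched with PHP's bundled DOM extension. This is a different technique from the Simple HTML DOM answer above, not part of it; the sample markup and the $images array are made up for illustration.

```php
<?php
// Sample markup, invented for this example; real code would pass the fetched page.
$html = '<p><img src="a.png" width="10" height="20"><img src="b.png"></p>';

// Collect parser warnings internally instead of printing them,
// since real-world pages are rarely valid markup.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$images = [];
foreach ($doc->getElementsByTagName('img') as $img) {
    // getAttribute() returns an empty string when the attribute is missing.
    $images[] = [
        'src'    => $img->getAttribute('src'),
        'width'  => $img->getAttribute('width'),
        'height' => $img->getAttribute('height'),
    ];
}

print_r($images);
```

Attribute order, quoting style, and units no longer matter here, because the parser hands back each attribute value as a plain string.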