Screen Scraping of Image Links in PHP
I have a website with many product pages, and each page contains a number of images in the same format across all pages. I want to screen scrape each page so I can retrieve the URL of every image on it. The idea is to build a gallery for each page made up of hotlinked images.
I know this can be done in PHP, but I am not sure how to scrape a page for multiple links. Any ideas?
I would recommend using a DOM parser, such as PHP's very own DOMDocument. Example:
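// Fetch the page source
$page = file_get_contents('http://example.com/images.php');
$doc = new DOMDocument();
// Suppress the warnings that real-world, non-validating HTML would otherwise trigger
libxml_use_internal_errors(true);
$doc->loadHTML($page);
// Print the src attribute of every <img> element
$images = $doc->getElementsByTagName('img');
foreach ($images as $image) {
    echo $image->getAttribute('src') . '<br />';
}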
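If you need the URLs as data rather than echoed output (for example, to build the gallery), the same idea can be wrapped in a function. This is only a minimal sketch: the function name extractImageUrls and the sample markup are made up for illustration, and it works on an HTML string you have already fetched.

```php
<?php
// Hypothetical helper: collect the src attribute of every <img> in an HTML string.
function extractImageUrls(string $html): array
{
    // loadHTML() rejects an empty string on newer PHP versions, so bail out early.
    if (trim($html) === '') {
        return [];
    }

    // Real-world markup is rarely valid; collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    $doc->loadHTML($html);

    // Discard the collected warnings and restore the previous setting.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $urls = [];
    foreach ($doc->getElementsByTagName('img') as $image) {
        $src = trim($image->getAttribute('src'));
        if ($src !== '') {
            $urls[] = $src; // skip <img> tags without a usable src
        }
    }

    // Drop duplicates and reindex so the result is a plain list.
    return array_values(array_unique($urls));
}

// Made-up sample markup standing in for a product page.
$sample = '<div><img src="/img/a.jpg"><img alt="no source"><img src="/img/b.jpg"><img src="/img/a.jpg"></div>';
print_r(extractImageUrls($sample));
```

Returning an array keeps the scraping step separate from however you decide to render the gallery.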
You can use a regular expression (regex) to go through the page source and parse all the IMG tags.
This regex will do the job quite nicely: <img[^>]+src="(.*?)"
How does this work?
// <img[^>]+src="(.*?)"
//
// Match the characters "<img" literally «<img»
// Match any character that is not a ">" «[^>]+»
// Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
// Match the characters "src="" literally «src="»
// Match the regular expression below and capture its match into backreference number 1 «(.*?)»
// Match any single character that is not a line break character «.*?»
// Between zero and unlimited times, as few times as possible, expanding as needed (lazy) «*?»
// Match the character """ literally «"»
Sample PHP code:
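// $subject holds the HTML source of the page
preg_match_all('/<img[^>]+src="(.*?)"/i', $subject, $result, PREG_PATTERN_ORDER);
for ($i = 0; $i < count($result[1]); $i++) {
    // The image URL (capture group 1) is in $result[1][$i];
    // $result[0][$i] holds the whole matched tag fragment
}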
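To see what ends up where, here is a small sketch that runs the same pattern over a made-up snippet. With PREG_PATTERN_ORDER, index 0 of the result holds the full matches and index 1 holds the first capture group. The second tag uses single quotes on purpose, to show that this particular pattern only picks up double-quoted src attributes.

```php
<?php
// Made-up sample markup; the second <img> quotes its src with single quotes.
$subject = '<p><img class="thumb" src="/img/a.jpg"> <img src=\'/img/b.jpg\' alt=""></p>';

preg_match_all('/<img[^>]+src="(.*?)"/i', $subject, $result, PREG_PATTERN_ORDER);

// $result[0] lists the full matches, $result[1] lists capture group 1 (the URLs).
$fullMatches = $result[0];
$urls = $result[1];

print_r($fullMatches);
print_r($urls);
```

Only the first image is found here, which is one reason a DOM parser tends to be more forgiving than a regex on arbitrary markup.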
You'll have to do a bit more work to resolve things like relative URLs.
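As a rough illustration of that extra work, here is one way relative URLs could be turned into absolute ones. The function name resolveUrl and the example URLs are hypothetical, and this is a simplified sketch rather than a full RFC 3986 resolver: it ignores credentials in the base URL, drops trailing slashes, and does not treat query strings or fragments specially.

```php
<?php
// Hypothetical helper: a simplified resolver, not a full RFC 3986 implementation.
function resolveUrl(string $base, string $relative): string
{
    // Already absolute (has a scheme such as http:, https: or data:).
    if (preg_match('#^[a-z][a-z0-9+.-]*:#i', $relative)) {
        return $relative;
    }

    $parts  = parse_url($base);
    $scheme = $parts['scheme'] ?? 'http';
    $origin = $scheme . '://' . ($parts['host'] ?? '');
    if (isset($parts['port'])) {
        $origin .= ':' . $parts['port'];
    }

    // Protocol-relative: //cdn.example.com/a.jpg
    if (strpos($relative, '//') === 0) {
        return $scheme . ':' . $relative;
    }

    if (strpos($relative, '/') === 0) {
        // Root-relative: /img/a.jpg
        $path = $relative;
    } else {
        // Document-relative: resolve against the directory of the base path.
        $basePath = $parts['path'] ?? '/';
        $dir = substr($basePath, 0, strrpos($basePath, '/') + 1);
        $path = $dir . $relative;
    }

    // Collapse "." and ".." segments.
    $segments = [];
    foreach (explode('/', $path) as $segment) {
        if ($segment === '..') {
            array_pop($segments);
        } elseif ($segment !== '.' && $segment !== '') {
            $segments[] = $segment;
        }
    }

    return $origin . '/' . implode('/', $segments);
}

// Example page URL and src values, all made up for illustration.
$pageUrl = 'http://example.com/products/page1.php';
foreach (['images/a.jpg', '../img/b.jpg', '/img/c.jpg', '//cdn.example.com/d.jpg'] as $src) {
    echo resolveUrl($pageUrl, $src), "\n";
}
```

You would pass the URL of the page you scraped as the base and each src value you extracted as the second argument.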
I really like PHP Simple HTML DOM Parser for things like this. An example of grabbing images is right there on the front page:
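// Load the library (adjust the path to wherever simple_html_dom.php lives)
include 'simple_html_dom.php';
// Create DOM from URL or file
$html = file_get_html('http://www.google.com/');
// Find all images
foreach ($html->find('img') as $element) {
    echo $element->src . '<br>';
}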
You can use this to scrape pages:
http://simplehtmldom.sourceforge.net/
Note that it requires PHP 5 or later.