How can I extract the links from a page of HTML?
I am trying to download a file in PHP:
$file = file_get_contents($url);
How should I download the contents of the links within the file at $url?
This requires parsing HTML, which is quite a challenge in PHP. To save yourself a lot of trouble, use an HTML parsing library such as phpQuery (http://code.google.com/p/phpquery/). Then you can select all the links with pq('a'), loop through them getting their href attribute values, convert each one from relative to absolute, and run file_get_contents on the resulting URL, as in the sketch below. Hopefully these pointers will get you started.
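A minimal sketch of that approach, assuming phpQuery is installed; note that resolveUrl() is a hypothetical helper you would still have to write yourself for the relative-to-absolute conversion:

require_once 'phpQuery/phpQuery.php';

$html = file_get_contents($url);
phpQuery::newDocumentHTML($html);

foreach (pq('a') as $anchor) {
    $href = pq($anchor)->attr('href');
    if ($href === null || $href === '') {
        continue;
    }
    // resolveUrl() is a hypothetical helper: turn a relative href
    // into an absolute URL based on the page's own URL
    $absolute = resolveUrl($url, $href);
    $contents = file_get_contents($absolute);
    // ... process $contents here
}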
So you want to find all URLs in a given file? RegEx to the rescue... and some sample code below which should do what you want:
$file = file_get_contents($url);
if (!$file) return;

// extract the hyperlinks from the file via regex
preg_match_all("/http:\/\/[A-Z0-9_\-\.\/\?\#\=\&]*/i", $file, $urlmatches);

// if there are any URLs to be found
if (!empty($urlmatches[0])) {
    $urlmatches = $urlmatches[0];

    // count the number of URLs found
    $numberofmatches = count($urlmatches);
    echo "Found $numberofmatches URLs in $url\n";

    // write all found URLs line by line
    foreach ($urlmatches as $urlmatch) {
        echo "URL: $urlmatch...\n";
    }
}
EDIT: If I understand your question correctly, you now want to download the contents of the found URLs. You would do that inside the foreach loop by calling file_get_contents for each URL, but you probably want to do some filtering beforehand (e.g. skip images), as in the sketch below.
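A sketch of that loop, with a crude extension-based filter (the list of skipped extensions is just an example):

foreach ($urlmatches as $urlmatch) {
    // skip URLs that look like images or other static assets
    if (preg_match("/\.(jpe?g|png|gif|css|js)$/i", $urlmatch)) {
        continue;
    }
    $contents = file_get_contents($urlmatch);
    if ($contents !== false) {
        echo "Downloaded " . strlen($contents) . " bytes from $urlmatch\n";
    }
}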
You'll need to parse the resulting HTML string, either manually or via a third-party library.
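For the manual route, a minimal sketch using PHP's built-in DOMDocument (assuming the HTML is already in $file):

$dom = new DOMDocument();
// suppress warnings from malformed real-world HTML
@$dom->loadHTML($file);

foreach ($dom->getElementsByTagName('a') as $anchor) {
    $href = $anchor->getAttribute('href');
    if ($href !== '') {
        echo "URL: $href\n";
    }
}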
HTML Scraping in Php