How can I extract the links from a page of HTML?

I am trying to download a file in PHP.

$file = file_get_contents($url);

How should I download the contents of the links within the file in $url?


This requires parsing HTML, which is quite a challenge in PHP. To save yourself a lot of trouble, download an HTML parsing library such as PHPQuery (http://code.google.com/p/phpquery/). Then select all the links with pq('a'), loop through them to read each href attribute, convert each value from a relative to an absolute URL, and call file_get_contents on the result. Hopefully these pointers will get you started.
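The steps above can be sketched without a third-party library. Note that this swaps PHPQuery for PHP's built-in DOMDocument class, so it is not the PHPQuery approach itself. The function names extractLinks and toAbsoluteUrl are hypothetical, and the URL resolver is deliberately simplified: it assumes plain hrefs and does not handle "../" segments or hrefs that consist only of a query string or fragment.

```php
<?php
// Minimal sketch: collect the href value of every <a> element in an HTML string.
// Uses PHP's built-in DOM extension instead of PHPQuery (an assumption of this example).
function extractLinks(string $html): array
{
    if (trim($html) === '') {
        return []; // loadHTML() rejects empty input
    }

    $doc = new DOMDocument();
    // Real-world HTML is rarely valid, so collect libxml warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = [];
    foreach ($doc->getElementsByTagName('a') as $anchor) {
        $href = trim($anchor->getAttribute('href'));
        if ($href !== '') {
            $links[] = $href;
        }
    }
    return $links;
}

// Simplified relative-to-absolute conversion (hypothetical helper).
// Does NOT handle "../", query-only, or fragment-only hrefs.
function toAbsoluteUrl(string $href, string $baseUrl): string
{
    if ($href === '') {
        return $baseUrl;
    }
    // Already absolute: starts with a scheme such as "http:" or "mailto:".
    if (preg_match('#^[a-z][a-z0-9+.-]*:#i', $href)) {
        return $href;
    }

    $base   = parse_url($baseUrl);
    $scheme = $base['scheme'] ?? 'http';
    $host   = $base['host'] ?? '';
    $port   = isset($base['port']) ? ':' . $base['port'] : '';
    $origin = $scheme . '://' . $host . $port;

    // Protocol-relative URL, e.g. "//cdn.example.com/x".
    if (strpos($href, '//') === 0) {
        return $scheme . ':' . $href;
    }
    // Root-relative URL, e.g. "/about.html".
    if ($href[0] === '/') {
        return $origin . $href;
    }
    // Path-relative URL: resolve against the directory of the base path.
    $path     = $base['path'] ?? '/';
    $lastSlash = strrpos($path, '/');
    $dir      = $lastSlash === false ? '/' : substr($path, 0, $lastSlash + 1);
    return $origin . $dir . $href;
}
```

A typical use would be to pass the result of file_get_contents($url) to extractLinks, then map each href through toAbsoluteUrl($href, $url) before downloading it.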


So you want to find all URLs in a given file? Regex to the rescue. The sample code below should do what you want:

$file = file_get_contents($url);
if ($file === false) return;

// extract absolute http/https URLs from the file via regex
// (relative links such as href="/page.html" are not matched)
preg_match_all('/https?:\/\/[A-Z0-9_\-\.\/\?\#\=\&]*/i', $file, $urlmatches);

// $urlmatches[0] holds the full matches; it is empty when no URL was found
if (!empty($urlmatches[0])) {
    $urlmatches = $urlmatches[0];
    // count number of URLs
    $numberofmatches = count($urlmatches);
    echo "Found $numberofmatches URLs in $url\n";

    // write all found URLs line by line
    foreach ($urlmatches as $urlmatch) {
        echo "URL: $urlmatch...\n";
    }
}

EDIT: If I understand your question correctly, you now want to download the contents of the URLs found. You would do that in the foreach loop by calling file_get_contents for each URL, but you probably want to do some filtering beforehand (for example, skipping images).
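As a rough sketch of that filtering step, the code below skips URLs by file extension before downloading. The function names shouldDownload and downloadAll and the default list of skipped extensions are assumptions made for illustration; adjust the list to whatever you consider uninteresting.

```php
<?php
// Decide whether a URL is worth downloading, based on its file extension.
// The default skip list is an arbitrary example, not a complete one.
function shouldDownload(string $url, array $skipExtensions = ['jpg', 'jpeg', 'png', 'gif', 'css', 'js']): bool
{
    $path = parse_url($url, PHP_URL_PATH);
    if (!is_string($path)) {
        return true; // no path component, e.g. "http://example.com"
    }
    $extension = strtolower(pathinfo($path, PATHINFO_EXTENSION));
    return !in_array($extension, $skipExtensions, true);
}

// Download every URL that passes the filter; returns an array of url => contents.
function downloadAll(array $urls): array
{
    $pages = [];
    foreach (array_unique($urls) as $url) {
        if (!shouldDownload($url)) {
            continue;
        }
        // Suppress the warning on failure and check the return value instead.
        $content = @file_get_contents($url);
        if ($content !== false) {
            $pages[$url] = $content;
        }
    }
    return $pages;
}
```

With the regex example above, this would be called as downloadAll($urlmatches) once $urlmatches holds the list of found URLs.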


You'll need to parse the resulting HTML string, either manually or with a third-party library.

HTML Scraping in Php
