How to speed up a class that checks links in HTML?

I have cobbled together a class that checks links. It works, but it is slow. The class parses an HTML string and returns all invalid links found in href and src attributes. Here is how I use it:

$class = new Validurl(array('html' => file_get_contents('http://google.com')));
$invalid_links = $class->check_links();
print_r($invalid_links);

With HTML that has a lot of links it becomes really slow. I know it has to go through each link and follow it, but maybe someone with more experience can give me a few pointers on how to speed it up.

Here's the code:

class Validurl {

    private $html = '';

    public function __construct($params) {
        $this->html = $params['html'];
    }

    // Collect every link, then keep only the ones that fail validation.
    public function check_links() {
        $invalid_links = array();
        foreach ($this->get_links() as $link) {
            if (!$this->is_valid_url($link['url'])) {
                $invalid_links[] = $link;
            }
        }
        return $invalid_links;
    }

    // Parse the HTML and collect the URL of every <a href> and <img src>.
    private function get_links() {
        $xml = new DOMDocument();
        @$xml->loadHTML($this->html);

        $links = array();
        foreach ($xml->getElementsByTagName('a') as $link) {
            $links[] = array('type' => 'url', 'url' => $link->getAttribute('href'), 'text' => $link->nodeValue);
        }
        foreach ($xml->getElementsByTagName('img') as $link) {
            $links[] = array('type' => 'img', 'url' => $link->getAttribute('src'));
        }
        return $links;
    }

    // A URL counts as valid if get_headers() gets any response for it.
    private function is_valid_url($url) {
        if (strpos($url, "http") === false) {
            $url = "http://" . $url;  // prefix protocol-less URLs
        }
        return is_array(@get_headers($url));  // get_headers() returns false on failure
    }
}


First of all, I would not push the links and images into an array and then iterate through that array, when you can iterate directly over the results of getElementsByTagName(). You'd have to do it twice, once for <a> tags and once for <img> tags, but if you separate the checking logic into a function, you just call that function for each pass.
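A minimal sketch of that idea (collect_invalid() and the $check_url callback are my own hypothetical names, not part of the original class):

function collect_invalid(DOMDocument $xml, $tag, $attr, $check_url)
{
    // Walk the node list directly; no intermediate array of all links.
    $invalid = array();
    foreach ($xml->getElementsByTagName($tag) as $node) {
        $url = $node->getAttribute($attr);
        if (!call_user_func($check_url, $url)) {
            $invalid[] = array('type' => $tag, 'url' => $url);
        }
    }
    return $invalid;
}

// One pass per tag/attribute pair:
// $bad = array_merge(
//     collect_invalid($xml, 'a',   'href', 'is_valid_url'),
//     collect_invalid($xml, 'img', 'src',  'is_valid_url')
// );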

Second, get_headers() is slow, judging by the comments on its PHP manual page. You should use cURL instead, something like this (found in a comment on the same page):

function get_headers_curl($url)
{
    $ch = curl_init();

    curl_setopt($ch, CURLOPT_URL,            $url);
    curl_setopt($ch, CURLOPT_HEADER,         true);  // include headers in the output
    curl_setopt($ch, CURLOPT_NOBODY,         true);  // HEAD request: skip the body
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);  // return instead of printing
    curl_setopt($ch, CURLOPT_TIMEOUT,        15);    // give up after 15 seconds

    $r = curl_exec($ch);
    curl_close($ch);

    // split() is deprecated (removed in PHP 7); explode() does the same job here.
    return explode("\n", $r);
}
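If all you really need is a reachable/unreachable answer, a variant (my own suggestion, not from the manual comment) is to skip header parsing entirely and ask cURL for the status code:

function url_is_reachable($url)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_NOBODY,         true);   // HEAD request only
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);   // don't print the response
    curl_setopt($ch, CURLOPT_TIMEOUT,        15);
    curl_exec($ch);
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);  // 0 if the request failed
    curl_close($ch);
    return $status >= 200 && $status < 400;           // treat redirects as OK
}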

UPDATE: and yes, some kind of caching could also help, e.g. an SQLite database with one table mapping each link to its last result, which you could purge once a day or so.
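A hypothetical sketch of such a cache using PDO (the table name, file path, and one-day TTL are my own choices, not anything prescribed):

$db = new PDO('sqlite:/tmp/link_cache.sqlite');
$db->exec('CREATE TABLE IF NOT EXISTS link_cache (
    url        TEXT PRIMARY KEY,
    is_valid   INTEGER,
    checked_at INTEGER
)');

// Purge results older than a day, as suggested above.
$db->exec('DELETE FROM link_cache WHERE checked_at < ' . (time() - 86400));

function cached_is_valid(PDO $db, $url, $checker)
{
    $stmt = $db->prepare('SELECT is_valid FROM link_cache WHERE url = ?');
    $stmt->execute(array($url));
    $row = $stmt->fetch(PDO::FETCH_ASSOC);
    if ($row !== false) {
        return (bool) $row['is_valid'];       // cache hit: skip the HTTP request
    }
    $valid = call_user_func($checker, $url);  // cache miss: really check the link
    $ins = $db->prepare('INSERT INTO link_cache (url, is_valid, checked_at) VALUES (?, ?, ?)');
    $ins->execute(array($url, (int) $valid, time()));
    return $valid;
}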


You could cache the results (in a DB, e.g. a key-value store), so that your validator assumes that once a link has been found valid, it stays valid for 24 hours, a week, or something like that.
