
How can I restrict this PHP crawler to links that share the same base as the given domain?

I am writing a crawler, and I want it to crawl all the links that have the same base. For example, if you set a large depth and your page links to your Twitter profile, it will scan Twitter as well and give you results like twitter.com/xxxyyyzzz.

What I want is to restrict the code to crawl only the URLs that have the same base. I don't mind if I have to set the domain again in a new variable.

Filtering the results and showing only the correct links at the end is not the right approach. Imagine finding 1000 links when you only want 10 of them.

Thank you for the ideas. (the correct code is in the answer)


MODIFIED

Try this on for size

function crawl_page($url, $depth = 2) {
    static $seen = array();
    if (isset($seen[$url]) || $depth == 0) {
        return;
    }
    
    $seen[$url] = true;
    $parts = parse_url($url);
    $dom = new DOMDocument('1.0');
    if (!$parts || !@$dom->loadHTMLFile($url)) {
        return;
    }
    
    $anchors = $dom->getElementsByTagName('a');
    foreach ($anchors as $anchor) {
        $href = $anchor->getAttribute('href');
        $path = false;
        if (0 !== strpos($href, 'http') && 0 !== strpos($href, 'javascript:')) {
            $path = '/' . ltrim($href, '/');
            if (extension_loaded('http')) {
                $path = http_build_url($url, array('path' => $path));
            }
            else {
                $href = "{$parts['scheme']}://";
                if (isset($parts['user'])) {
                    $href .= $parts['user'];
                    if (isset($parts['pass'])) {
                        $href .= ":{$parts['pass']}";
                    }
                    $href .= '@';
                }
                $href .= $parts['host'];
                if (isset($parts['port'])) {
                    $href .= ':' . $parts['port'];
                }
                $path = $href . $path;
            }
        }
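        else {
            // Absolute link: follow it only if it stays on the same scheme and host.
            // parse_url() may return false or omit 'host'/'scheme', so check first.
            $href_parts = parse_url($href);
            if (isset($href_parts['host'], $href_parts['scheme'], $parts['host'], $parts['scheme'])
                && $href_parts['host'] == $parts['host']
                && $href_parts['scheme'] == $parts['scheme']) {
                $path = $href;
            }
        }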
        if (!empty($path) && $depth - 1 != 0) {
            crawl_page($path, $depth - 1);
        }
    }
    echo "Crawled: {$url}\n";
}
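
If it helps to see the restriction on its own, the same-host test from the loop above can be pulled out into a small helper. This is only a minimal sketch: the name is_same_host is made up for illustration and is not part of the code above, and it assumes "same base" means the same scheme and host, so subdomains such as www. count as a different site.

```php
<?php
// Hypothetical helper (the name is an assumption, not from the original code):
// returns true when $href is an absolute URL on the same scheme and host as $base.
function is_same_host(string $href, string $base): bool
{
    $a = parse_url($href);
    $b = parse_url($base);

    // parse_url() returns false for seriously malformed URLs.
    if (!is_array($a) || !is_array($b)) {
        return false;
    }

    // Relative links have no 'host' or 'scheme' key, so they are rejected here;
    // resolve them against the page URL before calling this helper.
    if (!isset($a['host'], $a['scheme'], $b['host'], $b['scheme'])) {
        return false;
    }

    // Schemes and host names are case-insensitive.
    return strcasecmp($a['host'], $b['host']) === 0
        && strcasecmp($a['scheme'], $b['scheme']) === 0;
}
```

With this, is_same_host('http://twitter.com/xxxyyyzzz', 'http://example.com/') returns false, so the crawler would never fetch the Twitter page in the first place instead of filtering it out at the end.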