Recursion problems

I'm grabbing links from a website, but the higher I set the recursion depth of the function, the stranger the results become.

For example, when I call the function like this:

crawl_page("http://www.mangastream.com/", 10);

I get results like this for about half the page:

http://mangastream.com/read/naruto/51619850/1/read/naruto/51619850/2/read/naruto/51619850/2/read/naruto/51619850/2/read/naruto/51619850/2/read/naruto/51619850/2/read/naruto/51619850/2/read/naruto/51619850/2

EDIT

I'm expecting results like this instead:

http://mangastream.com/manga/read/naruto/51619850/1

Here's the function I've been using to get the results:

function crawl_page($url, $depth)
{
    static $seen = array();
    if (isset($seen[$url]) || $depth === 0) {
        return;
    }
    $seen[$url] = true;

    $dom = new DOMDocument('1.0');
    @$dom->loadHTMLFile($url);

    $anchors = $dom->getElementsByTagName('a');
    foreach ($anchors as $element) {
        $href = $element->getAttribute('href');
        if (0 !== strpos($href, 'http')) {
            $href = rtrim($url, '/') . '/' . ltrim($href, '/');
        }
        if (shouldScrape($href) == true) {
            crawl_page($href, $depth - 1);
        }
    }
    echo $url,"\r";
//,pageStatus($url)
}

Any help with this would be greatly appreciated.


The construction of your new URL is not correct. Replace:

$href = rtrim($url, '/') . '/' . ltrim($href, '/');

with:

if (substr($href, 0, 1)=='/') {
  // href relative to root
  $info = parse_url($url);
  $href = $info['scheme'].'://'.$info['host'].$href;
} else {
  // href relative to current path
  $href = rtrim(dirname($url), '/') . '/' . $href;
}
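
To see why the original line produces the repeated path segments, here is a minimal sketch. The function name naive_join is hypothetical, and the inputs assume the site links to chapters with root-relative hrefs such as /read/naruto/51619850/2, which is a guess based on the output in the question.

```php
<?php
// Hypothetical helper wrapping the concatenation from the question.
function naive_join(string $url, string $href): string
{
    return rtrim($url, '/') . '/' . ltrim($href, '/');
}

// Assumed inputs: a chapter page and a root-relative link found on it.
$page = 'http://mangastream.com/read/naruto/51619850/1';
$next = naive_join($page, '/read/naruto/51619850/2');
echo $next, "\n";

// Crawling the bad URL one level deeper appends the path yet again.
echo naive_join($next, '/read/naruto/51619850/2'), "\n";
```

Each extra level of recursion glues another copy of the link's path onto the already wrong URL, which would explain why the results get stranger as the depth grows.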


I think your problem lies in this line:

$href = rtrim($url, '/') . '/' . ltrim($href, '/');

This statement prepends the full URL of the current page to every relative URL found on it, which is not what you want: each level of recursion adds another copy of the path. What you need is to prepend only the scheme and host part of the URL.

Something like this should fix your problem (untested):

$url_parts = parse_url($url);
$href = $url_parts['scheme'] . '://' . $url_parts['host'] . '/' . ltrim($href, '/');
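
Putting the pieces together, a self-contained helper could look like the sketch below. The name resolve_href is hypothetical and the code is only lightly reasoned through, not a full RFC 3986 resolver: it ignores "../" segments, query-only links and fragments. It works on the path returned by parse_url() rather than calling dirname() on the whole URL, since dirname() is meant for filesystem paths and can give surprising results for a page at the site root.

```php
<?php
// Hypothetical helper: resolve $href found on page $base to an absolute URL.
function resolve_href(string $base, string $href): string
{
    // Already absolute: leave untouched.
    if (preg_match('#^https?://#i', $href)) {
        return $href;
    }

    $parts = parse_url($base);

    // Protocol-relative link (//host/path): reuse the page's scheme.
    if (substr($href, 0, 2) === '//') {
        return $parts['scheme'] . ':' . $href;
    }

    $root = $parts['scheme'] . '://' . $parts['host'];
    if (isset($parts['port'])) {
        $root .= ':' . $parts['port'];
    }

    // Root-relative link: append directly to scheme and host.
    if (substr($href, 0, 1) === '/') {
        return $root . $href;
    }

    // Path-relative link: keep the directory part of the current path.
    $path = $parts['path'] ?? '/';
    $dir  = substr($path, 0, strrpos($path, '/') + 1);
    return $root . $dir . $href;
}
```

Inside crawl_page, the whole strpos check and concatenation could then be replaced with $href = resolve_href($url, $href);.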