recursion problems
I'm grabbing links from a website, but I'm having a problem: the higher I set the recursion depth for the function, the stranger the results become.
For example, when I call the function like this:
crawl_page("http://www.mangastream.com/", 10);
I get results like this for about half the page:
http://mangastream.com/read/naruto/51619850/1/read/naruto/51619850/2/read/naruto/51619850/2/read/naruto/51619850/2/read/naruto/51619850/2/read/naruto/51619850/2/read/naruto/51619850/2/read/naruto/51619850/2
EDIT
I'm expecting results like this instead:
http://mangastream.com/manga/read/naruto/51619850/1
Here's the function I've been using to get the results:
function crawl_page($url, $depth)
{
    static $seen = array();
    if (isset($seen[$url]) || $depth === 0) {
        return;
    }
    $seen[$url] = true;

    $dom = new DOMDocument('1.0');
    @$dom->loadHTMLFile($url);

    $anchors = $dom->getElementsByTagName('a');
    foreach ($anchors as $element) {
        $href = $element->getAttribute('href');
        if (0 !== strpos($href, 'http')) {
            $href = rtrim($url, '/') . '/' . ltrim($href, '/');
        }
        if (shouldScrape($href) == true) {
            crawl_page($href, $depth - 1);
        }
    }
    echo $url, "\r";
    //,pageStatus($url)
}
Any help with this would be greatly appreciated.
The construction of your new URL is not correct. Replace:
$href = rtrim($url, '/') . '/' . ltrim($href, '/');
with:
if (substr($href, 0, 1) == '/') {
    // href relative to the document root
    $info = parse_url($url);
    $href = $info['scheme'] . '://' . $info['host'] . $href;
} else {
    // href relative to the current path
    $href = rtrim(dirname($url), '/') . '/' . $href;
}
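For instance, wrapped as a standalone helper (the function name and example URLs are mine, not from the question), the logic looks like this:

```php
<?php
// Resolve an href against the page URL it was found on.
// Absolute hrefs (starting with "http") are returned unchanged;
// root-relative hrefs ("/path") get only scheme + host prepended;
// path-relative hrefs are resolved against the current directory.
function resolve_href($url, $href)
{
    if (0 === strpos($href, 'http')) {
        return $href;
    }
    if (substr($href, 0, 1) == '/') {
        $info = parse_url($url);
        return $info['scheme'] . '://' . $info['host'] . $href;
    }
    return rtrim(dirname($url), '/') . '/' . $href;
}

echo resolve_href('http://mangastream.com/manga', '/read/naruto/51619850/1'), "\n";
// http://mangastream.com/read/naruto/51619850/1
```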
I think your problem lies in this line:
$href = rtrim($url, '/') . '/' . ltrim($href, '/');
This statement prepends the full current page URL to every relative URL on the page, which is obviously not what you want. What you need is to prepend only the scheme and host part of the URL.
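You can see the compounding effect with the URLs from your own output: once the crawler recurses, each root-relative href gets the previous result glued onto it.

```php
<?php
// The original concatenation from the question, applied to a
// root-relative href found on an already-resolved page URL.
$url  = 'http://mangastream.com/read/naruto/51619850/1';
$href = '/read/naruto/51619850/2';
echo rtrim($url, '/') . '/' . ltrim($href, '/'), "\n";
// http://mangastream.com/read/naruto/51619850/1/read/naruto/51619850/2
```

That matches the repeating paths you posted: every level of recursion appends another `/read/naruto/...` segment.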
Something like this should fix your problem (untested):
$url_parts = parse_url($url);
$href = $url_parts['scheme'] . '://' . $url_parts['host'] . $href;