How can I restrict this PHP code to crawl only links that have the same base as the given domain?
I am writing a crawler, but I want it to follow only the links that share the same base as the given domain. For example, if you set a large depth and a page contains a link to your Twitter profile, the crawler will scan Twitter as well and return results like twitter.com/xxxyyyzzz.
What I want is to restrict the code to crawl only the URLs that have the same base. I don't mind setting the domain again in a new variable.
Filtering the results and showing only the matching links at the end is not the right approach; imagine finding 1000 links when you only want the 10 that belong to the domain.
Thank you for the ideas. (The correct code is in the answer below.)
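The core idea is to compare the host of each discovered link against the host of the start URL using parse_url(); links without a host component are relative and therefore stay on the same site. A minimal sketch of that check (the helper name is_same_host is illustrative, not part of the answer below):

function is_same_host($base_url, $href) {
    // Relative URLs have no host component and always stay on the same site.
    $base = parse_url($base_url);
    $link = parse_url($href);
    if ($link === false || !isset($link['host'])) {
        return true;
    }
    return isset($base['host']) && strcasecmp($link['host'], $base['host']) === 0;
}

// is_same_host('http://example.com/page', '/about')                -> true
// is_same_host('http://example.com/page', 'http://example.com/x')  -> true
// is_same_host('http://example.com/page', 'http://twitter.com/x')  -> false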
MODIFIED
Try this on for size
function crawl_page($url, $depth = 2) {
    // Remember every URL already visited so nothing is crawled twice.
    static $seen = array();
    if (isset($seen[$url]) || $depth == 0) {
        return;
    }
    $seen[$url] = true;

    $parts = parse_url($url);
    $dom = new DOMDocument('1.0');
    if (!$parts || !@$dom->loadHTMLFile($url)) {
        return;
    }

    $anchors = $dom->getElementsByTagName('a');
    foreach ($anchors as $anchor) {
        $href = $anchor->getAttribute('href');
        $path = false;

        if (0 !== strpos($href, 'http') && 0 !== strpos($href, 'javascript:')) {
            // Relative link: rebuild an absolute URL on the current host.
            $path = '/' . ltrim($href, '/');
            if (extension_loaded('http')) {
                $path = http_build_url($url, array('path' => $path));
            }
            else {
                // Reassemble scheme://[user[:pass]@]host[:port]/path by hand.
                $href = "{$parts['scheme']}://";
                if (isset($parts['user'])) {
                    $href .= $parts['user'];
                    if (isset($parts['pass'])) {
                        $href .= ":{$parts['pass']}";
                    }
                    $href .= '@';
                }
                $href .= $parts['host'];
                if (isset($parts['port'])) {
                    $href .= ':' . $parts['port'];
                }
                $path = $href . $path;
            }
        }
        else {
            // Absolute link: follow it only when scheme and host match the
            // page being crawled, so external sites like twitter.com are skipped.
            $href_parts = parse_url($href);
            if (isset($href_parts['scheme'], $href_parts['host'])
                && $href_parts['host'] == $parts['host']
                && $href_parts['scheme'] == $parts['scheme']) {
                $path = $href;
            }
        }

        if (!empty($path) && $depth - 1 != 0) {
            crawl_page($path, $depth - 1);
        }
    }

    echo "Crawled: {$url}\n";
}
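A quick way to try it out (the start URL and depth here are just placeholders):

// Crawl example.com and any same-host links it contains, two levels deep.
crawl_page('http://example.com/', 2);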