Would there be any performance benefits to doing this? (PHP question)
I'm creating a site spider that grabs all the links from a web page as well as that page's html source code. It then checks all the links it has found and keeps only the internal ones. Next it goes to each of those internal pages and repeats the above process.
Basically its job is to crawl all the pages under a specified domain and grab each page's source. The reason for this is that I want to run some checks to see whether certain keywords appear on any of the pages, and to list each page's meta information.
I would like to know whether I should run these checks on the HTML during the crawling phase of each page, or whether I should save all the HTML (in an array, for example) and run the checks at the very end. Which would be better performance-wise?
It seems like you may very well run into memory issues if you try to save all the data in memory for processing later. You may be able to use the curl_multi_* functions to process pages efficiently while fetching them.
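For illustration, here is a minimal sketch (not from the original answer) of how the curl_multi_* functions could be used: fetch a small batch of URLs concurrently and run the checks on each response as soon as the batch finishes, so no page's HTML has to be kept longer than necessary. The URLs are placeholders:

$urls = array(
    "http://example.com/page-a.html", // hypothetical URLs to crawl
    "http://example.com/page-b.html",
);

$mh = curl_multi_init();
$handles = array();
foreach ($urls as $url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
    curl_multi_add_handle($mh, $ch);
    $handles[$url] = $ch;
}

// drive all transfers until they are finished
do {
    $status = curl_multi_exec($mh, $running);
    if ($running) {
        curl_multi_select($mh); // wait for activity instead of busy-looping
    }
} while ($running && $status == CURLM_OK);

foreach ($handles as $url => $ch) {
    $html = curl_multi_getcontent($ch);
    // run the keyword/meta checks on $html right here, then let it go out of scope
    curl_multi_remove_handle($mh, $ch);
    curl_close($ch);
}
curl_multi_close($mh);

Processing inside that last loop means only the findings stay in memory, not every page's source.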
You should use either phpQuery or QueryPath or one of the alternatives listed here: How do you parse and process HTML/XML in PHP?
This simplifies fetching the pages, as well as extracting the links. Basically you just need something like:
$page = qp("http://example.org/"); // QueryPath
foreach ($page->find("a") as $link) {
    print $link->attr("href");
    // test if local link, then fetch next page ...
}
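To make the "test if local link" comment a bit more concrete, here is a rough sketch of one way to do it (my own addition, using plain parse_url rather than any phpQuery/QueryPath helper); $baseHost would be the domain you are crawling, and $pool the work list shown further down:

$baseHost = parse_url("http://example.org/", PHP_URL_HOST);
$href = $link->attr("href");
$linkHost = parse_url($href, PHP_URL_HOST);
if ($linkHost === null || $linkHost === $baseHost) {
    // relative URL or same host: treat it as internal and queue it for crawling
    $pool[] = $href;
}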
phpQuery has some more functions which simplify crawling (turning local links into absolute URLs, etc.), but you'll have to consult the documentation. You might also need a better approach for recursion, maybe a page/URL stack to work on:
$pool = array();    // URLs still to crawl
$visited = array(); // URLs already crawled, to avoid a never-ending loop
$pool[] = "http://example.com/first-url.html"; // to begin with
while ($url = array_pop($pool)) {
    if (isset($visited[$url])) {
        continue; // skip pages we have already seen
    }
    $visited[$url] = true;
    // fetch the page here
    // add newly found internal links to $pool[] = ...
}
It's not something you should try to over-optimize. Run it as a standalone script and process each page individually.
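As a rough illustration of running the checks during crawling (again my own sketch, with a placeholder keyword), assuming $html holds the source of the page just fetched, $url its address, and $results an array initialized once before the crawl loop:

// check for the keyword in the raw source
if (stripos($html, "your keyword") !== false) { // hypothetical keyword
    $results[$url]["keyword"] = true;
}

// collect the page's meta tags
$doc = new DOMDocument();
@$doc->loadHTML($html); // suppress warnings caused by real-world markup
foreach ($doc->getElementsByTagName("meta") as $meta) {
    $results[$url]["meta"][$meta->getAttribute("name")] = $meta->getAttribute("content");
}
// only the findings in $results are kept; the page's full HTML can be discarded

This way only the findings accumulate, while each page's source can be thrown away as soon as it has been checked.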