Would there be any performance benefits to doing this? (PHP question)
I'm creating a site spider that grabs all the links from a web page as well as that page's html source code. It then checks all the links it has found and keeps only the internal ones. Next it goes to each of those internal pages and repeats the above process.
Basically its job is to crawl all the pages under a specified domain and grab each page's source. The reason for this is that I want to run some checks to see whether certain keywords appear on any of the pages, and to list each page's meta information.
I would like to know whether I should run these checks on the HTML during the crawling phase of each page, or whether I should save all the HTML (in an array, for example) and run the checks at the very end. Which would be better performance-wise?
It seems like you may very well run into memory issues if you try to save all the data in memory for processing later. You may be able to use the curl_multi_* functions to process pages efficiently while fetching them.
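For illustration, here is a minimal sketch (not from the original answer) of how the curl_multi_* functions could be used: fetch a small batch of URLs concurrently and run the checks on each response as soon as the batch finishes, so no page's HTML has to be kept longer than necessary. The URLs are placeholders:

$urls = array(
    "http://example.com/page-a.html", // hypothetical URLs to crawl
    "http://example.com/page-b.html",
);

$mh = curl_multi_init();
$handles = array();
foreach ($urls as $url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
    curl_multi_add_handle($mh, $ch);
    $handles[$url] = $ch;
}

// drive all transfers until they are finished
do {
    $status = curl_multi_exec($mh, $running);
    if ($running) {
        curl_multi_select($mh); // wait for activity instead of busy-looping
    }
} while ($running && $status == CURLM_OK);

foreach ($handles as $url => $ch) {
    $html = curl_multi_getcontent($ch);
    // run the keyword/meta checks on $html right here, then let it go out of scope
    curl_multi_remove_handle($mh, $ch);
    curl_close($ch);
}
curl_multi_close($mh);

Processing inside that last loop means only the findings stay in memory, not every page's source.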
You should use either phpQuery or QueryPath or one of the alternatives listed here: How do you parse and process HTML/XML in PHP?
This simplifies fetching the pages, as well as extracting the links. Basically you just need something like:
$page = qp("http://example.org/"); // QueryPath
foreach ($page->find("a") as $link) {
    print $link->attr("href");
    // test if local link, then fetch next page ...
}
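To make the "test if local link" comment a bit more concrete, here is a rough sketch of one way to do it (my own addition, using plain parse_url rather than any phpQuery/QueryPath helper); $baseHost would be the domain you are crawling, and $pool the work list shown further down:

$baseHost = parse_url("http://example.org/", PHP_URL_HOST);
$href = $link->attr("href");
$linkHost = parse_url($href, PHP_URL_HOST);
if ($linkHost === null || $linkHost === $baseHost) {
    // relative URL or same host: treat it as internal and queue it for crawling
    $pool[] = $href;
}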
phpQuery has some more functions which simplify crawling (turning local links into absolute URLs, etc.), but you'll have to consult the documentation. You might also need a better approach for recursion, maybe a page/URL stack to work on:
$pool = array();    // URLs still to crawl
$visited = array(); // URLs already crawled, to avoid a never-ending loop
$pool[] = "http://example.com/first-url.html"; // to begin with
while ($url = array_pop($pool)) {
    if (isset($visited[$url])) {
        continue; // skip pages we have already seen
    }
    $visited[$url] = true;
    // fetch the page here
    // add newly found internal links to $pool[] = ...
}
It's not something you should try to over-optimize. Run it as a standalone script and process each page individually.
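As a rough illustration of running the checks during crawling (again my own sketch, with a placeholder keyword), assuming $html holds the source of the page just fetched, $url its address, and $results an array initialized once before the crawl loop:

// check for the keyword in the raw source
if (stripos($html, "your keyword") !== false) { // hypothetical keyword
    $results[$url]["keyword"] = true;
}

// collect the page's meta tags
$doc = new DOMDocument();
@$doc->loadHTML($html); // suppress warnings caused by real-world markup
foreach ($doc->getElementsByTagName("meta") as $meta) {
    $results[$url]["meta"][$meta->getAttribute("name")] = $meta->getAttribute("content");
}
// only the findings in $results are kept; the page's full HTML can be discarded

This way only the findings accumulate, while each page's source can be thrown away as soon as it has been checked.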