What PHP web crawler libraries are available?

Posted: 2023-02-07 18:45

I'm looking for some robust, well-documented PHP web crawler scripts, perhaps a PHP port of the Java project Apache Nutch: http://wiki.apache.org/nutch/NutchTutorial

I'm looking for both free and non-free options.

Just give Snoopy a try.

Excerpt: "Snoopy is a PHP class that simulates a web browser. It automates the task of retrieving web page content and posting forms, for example."

https://github.com/fabpot/Goutte is also a good library, compatible with the PSR-0 standard.

You can use PHP Simple HTML DOM Parser. It's really simple and useful.

I had been using Simple HTML DOM for about 3 years before I discovered phpQuery. phpQuery is a lot faster, does not work recursively (so you can actually dump it), and has full support for jQuery selectors and methods.

There is a great tutorial here that combines guzzlehttp and symfony/dom-crawler.

In case the link is lost, here is the code you can use.

// Assumes the packages were installed with Composer, for example:
//   composer require guzzlehttp/guzzle symfony/dom-crawler symfony/css-selector
require __DIR__ . '/vendor/autoload.php';

use GuzzleHttp\Client;
use Symfony\Component\DomCrawler\Crawler;

// create the HTTP client; the trailing slash on base_uri matters,
// because relative paths are resolved against it
$client = new Client(['base_uri' => 'http://download.cloud.com/releases/']);

// send a GET request (the path has no leading slash, so "releases/" is kept)
$response = $client->request('GET', '3.0.6/api_3.0.6/TOC_Domain_Admin.html');

// get status code
$status = $response->getStatusCode();

// this is the response body from the requested page (usually HTML)
$html = (string) $response->getBody();

// create a crawler instance from the body HTML
$crawler = new Crawler($html);

// apply CSS selector filter (requires symfony/css-selector)
$filter = $crawler->filter('div.apismallbullet_box');
$result = array();

if (count($filter) > 0) {

    // iterate over filter results
    foreach ($filter as $i => $content) {

        // create a crawler instance for this result node
        $node = new Crawler($content);

        // extract the values needed; text() throws if the node has no h5
        $topic = $node->filter('h5')->text();
        $result[$i] = array(
            'topic' => $topic,
            'className' => trim(str_replace(' ', '', $topic)) . 'Client',
        );
    }
} else {
    throw new RuntimeException('Got empty result processing the dataset!');
}

If you are looking for a strong base component, then give the Symfony DomCrawler component a try: http://symfony.com/doc/2.0/components/dom_crawler.html

It is amazing and has features like CSS selectors.
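To make the selector idea concrete without installing anything, here is a minimal sketch that uses only PHP's bundled DOM extension. It is not the DomCrawler API: the helper name extractTopics() and the sample HTML are made up for illustration, and the XPath expression is a hand-written stand-in for the CSS selector "div.apismallbullet_box h5".

```php
<?php
// Sketch only: bundled DOM extension, not DomCrawler.
// extractTopics() and the sample HTML below are hypothetical.

function extractTopics(string $html): array
{
    $doc = new DOMDocument();
    // Collect libxml warnings internally instead of printing them,
    // since real-world HTML is rarely perfectly valid.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // XPath stand-in for the CSS selector "div.apismallbullet_box h5":
    // match the class as a whole word inside the class attribute.
    $query = '//div[contains(concat(" ", normalize-space(@class), " "), '
        . '" apismallbullet_box ")]//h5';

    $topics = [];
    foreach ($xpath->query($query) as $node) {
        $topics[] = trim($node->textContent);
    }
    return $topics;
}

$html = '<html><body>'
    . '<div class="apismallbullet_box"><h5>Virtual Machine</h5></div>'
    . '<div class="other"><h5>Ignored</h5></div>'
    . '<div class="box apismallbullet_box"><h5> Template </h5></div>'
    . '</body></html>';

print_r(extractTopics($html));
```

A selector library saves you from writing the XPath by hand, which is most of the appeal once the selectors get more complicated than this one.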

I know this is a fairly old question. A lot of useful libraries have come out since then.

Give Crawlzone a shot. It is a fast, well-documented, asynchronous internet crawling framework with a lot of great features:

Asynchronous crawling with customizable concurrency.
Automatic throttling of crawling speed based on the load of the website you are crawling.
If configured, automatically filters out requests forbidden by the robots.txt exclusion standard.
Straightforward middleware system allows you to append headers, extract data, filter or plug any custom functionality to process the request and response.
Rich filtering capabilities.
Ability to set the crawling depth.
Easy to extend the core by hooking into the crawling process using events.
Shut down the crawler at any time and start over without losing progress.
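As a rough picture of what a crawling depth limit means, here is a minimal breadth-first sketch in plain PHP. It is not Crawlzone's API: crawlToDepth() is a hypothetical helper, and the $links array is an in-memory stand-in for fetching pages and extracting their links.

```php
<?php
// Sketch only: illustrates depth-limited crawling, not Crawlzone's API.
// crawlToDepth() and $links are hypothetical; $links replaces real HTTP fetches.

function crawlToDepth(array $links, string $start, int $maxDepth): array
{
    $visited = [$start => 0];   // URL => depth at which it was first seen
    $queue = [[$start, 0]];     // FIFO queue gives breadth-first order

    while ($queue) {
        [$url, $depth] = array_shift($queue);
        if ($depth >= $maxDepth) {
            continue;           // do not follow links beyond the limit
        }
        foreach ($links[$url] ?? [] as $next) {
            if (!isset($visited[$next])) {   // skip pages already seen
                $visited[$next] = $depth + 1;
                $queue[] = [$next, $depth + 1];
            }
        }
    }
    return $visited;
}

$links = [
    '/'    => ['/a', '/b'],
    '/a'   => ['/a/1', '/'],
    '/b'   => ['/a'],
    '/a/1' => ['/a/1/x'],
];

print_r(crawlToDepth($links, '/', 2));
```

With a limit of 2, the page /a/1 is still recorded but its own links are not followed, so /a/1/x is never visited.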

Also check out the article I wrote about it:

https://www.codementor.io/zstate/this-is-how-i-crawl-n98s6myxm

Has nobody mentioned wget as a good starting point?

wget -r --level=10 -nd http://www.mydomain.com/

More at http://www.erichynds.com/ubuntulinux/automatically-crawl-a-website-looking-for-errors/
