
What PHP web crawler libraries are available?

I'm looking for some robust, well-documented PHP web crawler scripts. Perhaps a PHP port of the Java project - http://wiki.apache.org/nutch/NutchTutorial

I'm looking for both free and non-free versions.


Just give Snoopy a try.

Excerpt: "Snoopy is a PHP class that simulates a web browser. It automates the task of retrieving web page content and posting forms, for example."


https://github.com/fabpot/Goutte is also a good library, compliant with the PSR-0 standard.
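
A quick sketch of typical Goutte usage (example.com is a placeholder): request() performs the HTTP call and returns a Symfony DomCrawler instance you can filter with CSS selectors.

require 'vendor/autoload.php';

use Goutte\Client;

$client = new Client();

// request() returns a DomCrawler instance for the fetched page
$crawler = $client->request('GET', 'http://www.example.com/');

// iterate over every link on the page
$crawler->filter('a')->each(function ($node) {
    echo $node->attr('href') . "\n";
});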


You can use PHP Simple HTML DOM Parser. It's really simple and useful.
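
For example, grabbing every link from a page takes just a few lines (example.com is a placeholder):

include 'simple_html_dom.php';

// load the page directly from a URL
$html = file_get_html('http://www.example.com/');

// find all anchors and print their targets
foreach ($html->find('a') as $element) {
    echo $element->href . "\n";
}

// free the DOM when done; the parser holds circular references
$html->clear();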


I had been using Simple HTML DOM for about 3 years before I discovered phpQuery. phpQuery is a lot faster, doesn't work recursively (so you can actually dump it), and has full support for jQuery selectors and methods.
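
For comparison, a minimal phpQuery sketch (the require path and example.com URL are placeholders); pq() works like jQuery's $():

require 'phpQuery/phpQuery.php';

// load the markup into phpQuery (newDocument accepts an HTML string)
$html = file_get_contents('http://www.example.com/');
phpQuery::newDocument($html);

// iterate over all links, wrapping each node with pq() to query it
foreach (pq('a') as $link) {
    echo pq($link)->attr('href') . "\n";
}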


There is a great tutorial here which combines guzzlehttp and symfony/dom-crawler.

In case the link is lost, here is the code you can make use of.

use GuzzleHttp\Client;
use Symfony\Component\DomCrawler\Crawler;
use RuntimeException;

// create an HTTP client instance with a base URI
$client = new Client(['base_uri' => 'http://download.cloud.com/releases']);

// send a GET request for the page we want to crawl
$response = $client->request('GET', '/3.0.6/api_3.0.6/TOC_Domain_Admin.html');

// get the HTTP status code
$status = $response->getStatusCode();

// create a crawler instance from the response body (usually HTML)
$crawler = new Crawler((string) $response->getBody());

// apply a CSS selector filter
$filter = $crawler->filter('div.apismallbullet_box');
$result = array();

if (iterator_count($filter) > 0) {

    // iterate over the filter results
    foreach ($filter as $i => $content) {

        // create a crawler instance for each matched node
        $crawler = new Crawler($content);
        // extract the values needed
        $topic = $crawler->filter('h5')->text();
        $result[$i] = array(
            'topic' => $topic,
            'className' => trim(str_replace(' ', '', $topic)) . 'Client',
        );
    }
} else {
    throw new RuntimeException('Got empty result processing the dataset!');
}


If you are thinking about a strong base component, then give http://symfony.com/doc/2.0/components/dom_crawler.html a try.

It is amazing, having features like CSS selectors.
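
Used standalone, the component takes just a couple of lines (example.com is a placeholder); note that filter() with CSS selectors additionally requires the symfony/css-selector component:

use Symfony\Component\DomCrawler\Crawler;

// load any HTML string into the crawler
$html = file_get_contents('http://www.example.com/');
$crawler = new Crawler($html);

// filter() takes CSS selectors (requires symfony/css-selector)
echo $crawler->filter('title')->text() . "\n";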


I know it is a bit of an old question. A lot of useful libraries have come out since then.

Give Crawlzone a shot. It is a fast, well-documented, asynchronous internet crawling framework with a lot of great features (a minimal usage sketch follows the list):

  • Asynchronous crawling with customizable concurrency.
  • Automatically throttling crawling speed based on the load of the website you are crawling.
  • If configured, automatically filters out requests forbidden by the robots.txt exclusion standard.
  • Straightforward middleware system allows you to append headers, extract data, filter or plug any custom functionality to process the request and response.
  • Rich filtering capabilities.
  • Ability to set the crawling depth.
  • Easy to extend the core by hooking into the crawling process using events.
  • Shut down the crawler at any time and start over without losing progress.
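
For a sense of the API, here is a minimal sketch, assuming the configuration style from the project's README (the start_uri and concurrency keys are assumptions; verify the exact option names against the current documentation):

use Crawlzone\Client;

// NOTE: config keys below are assumptions based on the README;
// check the Crawlzone docs for the exact option names
$config = [
    'start_uri'   => ['https://example.com/'], // placeholder start page
    'concurrency' => 3,                        // parallel requests
];

$client = new Client($config);
$client->run();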

Also check out the article I wrote about it:

https://www.codementor.io/zstate/this-is-how-i-crawl-n98s6myxm


Nobody mentioned wget as a good starting point?

wget -r --level=10 -nd http://www.mydomain.com/
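
Here -r enables recursive retrieval, --level=10 caps the recursion depth at 10, and -nd saves all files into the current directory instead of recreating the site's directory hierarchy.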

More @ http://www.erichynds.com/ubuntulinux/automatically-crawl-a-website-looking-for-errors/
