I need to write a web crawler for specific user agent

2023-03-05 17:50 问答作者：

I need to write a web crawler, and want to be able to crawl using a known user agent. For example, I want my crawler to act as an iphone to crawl t开发者_运维知识库he mobile site of a website, then crawl again using Mozilla PC agent, etc.

That way, Ill be able to crawl every "type" of site (mobile & PC). However, I also want to be able to set my crawler's user agent, so webmasters also see in their stats that it's a crawler that visited their whole website, not real users.

So my question is, do you guys know how to set a mobile agent + a crawler agent at the same time, in PHP? Is it even possible?

Please refer to RFC1945 for how a User Agent should be formed:

10.15 User-Agent

The User-Agent request-header field contains information about the user agent originating the request. This is for statistical purposes, the tracing of protocol violations, and automated recognition of user agents for the sake of tailoring responses to avoid particular user agent limitations. Although it is not required, user agents should include this field with requests. The field can contain multiple product tokens (Section 3.7) and comments identifying the agent and any subproducts which form a significant part of the user agent. By convention, the product tokens are listed in order of their significance for identifying the application.
 User-Agent     = "User-Agent" ":" 1*( product | comment )
Example:
  User-Agent: CERN-LineMode/2.15 libwww/2.17b3

So what you put there is more or less up to you. You could pose to be a GoogleBot-Mobile:

https://www.google.com/support/webmasters/bin/answer.py?answer=1061943

or pose as an iPhone and add your own stuff

Mozilla/5.0 (iPhone; U; CPU iPhone OS) (compatible; MyBot/1.0; +http://about.my/bot")

    function crawl($url){

        $headers[]  = "User-Agent:Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2.13) Gecko/20101203 Firefox/3.6.13"; // <-- this is user agent
        $headers[]  = "Accept:text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8";
        $headers[]  = "Accept-Language:en-us,en;q=0.5";
        $headers[]  = "Accept-Encoding:gzip,deflate";
        $headers[]  = "Accept-Charset:ISO-8859-1,utf-8;q=0.7,*;q=0.7";
        $headers[]  = "Keep-Alive:115";
        $headers[]  = "Connection:keep-alive";
        $headers[]  = "Cache-Control:max-age=0";

        $curl = curl_init();
        curl_setopt($curl, CURLOPT_URL, $url);
        curl_setopt($curl, CURLOPT_HTTPHEADER, $headers);
        curl_setopt($curl, CURLOPT_ENCODING, "gzip");
        curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
        $data = curl_exec($curl);
        curl_close($curl);
        return $data;

    }

echo crawl("http://www.google.com"); // revenge

You can always use eg http://m.facebook.com/ w/o user agent, although most of websites redirect user to proper content by reading user agent.

继续阅读：php web-crawler

I need to write a web crawler for specific user agent

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？