
Varying response from file_get_contents for 'https://en.wikipedia.org/wiki/Category:Upcoming_singles'

file_get_contents('https://en.wikipedia.org/wiki/Category:Upcoming_singles');

returns a different response (2 results) from visiting the same address in the Chrome web browser (which shows 4 results).

Upon inspection, I suspect this is related to the comment

Saved in parser cache key with ... timestamp ...

in the returned HTML. The timestamp is older when I use file_get_contents().

Any ideas on how to fetch the latest info using file_get_contents()?

Thank you!


Assuming file_get_contents() is making an HTTP request, it would be good to check the user agent being sent.

I've heard of problems fetching data with some user agents. Take a look at this question.

You can specify other options (including the user agent) by using a stream context:

<?php
$opts = array(
  'http'=>array(
    'method'=>"GET",
    'header'=>"Accept-language: en\r\n" .
              "Cookie: foo=bar\r\n"
  )
);

$context = stream_context_create($opts);

// Open the file using the HTTP headers set above
$file = file_get_contents('http://www.example.com/', false, $context);

Take a look at the file_get_contents docs.

Also, as Jack said, cURL is a better option.

EDIT:

Perhaps I wasn't clear: what you have to add is a different user agent. For example, using the user agent from Mozilla Firefox gets you the 4 results:

<?php

    $opts = array(
      'http'=>array(
        'method'=>"GET",
        'header'=>"Accept-language: en\r\n" .
                  "User-Agent: Mozilla/5.0 (X11; U; Linux x86_64; es-AR; rv:1.9.2.23) Gecko/20110921 Ubuntu/10.10 (maverick) Firefox/3.6.23"
      )
    );

    $context = stream_context_create($opts);

    // Open the file using the HTTP headers set above
    $file = file_get_contents('http://en.wikipedia.org/wiki/Category:Upcoming_singles', false, $context);
    print $file;

That said, I don't think it's quite legitimate to spoof a browser like this. There must be a proper way for outside apps to identify themselves when fetching Wikipedia's data.


In any case, you really should be using the MediaWiki API instead of trying to screen-scrape the information from the human-readable category page. For example, try this query using list=categorymembers.

Some notes:

  • Choose the appropriate results format (which, for PHP, is probably format=php).
  • The default limit is 10 results per query, but you can increase it up to 500 with cmlimit=max. After that, you'll need to use the query continuation mechanism.
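Putting those notes together, here is a minimal sketch of querying `list=categorymembers` with `file_get_contents()`. It assumes `format=json` (decoded with `json_decode()`) rather than `format=php`, and the bot name and contact address are placeholders you should replace with your own:

```php
<?php
// Sketch: fetch category members via the MediaWiki API instead of scraping.
$url = 'https://en.wikipedia.org/w/api.php?' . http_build_query(array(
    'action'  => 'query',
    'list'    => 'categorymembers',
    'cmtitle' => 'Category:Upcoming_singles',
    'cmlimit' => 'max',   // up to 500 results per request for anonymous clients
    'format'  => 'json',
));

// Identify yourself per the Wikimedia User-Agent policy (placeholder contact).
$context = stream_context_create(array(
    'http' => array(
        'method' => 'GET',
        'header' => "User-Agent: MyCategoryBot/1.0 (contact: you@example.com)\r\n",
    ),
));

$data = json_decode(file_get_contents($url, false, $context), true);
foreach ($data['query']['categorymembers'] as $page) {
    echo $page['title'], "\n";
}
```

If the response contains a `continue` element, repeat the request with those extra parameters appended to page through the remaining results.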

You can also use one of the existing MediaWiki API client libraries to take care of these and other little details for you.

And finally, please play nice with the Wikimedia servers: don't send multiple simultaneous queries, and cache the results locally if you're going to need them again any time soon. It's a good idea to include your contact information (a URL or an e-mail address) in the User-Agent header, so that Wikimedia's sysadmins can easily contact you if your code is causing excessive server load.


Per the Wikimedia User-Agent policy it is required that all requests identify themselves. I would strongly recommend against faking a browser user-agent. There is no need for that.

Millions of machines access Wikipedia and other Wikimedia Foundation projects all the time. Just identify yourself and your script; it's not hard!

// Identify your bot, script, or company.
// E.g. Link to a website, or provide an e-mail address.
ini_set( 'user_agent', 'MyBot/1.0; John Doe (contact: info@example.org)' );

// Open the file using the HTTP headers set above
$contents = file_get_contents( 'http://en.wikipedia.org/wiki/Sandbox' );
echo $contents;


Try using cURL and setting a header to get the latest info rather than a cached copy (sorry, I can't remember the exact header to set).
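For what it's worth, the header in question is probably `Cache-Control: no-cache`. A minimal sketch with cURL, assuming a placeholder User-Agent you'd replace with your own contact details:

```php
<?php
// Sketch: fetch a page with cURL, asking caches not to serve a stale copy.
$ch = curl_init('https://en.wikipedia.org/wiki/Category:Upcoming_singles');
curl_setopt_array($ch, array(
    CURLOPT_RETURNTRANSFER => true,   // return the body instead of printing it
    CURLOPT_FOLLOWLOCATION => true,   // follow any redirects
    CURLOPT_USERAGENT      => 'MyBot/1.0 (contact: info@example.org)', // placeholder
    CURLOPT_HTTPHEADER     => array(
        'Cache-Control: no-cache',    // ask for a fresh copy
        'Pragma: no-cache',           // legacy equivalent for older proxies
    ),
));
$html = curl_exec($ch);
curl_close($ch);
```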

