
Varying response from file_get_contents for 'https://en.wikipedia.org/wiki/Category:Upcoming_singles'

file_get_contents('https://en.wikipedia.org/wiki/Category:Upcoming_singles');

returns a different response (2 results) from visiting the same address in the Chrome web browser (which shows 4 results).

Upon inspection, I suspect this is related to the comment

Saved in parser cache key with ... timestamp ...

in the returned HTML. The timestamp is older when I use file_get_contents().

Any ideas on how to fetch the latest info using file_get_contents()?

Thank you!


Assuming file_get_contents() is making an HTTP request, it would be good to check the user agent being sent.

I've heard of problems fetching data with some user agents. Take a look at this question.

You can specify other options (including the user agent) by using a stream context:

<?php
$opts = array(
  'http'=>array(
    'method'=>"GET",
    'header'=>"Accept-language: en\r\n" .
              "Cookie: foo=bar\r\n"
  )
);

$context = stream_context_create($opts);

// Open the file using the HTTP headers set above
$file = file_get_contents('http://www.example.com/', false, $context);

Take a look at the file_get_contents docs.

Also, as Jack said, cURL is a better option.

EDIT:

Perhaps I wasn't clear: what you have to add is a different user agent. For example, using the user agent from Mozilla Firefox gets you the 4 results:

<?php

    $opts = array(
      'http'=>array(
        'method'=>"GET",
        'header'=>"Accept-language: en\r\n" .
                  "User-Agent: Mozilla/5.0 (X11; U; Linux x86_64; es-AR; rv:1.9.2.23) Gecko/20110921 Ubuntu/10.10 (maverick) Firefox/3.6.23"
      )
    );

    $context = stream_context_create($opts);

    // Open the file using the HTTP headers set above
    $file = file_get_contents('http://en.wikipedia.org/wiki/Category:Upcoming_singles', false, $context);
    print $file;

That said, I don't think it's quite legitimate to spoof a browser like this. There must be a proper way for outside apps to identify themselves when fetching Wikipedia's data.


In any case, you really should be using the MediaWiki API instead of trying to screen-scrape the information from the human-readable category page. For example, try this query using list=categorymembers.

Some notes:

  • Choose the appropriate results format (which, for PHP, is probably format=php).
  • The default limit is 10 results per query, but you can increase it up to 500 with cmlimit=max. After that, you'll need to use the query continuation mechanism.
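Putting those notes together, here is a minimal sketch of querying `list=categorymembers` with `file_get_contents()`. It assumes `format=json` (decoded with `json_decode()`) rather than `format=php`, and the bot name and contact address are placeholders you should replace with your own:

```php
<?php
// Sketch: fetch category members via the MediaWiki API instead of scraping.
$url = 'https://en.wikipedia.org/w/api.php?' . http_build_query(array(
    'action'  => 'query',
    'list'    => 'categorymembers',
    'cmtitle' => 'Category:Upcoming_singles',
    'cmlimit' => 'max',   // up to 500 results per request for anonymous clients
    'format'  => 'json',
));

// Identify yourself per the Wikimedia User-Agent policy (placeholder contact).
$context = stream_context_create(array(
    'http' => array(
        'method' => 'GET',
        'header' => "User-Agent: MyCategoryBot/1.0 (contact: you@example.com)\r\n",
    ),
));

$data = json_decode(file_get_contents($url, false, $context), true);
foreach ($data['query']['categorymembers'] as $page) {
    echo $page['title'], "\n";
}
```

If the response contains a `continue` element, repeat the request with those extra parameters appended to page through the remaining results.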

You can also use one of the existing MediaWiki API client libraries to take care of these and other little details for you.

And finally, please play nice with the Wikimedia servers: don't send multiple simultaneous queries, and cache the results locally if you're going to need them again any time soon. It's a good idea to include your contact information (a URL or an e-mail address) in the User-Agent header, so that Wikimedia's sysadmins can easily contact you if your code is causing excessive server load.


Per the Wikimedia User-Agent policy it is required that all requests identify themselves. I would strongly recommend against faking a browser user-agent. There is no need for that.

Millions of machines access Wikipedia and other Wikimedia Foundation projects all the time. Just identify yourself and your script; it's not hard!

// Identify your bot, script, or company.
// E.g. Link to a website, or provide an e-mail address.
ini_set( 'user_agent', 'MyBot/1.0; John Doe (contact: info@example.org)' );

// Open the file using the HTTP headers set above
$contents = file_get_contents( 'http://en.wikipedia.org/wiki/Sandbox' );
echo $contents;


Try using cURL and setting a header to get the latest info rather than a cached copy (sorry, I can't remember the exact header to set).
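For what it's worth, the header in question is probably `Cache-Control: no-cache`. A minimal sketch with cURL, assuming a placeholder User-Agent you'd replace with your own contact details:

```php
<?php
// Sketch: fetch a page with cURL, asking caches not to serve a stale copy.
$ch = curl_init('https://en.wikipedia.org/wiki/Category:Upcoming_singles');
curl_setopt_array($ch, array(
    CURLOPT_RETURNTRANSFER => true,   // return the body instead of printing it
    CURLOPT_FOLLOWLOCATION => true,   // follow any redirects
    CURLOPT_USERAGENT      => 'MyBot/1.0 (contact: info@example.org)', // placeholder
    CURLOPT_HTTPHEADER     => array(
        'Cache-Control: no-cache',    // ask for a fresh copy
        'Pragma: no-cache',           // legacy equivalent for older proxies
    ),
));
$html = curl_exec($ch);
curl_close($ch);
```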

