Robots.txt and Google Calendar
I'm looking for the best solution to make sure I'm handling this correctly:
I have a calendar on my website, from which users can take the calendar's iCal feed and import it into an external calendar of their preference (Outlook, iCal, Google Calendar, etc.).
To deter bad actors from crawling/searching my website for the *.ics files, I've set up robots.txt to disallow the folders in which the feeds are stored.
So, essentially, an iCal feed might look like: webcal://www.mysite.com/feeds/cal/a9d90309dafda390d09/feed.ics
I understand that the feed URL is still public. However, I have a function that lets a user change the address of their feed if they want.
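For concreteness, the robots.txt described above would presumably contain something along these lines (a minimal sketch, assuming the feeds live under /feeds/ as in the example URL):

User-agent: *
Disallow: /feeds/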
My question is: all external calendars have no problem importing/subscribing to the calendar feed, except for Google Calendar. It throws the message: "Google was unable to crawl the URL due to a robots.txt restriction" (see Google's answer to this).
Consequently, after searching around, I've found that the following works:
1) Set up a PHP file (which I am using) that essentially forces a download of the file. It basically looks like this:
<?php
// Base directory where the .ics feeds are stored (path truncated here).
$base = "/home/path/to/local/feed/";

// Resolve the requested path and make sure it stays inside $base;
// otherwise a crafted ?url=../../... could read arbitrary files.
$path = realpath($base . $_GET['url']);

if ($path === false || strpos($path, $base) !== 0) {
    header("HTTP/1.0 404 Not Found");
    echo "Unable to open feed file.\n";
    exit;
}

// Tell calendar clients they are receiving an iCalendar file, then stream it.
header("Content-Type: text/calendar; charset=utf-8");
readfile($path);
?>
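The URL handed to Google Calendar then points at this script rather than at the raw file. Assuming, hypothetically, that the script is saved as feed.php at the web root, the subscription URL would look something like:

http://www.mysite.com/feed.php?url=a9d90309dafda390d09/feed.ics

Since the script lives outside the disallowed feeds folder, Googlebot is permitted to fetch it.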
I tried using this script, and it appeared to work with Google Calendar with no issues. (Although I'm not sure whether it updates/refreshes yet; I'm still waiting to see if it does.)
My question is this: is there a better way to approach such an issue? I'd like to keep the current robots.txt in place to disallow crawling of the directories containing the *.ics files and keep the files hidden.
I recently had this problem and this robots.txt works for me.
User-agent: Googlebot
Allow: /*.ics$
Disallow: /
User-agent: *
Disallow: /
This allows access to any .ics file to clients that know the address, and prevents the bots from searching the rest of the site (mine is a private server). You will want to change the Disallow rule for your server; a sketch adapted to the question's layout follows below.
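For the directory layout in the question, that adaptation might look something like this (the /feeds/cal/ path is assumed from the example URL; adjust it to your actual structure):

User-agent: Googlebot
Allow: /feeds/cal/*.ics$
Disallow: /feeds/

User-agent: *
Disallow: /feeds/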
I don't think the Allow directive is part of the original robots.txt specification, but some bots, including Googlebot, appear to support it. Here is Google's Webmaster Tools help page on robots.txt:
http://support.google.com/webmasters/bin/answer.py?hl=en&answer=156449
It looks to me like you have two problems:
- Preventing badly behaved bots from accessing the website.
- After installing robots.txt, allowing Googlebot to access your site.
The first problem cannot be solved by robots.txt. As Marc B points out in the comments, robots.txt is a purely voluntary mechanism. To block bad bots once and for all, I would suggest using some kind of behavior-analysis program or firewall that detects bad bots and denies access from their IPs.
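As one illustration of the "deny by IP" half of that advice, here is a minimal sketch for Apache (the IP range is a placeholder from the documentation block, and this assumes you have already identified the offending addresses by some other means):

# .htaccess: deny requests from IPs you have flagged as bad bots
Order Allow,Deny
Allow from all
Deny from 203.0.113.0/24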
For the second problem, robots.txt does allow you to whitelist a particular bot; check http://facebook.com/robots.txt as an example. Note that Google identifies its bots by different names (for AdSense, search, image search, mobile search), and I am not sure whether the Google Calendar fetcher uses the generic Googlebot name or not.
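A whitelist in the style of facebook.com's robots.txt might look like the sketch below, with one section per crawler name (the Google bot names shown are the commonly documented ones; whether the Calendar fetcher honors any of them is, as noted, unverified):

# An empty Disallow value means the bot may crawl everything.
User-agent: Googlebot
Disallow:

User-agent: Googlebot-Image
Disallow:

User-agent: Mediapartners-Google
Disallow:

User-agent: *
Disallow: /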