How do you disallow crawling on origin server and yet have the robots.txt propagate properly?

2023-03-04 06:49 问答作者：

I've come across a rather unique issue. If you deal with scaling large sites and work with a company like Akamai, you have origin servers that Akamai talks to. Whatever you serve to Akamai, they will propaga开发者_开发问答te on their cdn.

But how do you handle robots.txt? You don't want Google to crawl your origin. That can be a HUGE security issue. Think denial of service attacks.

But if you serve a robots.txt on your origin with "disallow", then your entire site will be uncrawlable!

The only solution I can think of is to serve a different robots.txt to Akamai and to the world. Disallow to the world, but allow to Akamai. But this is very hacky and prone to so many issues that I cringe thinking about it.

(Of course, origin servers shouldn't be viewable to the public, but I'd venture to say most are for practical reasons...)

It seems an issue the protocol should be handling better. Or perhaps allow a site-specific, hidden robots.txt in the Search Engine's webmaster tools...

Thoughts?

If you really want your origins not to be public, use a firewall / access control to restrict access for any host other than Akamai - it's the best way to avoid mistakes and it's the only way to stop the bots & attackers who simply scan public IP ranges looking for webservers.

That said, if all you want is to avoid non-malicious spiders, consider using a redirect on your origin server which redirects any requests which don't have a Host header specifying your public hostname to the official name. You generally want something like that anyway to avoid issues with confusion or search rank dilution if you have variations of the canonical hostname. With Apache this could use mod_rewrite or even a simple virtualhost setup where the default server has RedirectPermanent / http://canonicalname.example.com/.

If you do use this approach, you could either simply add the production name to your test systems' hosts file when necessary or also create and whitelist an internal-only hostname (e.g. cdn-bypass.mycorp.com) so you can access the origin directly when you need to.

继续阅读：akamai cdn robots.txt

How do you disallow crawling on origin server and yet have the robots.txt propagate properly?

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？