
Disallow or Noindex on Subdomain with robots.txt

I have dev.example.com and www.example.com hosted on different subdomains. I want crawlers to drop all records of the dev subdomain but keep them on www. I am using git to store the code for both, so ideally I'd like both sites to use the same robots.txt file.

Is it possible to use one robots.txt file and have it exclude crawlers from the dev subdomain?


You could use Apache rewrite logic to serve a different robots.txt on the development domain:

<IfModule mod_rewrite.c>
    RewriteEngine on
    # serve robots-dev.txt in place of robots.txt on the dev host
    RewriteCond %{HTTP_HOST} ^dev\.example\.com$
    RewriteRule ^robots\.txt$ robots-dev.txt
</IfModule>

And then create a separate robots-dev.txt:

User-agent: *
Disallow: /


Sorry, this is most likely not possible. The general rule is that each sub-domain is treated separately, so each would need its own robots.txt file.

Often subdomains are implemented as subfolders with URL rewriting in place to do the mapping; in that setup you can share a single robots.txt file across subdomains. Here's a good discussion of how to do this: http://www.webmasterworld.com/apache/4253501.htm.
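A rough sketch of that kind of setup for an Apache virtual host, assuming each site lives in a dev/ or www/ subfolder under one document root (the folder layout and host names here are illustrative assumptions, not taken from the linked discussion):

<IfModule mod_rewrite.c>
    RewriteEngine on
    # leave /robots.txt alone so both hosts share the copy in the document root
    RewriteCond %{REQUEST_URI} !^/robots\.txt$
    # skip requests that have already been mapped into a subfolder
    RewriteCond %{REQUEST_URI} !^/(dev|www)/
    # capture the subdomain name; %1 refers to this group in the rule below
    RewriteCond %{HTTP_HOST} ^(dev|www)\.example\.com$ [NC]
    RewriteRule ^/(.*)$ /%1/$1 [L]
</IfModule>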

However, in your case you want different behavior for each subdomain, which is going to require separate files.


Keep in mind that if you block Google from crawling the pages under the dev subdomain, they won't (usually) drop out of the Google index right away. It merely stops Google from re-crawling and re-indexing those pages.

If the dev subdomain isn't launched yet, make sure it has its own robots.txt disallowing everything.

However, if the dev subdomain already has pages indexed, then you need to add the robots noindex meta tag to those pages first (which requires Google to crawl them again in order to see it), and only put the disallow-everything robots.txt on the dev subdomain once the pages have dropped out of the Google index (a Google Webmaster Tools account helps you track this).
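For reference, the noindex request is a meta tag placed in the <head> of each page on the dev site:

<meta name="robots" content="noindex">

The same request can also be sent as an X-Robots-Tag: noindex HTTP response header, which is useful for non-HTML resources.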


"I want Google to drop all of the records of the dev subdomain but keep the www."

If the dev site has already been indexed, returning a 404 or 410 status to crawlers will get the content delisted.
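If the dev site runs on Apache, a minimal sketch of one way to do that with mod_rewrite (the host name is assumed to match the question):

<IfModule mod_rewrite.c>
    RewriteEngine on
    # answer every request on the dev host with 410 Gone
    RewriteCond %{HTTP_HOST} ^dev\.example\.com$
    RewriteRule ^ - [G]
</IfModule>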

"Is it possible to have one robots.txt file that excludes a subdomain?"

If your code is completely static, what you're looking for is the non-standard Host directive (historically only honoured by Yandex):

User-agent: *
Host: www.example.com

But if you can run robots.txt through a templating language, it's possible to keep everything in a single file:

User-agent: *
# unless the ENVIRONMENT variable is set to "production", all robots are disallowed
{{ if eq (getenv "ENVIRONMENT") "production" }}
  Disallow: /admin/
  Disallow:
{{ else }}
  Disallow: /
{{ end }}
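Rendered with ENVIRONMENT=production, the effective robots.txt is:

User-agent: *
Disallow: /admin/
Disallow:

and with any other (or unset) value it collapses to the disallow-everything file:

User-agent: *
Disallow: /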
