How do I protect a site from (Google) caching?
I would like to hide some content from the public (like Google's cached pages). Is that possible?
Add the following HTML tag in the <head> section of your web pages to prevent Google from showing the Cached link for a page:
<META NAME="ROBOTS" CONTENT="noarchive">
Check out Google Webmaster Central | Meta tags to see what other meta tags Google understands.
Option 1: Disable 'Show Cached Site' Link In Google Search Results
If you want to prevent Google from archiving your site, add the following meta tag to your <head> section:
<meta name="robots" content="noarchive">
If your site is already cached by Google, you can request its removal using Google's URL removal tool. For more instructions on how to use this tool, see "Remove a page or site from Google's search results" at Google Webmaster Central.
Option 2: Remove Site From Google Index Completely
Warning! The following method will remove your site from Google's index completely. Use it only if you don't want your site to show up in Google results at all.
To prevent ("protect") your site from getting into Google's cache, you can use robots.txt. For instructions on how to use this file, see "Block or remove pages using a robots.txt file".
In principle, you need to create a file named robots.txt and serve it from your site's root folder (/robots.txt). Sample file content:
User-agent: *
Disallow: /folder1/
User-agent: Googlebot
Disallow: /folder2/
In addition, consider setting the robots meta tag in your HTML documents to noindex ("Using meta tags to block access to your site"):
- To prevent all robots from indexing your site, set
<meta name="robots" content="noindex">
- To selectively block only Google, set
<meta name="googlebot" content="noindex">
Finally, make sure that your settings really work, for instance with Google Webmaster Tools.
robots.txt: http://www.robotstxt.org/
You can use a robots.txt file to request that your page is not indexed. Google and other reputable services will adhere to this, but not all crawlers do.
The only way to make sure that your site content isn't indexed or cached by any search engine or similar service is to prevent access to the site unless the user has a password.
This is most easily achieved using HTTP Basic Auth. If you're using the Apache web server, there are lots of tutorials (example) on how to configure this. A good search term to use is htpasswd.
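A minimal sketch, assuming Apache 2.4 with mod_auth_basic enabled (the paths, username, and realm below are placeholders): first create a password file on the server:

htpasswd -c /etc/apache2/.htpasswd alice

Then protect the folder with an .htaccess file (or a <Directory> block in the server config):

AuthType Basic
AuthName "Restricted Content"
AuthUserFile /etc/apache2/.htpasswd
Require valid-user

Crawlers that cannot authenticate receive a 401 response, so nothing behind this can be indexed or cached.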
A simple way to do this would be with a <meta name="robots" content="noarchive"/> tag in your page's <head>.
You can also achieve a similar effect with the robots.txt file.
For a good explanation, see the official Google blog post on the robots exclusion protocol.
I would like to hide some content from public....
Use a login system to restrict who can view the content.
...(like google cached pages).
Configure robots.txt to deny Googlebot.
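For example, a minimal robots.txt that asks Google's crawler to stay away from the entire site:

User-agent: Googlebot
Disallow: /

Keep in mind this only stops compliant crawlers from fetching pages; it does not hide the content from visitors who have the URL.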
If you want to limit who can see content, secure it behind some form of authentication mechanism (e.g. password protection, even if it is just HTTP Basic Auth).
The specifics of how to implement that would depend on the options provided by your server.
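For instance, with nginx a comparable Basic Auth setup is just a couple of directives (a sketch; the realm name and password-file path are assumptions):

location / {
    auth_basic "Restricted";
    auth_basic_user_file /etc/nginx/.htpasswd;
}

The password file can be generated with the same htpasswd tool mentioned above.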
You can also add this HTTP header to your responses, instead of having to update the HTML files:
X-Robots-Tag: noarchive
e.g. for Apache:
Header set X-Robots-Tag "noarchive"
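One advantage of the header approach is that it also covers non-HTML resources such as PDFs, which cannot carry a meta tag. A sketch for Apache with mod_headers enabled (the .pdf pattern is just an illustration):

<FilesMatch "\.pdf$">
  Header set X-Robots-Tag "noarchive"
</FilesMatch>

You can verify the header is being sent with curl -I https://example.com/some.pdf.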
See also: https://developers.google.com/search/reference/robots_meta_tag?csw=1