Stopping Google's crawl of my site
Google has started crawling my site, but on a temporary domain (beta.mydomain instead of just mydomain), and I also want it to crawl only some of my pages. I therefore want to stop the current crawl and let Google crawl only the pages I specify in a sitemap. How can I do that? (I know how to add a sitemap, but how do I stop the current crawling and request that only the sitemap pages be crawled?)
Update: If I kill the subdomain beta.mydomain, will that be "fine" with Google, or will it keep going over all the killed pages and "not like" them? Can I specify that in each page's header?
Create a single text file called 'robots.txt' in the root folder for your site. Inside...
User-agent: *
Disallow: /thisfolder/
Disallow: /foo.html
Disallow: /andthisfoldertoo/
Disallow: /andthisfile.html
I use this for project files. In fact, as I write this I think I'll change the way I work on projects and always put them in a sub-directory called /projects/project1/ so one line will do...
Disallow: /projects/
AND I also add a line for my image files. I don't like my images all over the web...
Disallow: /imgs/
You could start with a robots.txt file.
See Google's info here.
From what you say, I presume you have already looked at Webmaster Tools and sitemaps? Do be aware that while a sitemap will help tell Google WHAT to crawl, it won't work very well for telling them what NOT to crawl.
For that you will want to use the robots.txt file to block certain pages / folders.
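For example (the /public/ path and the sitemap URL below are placeholders, adjust them to your own site), a robots.txt along these lines blocks everything except one folder and points Google at your sitemap; Googlebot honours Allow with longest-match precedence, so the allowed folder stays crawlable:
User-agent: *
# allow only the section you want indexed (placeholder path)
Allow: /public/
# block everything else
Disallow: /
# the sitemap listing the pages you do want crawled (placeholder URL)
Sitemap: https://www.mydomain.com/sitemap.xml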
Use a robots.txt; see this site.
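If you also want the per-page control mentioned in the update, a robots meta tag in each page's head (or the equivalent X-Robots-Tag HTTP response header, assuming you can set response headers on your server) should keep that individual page out of Google's index:
<meta name="robots" content="noindex">
or, as a response header:
X-Robots-Tag: noindex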