Disallow a certain URL in robots.txt [closed]
Closed 10 years ago.
We implemented a rating system on a site a while back that involves a link to a script. However, with the vast majority of ratings on the site at 3/5 and the ratings very even across 1-5, we're beginning to suspect that search engine crawlers and the like are getting through. The URLs used look like this:
http://www.thesite.com/path/to/the/page/rate?uid=abcdefghijk&value=3
When we started, we added the following to our robots.txt:
User-agent: *
Disallow: /rate
Is this incorrect, or are Googlebot and others simply ignoring our robots.txt?
You should use POST for actions that change things, as search engines usually do not submit forms. Additionally, this will prevent users who download your website recursively (e.g. with wget) from submitting tons of votes.
Depending on your site, handling voting through JavaScript might be a solution, too.
Regarding your robots.txt:
It has to be in the root path, i.e. http://www.thesite.com/robots.txt, and because Disallow rules are matched against the beginning of the URL path, if your rating system is at /blah/rate you need to use Disallow: /blah/rate instead of Disallow: /rate
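To see the prefix rule in action, here is a minimal sketch using Python's standard-library `urllib.robotparser`. The helper name `is_allowed` is made up for illustration, and the rules and URL are the ones from the question. Real crawlers may interpret robots.txt slightly differently, so treat this as an approximation of how a compliant crawler would read the file.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_lines, url, agent="*"):
    """Return True if `agent` may fetch `url` under the given robots.txt lines.

    `is_allowed` is a hypothetical helper, not part of any library.
    """
    parser = RobotFileParser()
    parser.parse(robots_lines)  # parse rules from memory instead of fetching them
    return parser.can_fetch(agent, url)


# The rating URL from the question.
RATE_URL = "http://www.thesite.com/path/to/the/page/rate?uid=abcdefghijk&value=3"

# The rule as originally written: only matches paths that START with /rate.
original = ["User-agent: *", "Disallow: /rate"]

# The rule with the full path prefix.
corrected = ["User-agent: *", "Disallow: /path/to/the/page/rate"]

print(is_allowed(original, RATE_URL))   # True: the crawler is NOT blocked
print(is_allowed(corrected, RATE_URL))  # False: the crawler is blocked
```

With the original rule, only URLs such as http://www.thesite.com/rate are blocked, so the rating links stay crawlable.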
Looks incorrect to me. You're only disallowing access to http://www.thesite.com/rate (and pages below it IIRC). Plus some crawlers ignore robots.txt!
Better to make it so that ratings are only ever altered in response to a POST, rather than a GET. Search engines never use POST.
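As a rough illustration of the POST-only approach, here is a minimal WSGI sketch in Python. The site's actual stack is not stated in the question, so everything here is an assumption: the names `rate_app` and `ratings`, the in-memory list used as storage, and the 1-5 validation are all hypothetical. The `uid` and `value` fields mirror the query parameters shown in the question.

```python
from urllib.parse import parse_qs

# Hypothetical in-memory store; a real site would write to its database.
ratings = []


def _respond(start_response, status, body, extra_headers=()):
    """Send a plain-text response and return the WSGI body iterable."""
    headers = [("Content-Type", "text/plain; charset=utf-8"), *extra_headers]
    start_response(status, headers)
    return [body.encode("utf-8")]


def rate_app(environ, start_response):
    """Record a rating only when it arrives as a POST request."""
    if environ["REQUEST_METHOD"] != "POST":
        # Crawlers following plain links issue GET requests and end up here,
        # so merely visiting the URL no longer casts a vote.
        return _respond(start_response, "405 Method Not Allowed",
                        "Ratings must be submitted with POST\n",
                        [("Allow", "POST")])

    # Read the form-encoded body (uid=...&value=...).
    length = int(environ.get("CONTENT_LENGTH") or 0)
    form = parse_qs(environ["wsgi.input"].read(length).decode("utf-8"))

    try:
        uid = form["uid"][0]
        value = int(form["value"][0])
    except (KeyError, ValueError):
        return _respond(start_response, "400 Bad Request",
                        "Missing or invalid fields\n")

    if not 1 <= value <= 5:
        return _respond(start_response, "400 Bad Request",
                        "Rating must be 1-5\n")

    ratings.append((uid, value))
    return _respond(start_response, "200 OK", "Rating recorded\n")
```

For a quick local trial, the app could be served with `wsgiref.simple_server.make_server("", 8000, rate_app).serve_forever()`. The rating links on the page would then need to become small forms (or JavaScript requests) that submit with method POST.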
User-agent: *
Disallow: /path/to/the/page/rate
You have to use the full path.
Might want to read up here a bit: http://www.javascriptkit.com/howto/robots.shtml