Robots.txt and locations that are not referenced
If I want to protect a folder from being crawled by robots that respect the standard, I can disallow it in robots.txt.
Now, the problem I get is that by hiding a folder, I am showing its existence to others.
So, do I have to specify a folder I do not want crawled in robots.txt if there are no links to it? "Good" crawlers only follow links, right? They don't search for folders and files randomly.
Thank you.
Let me assure you, as the author of a "good" web crawler, that if there's something publicly accessible on the Web, a crawler will find it. If you create a folder like http://example.com/hidden_folder and think that by not publishing links to it nobody will find it, you're wrong. It's no better than hiding your house key under the door mat. Although a crawler likely won't go searching for hidden_folder, others will. And when they find it, they'll post a link to it, and my crawler will find that link.
The same sort of thing can happen even if nobody goes looking for your hidden folder. For example, imagine that you have a file http://example.com/hidden_folder/bookmarks.html. In it, you have links to all your favorite sites.
When you click on one of those links (say, joesblog.com), the request your browser sends to joesblog.com includes the referring URL (in the Referer header), which is the HTML file in your "hidden" folder.
You'd be surprised at how many sites publish their access logs. If joesblog is one of them, then somewhere on that site you're going to see a file that says, in effect, "joesblog.com was accessed from http://example.com/hidden_folder/bookmarks.html."
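To make the leak concrete, here is a minimal sketch of the kind of request described above. It only builds the request object and never sends it; the URLs are the illustrative ones from this answer, and using Python's urllib here is my own assumption, not what a browser actually runs.

```python
from urllib.request import Request

# Illustrative URLs from the example above; nothing is sent over the network.
hidden_page = "http://example.com/hidden_folder/bookmarks.html"

# A browser following a link from hidden_page would attach a Referer header
# much like this one to the request it sends to joesblog.com.
req = Request("http://joesblog.com/", headers={"Referer": hidden_page})

# This is the value that could end up in joesblog.com's access log.
referer = req.get_header("Referer")
print(referer)
```

The point is simply that the full path of the "hidden" page travels with the request, so it is up to the receiving site whether that path ever becomes public.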
As others have said, security through obscurity doesn't work. If there's some information on your site that you don't want accessed, then protect it with a password or some other method. Do not assume that crawlers or people won't find it just because you didn't explicitly tell them about it.
Edit:
If you don't list the folders in your robots.txt file, then robots will crawl them, given a link. If you do list the folders, then "good" bots will not crawl. "Bad" bots will crawl regardless.
In my opinion, the likelihood of somebody reading your robots.txt in order to find links to hidden directories is lower than the likelihood of those links being discovered by other means. I would suggest using the solution proposed by @Joachim, which will prevent "good" bots from crawling, and won't reveal the exact directory name.
Also, if you disable directory listing and don't have a default page in your folder, then a bot going to http://example.com/hidden_folder/ won't get anything but an error message saying that the directory contents can't be listed.
Since the Disallow lines in robots.txt are prefixes, you could just mention a prefix to your "hidden" directory that it doesn't share with any "public" directories.
So if your "hidden" directory is called /topsecrete_donotread/ then you could use Disallow: /tops to avoid it being crawled.
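If you want to check the prefix behaviour before relying on it, something like the following sketch should do. It uses Python's standard urllib.robotparser as a stand-in for a "good" bot; the bot name ExampleBot, the helper allowed(), and the /public/ and /topstories/ paths are made up for illustration, and real crawlers may differ in details.

```python
from urllib.robotparser import RobotFileParser

# The robots.txt suggested above: only a prefix of the hidden directory.
robots_txt = """\
User-agent: *
Disallow: /tops
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())


def allowed(path):
    """Return True if a rule-abiding bot may fetch this path (hypothetical helper)."""
    return parser.can_fetch("ExampleBot", "http://example.com" + path)


# The hidden directory matches the /tops prefix, so it is not crawled.
hidden_ok = allowed("/topsecrete_donotread/page.html")
# An unrelated public directory is unaffected.
public_ok = allowed("/public/index.html")
# Any public directory sharing the prefix is blocked as well.
collateral_ok = allowed("/topstories/today.html")

print(hidden_ok, public_ok, collateral_ok)
```

The third check is why the prefix must not be shared with anything you do want crawled: a Disallow prefix blocks every path that starts with it.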