
Preventing rogue spiders from indexing a directory

We have a secure website (developed in .NET 2.0/C#, running on Windows Server and IIS 5) to which members have to log in and then they can view some PDF files stored in a virtual directory. To prevent spiders from crawling this website, we have a robots.txt that disallows all user agents from coming in. However, this will NOT prevent rogue spiders from indexing the PDF files, since they will disregard the robots.txt directives. Because the documents are meant to be secure, I do not want ANY spiders getting into this virtual directory (not even the good ones).
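(For reference, a robots.txt that disallows everything for all user agents is just the two lines below; compliant crawlers honor it, rogue ones simply ignore it.)

User-agent: *
Disallow: /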

I have read a few articles on the web and am wondering how programmers (rather than webmasters) have solved this problem in their applications, since it seems like a very common one. There are plenty of options out there, but I am looking for something easy and elegant.

Some options I have seen, which seem weak, are listed here with their cons:

  1. Creating a honeypot/tarpit that lets rogue spiders in and then blacklists their IP address. Cons: this can also block valid users coming from the same IP; we would need to maintain the list manually or give members some way to remove themselves from it. We don't have a range of IPs that valid members will use, since the website is on the internet.

  2. Request header analysis: however, rogue spiders use real agent names, so this is pointless (a sketch of this kind of check is shown after this list).

  3. Meta robots tag. Cons: only obeyed by Google and other valid spiders.
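To illustrate option 2: a User-Agent check is only a few lines, which is exactly why it is so easy to defeat, since the header is entirely client-supplied. A minimal sketch for an ASP.NET page or module (the blocked-agent list is purely hypothetical):

protected bool LooksLikeBlockedAgent() {
    // The User-Agent header can be set to anything by the client,
    // so a blacklist like this is a speed bump at best.
    string userAgent = Request.UserAgent ?? string.Empty;
    string[] blockedAgents = { "wget", "curl", "HTTrack" };

    foreach (string agent in blockedAgents) {
        if (userAgent.IndexOf(agent, StringComparison.OrdinalIgnoreCase) >= 0) {
            return true;
        }
    }
    return false;
}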

There has been some talk about using .htaccess, which is supposed to be good, but that only works with Apache, not IIS.

Any suggestions are very much appreciated.

EDIT: as 9000 pointed out below, rogue spiders should not be able to get into a page that requires a login. I guess the question is 'how to prevent someone who knows the link from requesting the PDF file without logging into the website'.


I see a contradiction between

members have to log in and then they can view some PDF files stored in a virtual directory

and

this will NOT prevent rogue spiders from indexing the PDF files

How come any unauthorized HTTP request to this directory is ever served with anything other than a 401? The rogue spiders certainly can't provide an authorization cookie. And if the directory is accessible to them, what does 'member login' even mean?

You probably need to serve the PDF files via a script that checks authorization. I think IIS can also require authorization just for access to a directory (but I don't know that for certain).
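If you do want a directory-level rule, ASP.NET can express it with a <location> element in web.config. A minimal sketch, assuming Forms authentication and a directory named 'securepdfs' (the name is illustrative); note that on IIS 5/6 static files such as PDFs bypass ASP.NET entirely unless the .pdf extension is mapped to aspnet_isapi.dll, so this alone will not protect them:

<configuration>
  <!-- Deny anonymous users access to the PDF directory.
       Only applies to requests that actually reach ASP.NET. -->
  <location path="securepdfs">
    <system.web>
      <authorization>
        <deny users="?" />  <!-- "?" means anonymous (not logged in) users -->
      </authorization>
    </system.web>
  </location>
</configuration>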


I assume that your links to PDFs come from a known location. You can check the Request.UrlReferrer to make sure users are coming from this internal / known page to access the PDFs.

I would definitely force downloads to go through a script where you can check that a user is in fact logged in to the site before allowing the download.

protected void getFile(string fileName) {

    /*
        CHECK AUTH / REFERER HERE
    */

    string filePath = Request.PhysicalApplicationPath + "hidden_PDF_directory/" + fileName;
    System.IO.FileInfo fileInfo = new System.IO.FileInfo(filePath);

    if (fileInfo.Exists) {
        Response.Clear();
        Response.AddHeader("Content-Disposition", "attachment; filename=" + fileInfo.Name);
        Response.AddHeader("Content-Length", fileInfo.Length.ToString());
        Response.ContentType = "application/pdf";
        Response.WriteFile(fileInfo.FullName);
        Response.End();
    } else {
        /*
            ERROR
        */
    }
}

Untested, but this should give you an idea at least.
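To fill in the CHECK AUTH / REFERER placeholder, here is a minimal sketch, assuming Forms authentication and a hypothetical viewer page named Documents.aspx (both of those are assumptions, not anything from the question):

    // Reject anyone who has not logged in.
    if (!Request.IsAuthenticated) {
        Response.StatusCode = 401;
        Response.End();
    }

    // Optionally require that the request came from our own viewer page.
    // The Referer header is client-supplied and easily forged, so treat
    // this as a speed bump rather than real security.
    Uri referrer = Request.UrlReferrer;
    if (referrer == null ||
        referrer.AbsolutePath.IndexOf("Documents.aspx", StringComparison.OrdinalIgnoreCase) < 0) {
        Response.StatusCode = 403;
        Response.End();
    }

If you are not using Forms authentication, swap the IsAuthenticated check for whatever session flag your login code sets.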

I'd also stay away from robots.txt since people will often use this to actually look for things you think you're hiding.


Here is what I did (expanding on Leigh's code).

  1. Created an HttpHandler for PDF files, created a web.config in the secure directory, and configured the handler to handle PDFs.

  2. In the handler, I check whether the user is logged in, using a session variable set by the application.

  3. If the user has the session variable, I create a FileInfo object and send it in the response. Note: don't call 'context.Response.End()'; also, the 'Content-Disposition' header is obsolete. (A rough sketch of the handler and its web.config mapping follows this list.)
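A rough sketch of what steps 1-3 might look like; the class name, the 'LoggedIn' session key, and the assumption that the class lives in App_Code are all illustrative, not the exact code:

using System.IO;
using System.Web;
using System.Web.SessionState;

// Serves PDFs from the secure directory only to logged-in users.
// IRequiresSessionState is required so the handler can read Session.
public class SecurePdfHandler : IHttpHandler, IRequiresSessionState {

    public bool IsReusable {
        get { return true; }
    }

    public void ProcessRequest(HttpContext context) {
        // "LoggedIn" stands in for whatever flag your login code puts in Session.
        if (context.Session == null || context.Session["LoggedIn"] == null) {
            context.Response.StatusCode = 401;
            context.Response.ContentType = "text/plain";
            context.Response.Write("Please log in to view this document.");
            return;
        }

        FileInfo fileInfo = new FileInfo(context.Request.PhysicalPath);
        if (!fileInfo.Exists) {
            context.Response.StatusCode = 404;
            return;
        }

        context.Response.ContentType = "application/pdf";
        context.Response.AddHeader("Content-Length", fileInfo.Length.ToString());
        context.Response.WriteFile(fileInfo.FullName);
        // Deliberately no context.Response.End() here (see the note above).
    }
}

And the web.config placed in the secure directory (on IIS 5/6 the .pdf extension must also be mapped to aspnet_isapi.dll in the IIS script mappings, otherwise ASP.NET, and therefore the handler, never sees the request):

<configuration>
  <system.web>
    <httpHandlers>
      <!-- Route GET requests for *.pdf in this directory through the handler.
           Assumes SecurePdfHandler is compiled from App_Code. -->
      <add verb="GET" path="*.pdf" type="SecurePdfHandler" />
    </httpHandlers>
  </system.web>
</configuration>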

So now, whenever there is a request for a PDF in the secure directory, the HTTP handler gets the request and checks whether the user is logged in. If not, it displays an error message; otherwise it serves the file.

I'm not sure if there is a performance hit, since I am creating a FileInfo object and sending that rather than sending the file that already exists. The thing is that you can't Server.Transfer or Response.Redirect to the *.pdf file, since that creates an infinite loop and the response never gets returned to the user.
