
Preventing rogue spiders from indexing a directory

We have a secure website (developed in .NET 2.0/C#, running on Windows Server and IIS 5) to which members have to log in and then they can view some PDF files stored in a virtual directory. To prevent spiders from crawling this website, we have a robots.txt that disallows all user agents from coming in. However, this will NOT prevent rogue spiders from indexing the PDF files, since they will disregard the robots.txt directives. Because the documents are meant to be secure, I do not want ANY spiders getting into this virtual directory (not even the good ones).
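(For reference, a robots.txt that disallows everything for all user agents is just the two lines below; compliant crawlers honor it, rogue ones simply ignore it.)

User-agent: *
Disallow: /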

I have read a few articles on the web and am wondering how programmers (rather than webmasters) have solved this problem in their applications, since it seems like a very common one. There are plenty of options out there, but I am looking for something easy and elegant.

Some options I have seen, which seem weak, are listed here with their cons:

  1. Creating a honeypot/tarpit that lets rogue spiders in and then blacklists their IP address. Cons: this can also block valid users coming from the same IP; we would need to maintain the list manually or give members some way to remove themselves from it. We don't have a range of IPs that valid members will use, since the website is on the internet.

  2. Request header analysis: however, rogue spiders use real agent names, so this is pointless (a sketch of this kind of check is shown after this list).

  3. Meta robots tag. Cons: only obeyed by Google and other valid spiders.
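To illustrate option 2: a User-Agent check is only a few lines, which is exactly why it is so easy to defeat, since the header is entirely client-supplied. A minimal sketch for an ASP.NET page or module (the blocked-agent list is purely hypothetical):

protected bool LooksLikeBlockedAgent() {
    // The User-Agent header can be set to anything by the client,
    // so a blacklist like this is a speed bump at best.
    string userAgent = Request.UserAgent ?? string.Empty;
    string[] blockedAgents = { "wget", "curl", "HTTrack" };

    foreach (string agent in blockedAgents) {
        if (userAgent.IndexOf(agent, StringComparison.OrdinalIgnoreCase) >= 0) {
            return true;
        }
    }
    return false;
}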

There has been some talk about using .htaccess, which is supposed to be good, but that only works with Apache, not IIS.

Any suggestions are very much appreciated.

EDIT: as 9000 pointed out below, rogue spiders should not be able to get into a page that requires a login. I guess the question is 'how to prevent someone who knows the link from requesting the PDF file without logging into the website'.


I see a contradiction between

members have to log in and then they can view some PDF files stored in a virtual directory

and

this will NOT prevent rogue spiders from indexing the PDF files

How come any unauthorized HTTP request to this directory is ever served with anything other than a 401? The rogue spiders certainly can't provide an authorization cookie. And if the directory is accessible to them, what does 'member login' even mean?

You probably need to serve the PDF files via a script that checks authorization. I think IIS can also require authorization just for access to a directory (but I don't know that for certain).
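If you do want a directory-level rule, ASP.NET can express it with a <location> element in web.config. A minimal sketch, assuming Forms authentication and a directory named 'securepdfs' (the name is illustrative); note that on IIS 5/6 static files such as PDFs bypass ASP.NET entirely unless the .pdf extension is mapped to aspnet_isapi.dll, so this alone will not protect them:

<configuration>
  <!-- Deny anonymous users access to the PDF directory.
       Only applies to requests that actually reach ASP.NET. -->
  <location path="securepdfs">
    <system.web>
      <authorization>
        <deny users="?" />  <!-- "?" means anonymous (not logged in) users -->
      </authorization>
    </system.web>
  </location>
</configuration>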


I assume that your links to PDFs come from a known location. You can check the Request.UrlReferrer to make sure users are coming from this internal / known page to access the PDFs.

I would definitely force downloads to go through a script where you can check that a user is in fact logged in to the site before allowing the download.

protected void getFile(string fileName) {

    /*
        CHECK AUTH / REFERER HERE
    */

    string filePath = Request.PhysicalApplicationPath + "hidden_PDF_directory/" + fileName;
    System.IO.FileInfo fileInfo = new System.IO.FileInfo(filePath);

    if (fileInfo.Exists) {
        Response.Clear();
        Response.AddHeader("Content-Disposition", "attachment; filename=" + fileInfo.Name);
        Response.AddHeader("Content-Length", fileInfo.Length.ToString());
        Response.ContentType = "application/pdf";
        Response.WriteFile(fileInfo.FullName);
        Response.End();
    } else {
        /*
            ERROR
        */
    }
}

Untested, but this should give you an idea at least.
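To fill in the CHECK AUTH / REFERER placeholder, here is a minimal sketch, assuming Forms authentication and a hypothetical viewer page named Documents.aspx (both of those are assumptions, not anything from the question):

    // Reject anyone who has not logged in.
    if (!Request.IsAuthenticated) {
        Response.StatusCode = 401;
        Response.End();
    }

    // Optionally require that the request came from our own viewer page.
    // The Referer header is client-supplied and easily forged, so treat
    // this as a speed bump rather than real security.
    Uri referrer = Request.UrlReferrer;
    if (referrer == null ||
        referrer.AbsolutePath.IndexOf("Documents.aspx", StringComparison.OrdinalIgnoreCase) < 0) {
        Response.StatusCode = 403;
        Response.End();
    }

If you are not using Forms authentication, swap the IsAuthenticated check for whatever session flag your login code sets.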

I'd also stay away from robots.txt since people will often use this to actually look for things you think you're hiding.


Here is what I did (expanding on Leigh's code).

  1. Created an HttpHandler for PDF files, created a web.config in the secure directory, and configured the handler to handle PDFs.

  2. In the handler, I check whether the user is logged in, using a session variable set by the application.

  3. If the user has the session variable, I create a FileInfo object and send it in the response. Note: don't call 'context.Response.End()'; also, the 'Content-Disposition' header is obsolete. (A rough sketch of the handler and its web.config mapping follows this list.)
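A rough sketch of what steps 1-3 might look like; the class name, the 'LoggedIn' session key, and the assumption that the class lives in App_Code are all illustrative, not the exact code:

using System.IO;
using System.Web;
using System.Web.SessionState;

// Serves PDFs from the secure directory only to logged-in users.
// IRequiresSessionState is required so the handler can read Session.
public class SecurePdfHandler : IHttpHandler, IRequiresSessionState {

    public bool IsReusable {
        get { return true; }
    }

    public void ProcessRequest(HttpContext context) {
        // "LoggedIn" stands in for whatever flag your login code puts in Session.
        if (context.Session == null || context.Session["LoggedIn"] == null) {
            context.Response.StatusCode = 401;
            context.Response.ContentType = "text/plain";
            context.Response.Write("Please log in to view this document.");
            return;
        }

        FileInfo fileInfo = new FileInfo(context.Request.PhysicalPath);
        if (!fileInfo.Exists) {
            context.Response.StatusCode = 404;
            return;
        }

        context.Response.ContentType = "application/pdf";
        context.Response.AddHeader("Content-Length", fileInfo.Length.ToString());
        context.Response.WriteFile(fileInfo.FullName);
        // Deliberately no context.Response.End() here (see the note above).
    }
}

And the web.config placed in the secure directory (on IIS 5/6 the .pdf extension must also be mapped to aspnet_isapi.dll in the IIS script mappings, otherwise ASP.NET, and therefore the handler, never sees the request):

<configuration>
  <system.web>
    <httpHandlers>
      <!-- Route GET requests for *.pdf in this directory through the handler.
           Assumes SecurePdfHandler is compiled from App_Code. -->
      <add verb="GET" path="*.pdf" type="SecurePdfHandler" />
    </httpHandlers>
  </system.web>
</configuration>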

So now, whenever there is a request for a PDF in the secure directory, the HTTP handler gets the request and checks whether the user is logged in. If not, it displays an error message; otherwise it serves the file.

I'm not sure if there is a performance hit, since I am creating a FileInfo object and sending that rather than sending the file that already exists. The thing is that you can't Server.Transfer or Response.Redirect to the *.pdf file, since that creates an infinite loop and the response never gets returned to the user.
