WebRequest "HEAD" lightweight alternative

I recently discovered that the following does not work with certain sites, such as IMDB.com.

using System;
using System.Net;

class Program
{
    static void Main(string[] args)
    {
        try
        {
            System.Net.WebRequest wc = System.Net.WebRequest.Create("http://www.imdb.com"); //args[0]);

            ((HttpWebRequest)wc).UserAgent = "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) AppleWebKit/525.19 (KHTML, like Gecko) Chrome/0.2.153.1 Safari/525.19";
            wc.Timeout = 1000;
            wc.Method = "HEAD";
            WebResponse res = wc.GetResponse();
            var streamReader = new System.IO.StreamReader(res.GetResponseStream());

            Console.WriteLine(streamReader.ReadToEnd());
        }
        catch (Exception ex)
        {
            Console.WriteLine(ex.Message);
        }
    }
}

It returns an HTTP 405 (Method Not Allowed). My problem is that I use code very similar to the above to check whether a link is valid, and the vast majority of the time it works correctly. I can switch the method to GET and it works (with an increased timeout), but this slows things down by an order of magnitude. I assume the 405 response comes from how IMDB's server is configured.

Is there a way to do the same thing as above in a lightweight manner in .NET? Or is there a way to fix the above code so that it works as a GET request that works with IMDB?


Open the connection yourself with a socket (instead of an HttpRequest or WebClient), and close the stream as soon as you've read the status code. Fortunately the status code comes near the top of the response stream :)
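
A rough sketch of that idea, assuming plain HTTP on port 80 (an HTTPS site would need an SslStream wrapped around the connection). The class and method names here (RawStatusProbe, ParseStatusCode, GetStatusCode) are made up for illustration, and redirects are not followed:

```csharp
using System;
using System.IO;
using System.Net.Sockets;
using System.Text;

static class RawStatusProbe
{
    // Extracts the numeric code from a status line such as "HTTP/1.1 200 OK".
    // Returns -1 if the line does not look like an HTTP status line.
    public static int ParseStatusCode(string statusLine)
    {
        if (statusLine == null) return -1;
        string[] parts = statusLine.Split(' ');
        int code;
        if (parts.Length >= 2
            && parts[0].StartsWith("HTTP/", StringComparison.Ordinal)
            && int.TryParse(parts[1], out code))
        {
            return code;
        }
        return -1;
    }

    // Sends a minimal GET and reads only the first line of the response.
    // The timeouts apply to sending and receiving, not to the connect itself.
    public static int GetStatusCode(string host, string path, int timeoutMs)
    {
        using (var client = new TcpClient())
        {
            client.SendTimeout = timeoutMs;
            client.ReceiveTimeout = timeoutMs;
            client.Connect(host, 80);

            using (NetworkStream stream = client.GetStream())
            {
                string request = "GET " + path + " HTTP/1.1\r\n" +
                                 "Host: " + host + "\r\n" +
                                 "Connection: close\r\n\r\n";
                byte[] bytes = Encoding.ASCII.GetBytes(request);
                stream.Write(bytes, 0, bytes.Length);

                using (var reader = new StreamReader(stream, Encoding.ASCII))
                {
                    // The status line is the first line; stop reading after it.
                    return ParseStatusCode(reader.ReadLine());
                }
            }
        }
    }
}
```

Something like RawStatusProbe.GetStatusCode("www.imdb.com", "/", 1000) would then give you the status without pulling down the body. Disposing the client right after the first line is what keeps it cheap.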


You'll have to clarify what you mean by "lightweight". What are you trying to accomplish?

Whether or not you can use GET/POST/HEAD/DELETE/etc will depend on the URL and what's configured in the application that is running on the server at that URL.

If all you're trying to do is see whether you can make a connection without actually downloading the content, you could try just opening a connection to port 80 using sockets, but there isn't really a reliable or universally supported way to do it just by changing the HTTP method.


If HEAD returns a 405, that means the server doesn't support HEAD (at least for that URL) and you'll have to fall back to GET instead. The majority of sites should support HEAD, so you probably want to do HEAD by default, but if it returns a 405, you could fall back to GET for that domain. Or maybe you want to try HEAD first for each request; YMMV.
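
Something along these lines might do for the HEAD-then-GET fallback. It's a sketch, not tested against IMDB: LinkChecker and its methods are hypothetical names, and retrying on 501 as well as 405 is my own assumption rather than something the question requires.

```csharp
using System;
using System.Net;

static class LinkChecker
{
    // Decides whether a HEAD result means "this server wants GET instead".
    // 405 is the case from the question; 501 is an extra assumption.
    public static bool ShouldRetryWithGet(HttpStatusCode status)
    {
        return status == HttpStatusCode.MethodNotAllowed   // 405
            || status == HttpStatusCode.NotImplemented;    // 501
    }

    // Issues one request and returns the HTTP status, including error statuses.
    static HttpStatusCode Probe(string url, string method, int timeoutMs)
    {
        var request = (HttpWebRequest)WebRequest.Create(url);
        request.Method = method;
        request.Timeout = timeoutMs;
        try
        {
            using (var response = (HttpWebResponse)request.GetResponse())
            {
                // The body is never read; disposing the response releases it.
                return response.StatusCode;
            }
        }
        catch (WebException ex)
        {
            // Error statuses such as 405 arrive as a WebException.
            var errorResponse = ex.Response as HttpWebResponse;
            if (errorResponse == null) throw; // no HTTP response at all (DNS, timeout, ...)
            using (errorResponse)
            {
                return errorResponse.StatusCode;
            }
        }
    }

    // Tries HEAD first and only pays for a GET when HEAD is rejected.
    public static HttpStatusCode CheckLink(string url, int timeoutMs)
    {
        HttpStatusCode status = Probe(url, "HEAD", timeoutMs);
        if (ShouldRetryWithGet(status))
        {
            status = Probe(url, "GET", timeoutMs);
        }
        return status;
    }
}
```

If you want the per-domain behaviour, you could keep a set of hosts that rejected HEAD and skip straight to GET for those.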

If the server requires GET and you want to reduce network traffic, you could try doing a conditional GET and/or a partial GET (see e.g. RFC2616). I've never tried doing those with WebRequest but I think it lets you add custom outgoing HTTP headers, so you should be able to do it.
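
For what it's worth, HttpWebRequest does expose both: AddRange for a partial GET and the IfModifiedSince property for a conditional GET. Here is a minimal sketch; PartialGetProbe, ProbeFirstByte and IsLinkAlive are hypothetical names, and treating any 2xx or 3xx status as "alive" is an assumption on my part. Servers are free to ignore the Range header and answer 200 with the full body.

```csharp
using System;
using System.Net;

static class PartialGetProbe
{
    // Assumption: any 2xx or 3xx status (including 206 and 304) counts as a live link.
    public static bool IsLinkAlive(HttpStatusCode status)
    {
        int code = (int)status;
        return code >= 200 && code < 400;
    }

    // Asks for only the first byte, optionally as a conditional request.
    public static HttpStatusCode ProbeFirstByte(string url, int timeoutMs, DateTime? lastChecked)
    {
        var request = (HttpWebRequest)WebRequest.Create(url);
        request.Method = "GET";
        request.Timeout = timeoutMs;
        request.AddRange(0, 0); // sends "Range: bytes=0-0"
        if (lastChecked.HasValue)
        {
            // Conditional GET: the server may answer 304 with no body.
            request.IfModifiedSince = lastChecked.Value;
        }
        try
        {
            using (var response = (HttpWebResponse)request.GetResponse())
            {
                // Expect 206 if the range was honoured, 200 if it was ignored.
                return response.StatusCode;
            }
        }
        catch (WebException ex)
        {
            // Error statuses (and possibly 304) surface as a WebException.
            var errorResponse = ex.Response as HttpWebResponse;
            if (errorResponse == null) throw; // no HTTP response at all
            using (errorResponse)
            {
                return errorResponse.StatusCode;
            }
        }
    }
}
```

Pass null for lastChecked on the first visit, and the time of your previous check afterwards.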

Also, don't forget that, if you're writing a spider (which you clearly are), you should respect the server's robots.txt, and it's also courteous to throttle your requests to something like one request every two seconds, so you don't slashdot the server.
