
Apache HTTPClient throws java.net.SocketException: Connection reset for many domains

I'm creating a (well-behaved) web spider, and I've noticed that some servers cause Apache HttpClient to throw a SocketException -- specifically:

java.net.SocketException: Connection reset

The code that causes this is:

// Execute the request
HttpResponse response;
try {
    response = httpclient.execute(httpget); // httpclient is of type HttpClient
} catch (NullPointerException e) {
    return; // deep down, Apache HttpClient sometimes throws a NullPointerException...
}

For most servers it's just fine. But for others, it immediately throws a SocketException.

Example of site that causes immediate SocketException: http://www.bhphotovideo.com/

Works great (as do most websites): http://www.google.com/

Now, as you can see, www.bhphotovideo.com loads fine in a web browser. It also loads fine when I don't use Apache's HTTP Client. (Code like this:)

// Assumed context: url is a java.net.URL and source is a StringBuilder
HttpURLConnection c = (HttpURLConnection) url.openConnection();
BufferedInputStream in = new BufferedInputStream(c.getInputStream());
Reader r = new InputStreamReader(in); // uses the platform default charset

int i;
while ((i = r.read()) != -1) {
    source.append((char) i);
}

So, why don't I just use this code instead? Well there are some key features in Apache's HTTP Client that I need to use.

Does anyone know why some servers trigger this exception?

Research so far:

  • Problem occurs on my local Mac dev machines AND an AWS EC2 Instance, so it's not a local firewall.

  • It seems the error isn't caused by the remote machine, because the exception doesn't say "by peer".

  • This Stack Overflow question seems relevant: "java.net.SocketException: Connection reset", but the answers don't explain why this would happen only with Apache HttpClient and not with other approaches.

Bonus question: I'm doing a fair amount of crawling with this system. Is there generally a better Java class for this other than Apache HTTP Client? I've found a number of issues (such as the NullPointerException I have to catch in the code above). It seems that HTTPClient is very picky about server communications -- more picky than I'd like for a crawler that can't just break when a server doesn't behave.

Thanks all!

Solution

Honestly, I don't have a perfect solution, but it works, so that's good enough for me.

As pointed out by oleg below, Bixo has created a crawler that customizes HttpClient to be more forgiving to servers. To "get around" the issue more than fix it, I just used the SimpleHttpFetcher provided by Bixo here: (link removed - SO thinks I'm a spammer, so you'll have to google it yourself)

SimpleHttpFetcher fetch = new SimpleHttpFetcher(new UserAgent("botname","contact@yourcompany.com","ENTER URL"));
try {
    FetchedResult result = fetch.fetch("ENTER URL");
    System.out.println(new String(result.getContent()));
} catch (BaseFetchException e) {
    e.printStackTrace();
}

The downside of this solution is that Bixo has a lot of dependencies, so this may not be a good workaround for everyone. However, you can always work through their use of DefaultHttpClient and see how they instantiated it to get it to work. I decided to use the whole class because it handles some helpful things for me, like automatically following redirects (and reporting the final destination URL).

Thanks for the help all.

Edit: TinyBixo

Hi all. So, I loved how Bixo worked, but didn't like that it had so many dependencies (including all of Hadoop). So, I created a vastly simplified Bixo, without all the dependencies. If you're running into the problems above, I would recommend using it (and feel free to make pull requests if you'd like to update it!)

It's available here: https://github.com/juliuss/TinyBixo


First, to answer your question:

The connection reset was caused by a problem on the server side. Most likely the server failed to parse the request or was unable to process it and dropped the connection as a result without returning a valid response. There is likely something in the HTTP requests generated by HttpClient that causes server side logic to fail, probably due to a server side bug. Just because the error message does not say 'by peer' does not mean the connection reset took place on the client side.
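The point about "by peer" can be reproduced without HttpClient. Below is a minimal sketch using only the JDK; the class and method names are made up for illustration. A local server accepts a connection and aborts it (SO_LINGER with a zero timeout makes close() send a TCP RST instead of a normal FIN), and the client then reads. On typical JVMs the client sees a SocketException whose message is just "Connection reset", even though the remote end caused it.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketException;

class ResetDemo {
    /** Returns the message of the exception the client sees after the server aborts the connection. */
    static String provokeReset() throws IOException {
        try (ServerSocket server = new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
             Socket client = new Socket(server.getInetAddress(), server.getLocalPort())) {
            Socket accepted = server.accept();
            accepted.setSoLinger(true, 0); // with linger 0, close() sends RST rather than FIN
            accepted.close();              // the "server" drops the connection without a response
            try {
                int b = client.getInputStream().read();
                return "no exception, read returned " + b;
            } catch (SocketException e) {
                return e.getMessage();     // the reset was triggered by the remote side
            }
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(provokeReset());
    }
}
```

The exact wording of the message is platform-dependent, so treat it as a hint about where the failure surfaced (read vs. write), not about which side is at fault.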

A few remarks:

(1) Several popular web crawlers such as bixo http://openbixo.org/ use HttpClient without major issues, but pretty much all of them had to tweak HttpClient's behavior to make it more lenient about common HTTP protocol violations. By default, HttpClient is rather strict about HTTP protocol compliance.

(2) Why did you not report the NPE problem, or any other problem you have been experiencing, to the HttpClient project?
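One way to look for the request difference suspected in this answer is to point each client at a local socket and print exactly what it sends. The sketch below does this for HttpURLConnection using only the JDK; RequestDump and captureRequest are hypothetical names, and the throwaway server is an assumption for illustration, not part of any library. Running HttpClient against the same local URL should show its version of the request for comparison.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.InetAddress;
import java.net.Proxy;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.URI;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class RequestDump {
    /** Sends one GET with HttpURLConnection and returns the request line and headers it put on the wire. */
    static List<String> captureRequest() throws Exception {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try (ServerSocket server = new ServerSocket(0, 1, InetAddress.getByName("127.0.0.1"))) {
            // Throwaway server: record lines up to the blank line, then answer with an empty 200.
            Future<List<String>> captured = pool.submit(() -> {
                List<String> lines = new ArrayList<>();
                try (Socket s = server.accept()) {
                    BufferedReader in = new BufferedReader(
                            new InputStreamReader(s.getInputStream(), StandardCharsets.ISO_8859_1));
                    String line;
                    while ((line = in.readLine()) != null && !line.isEmpty()) {
                        lines.add(line);
                    }
                    OutputStream out = s.getOutputStream();
                    out.write("HTTP/1.1 200 OK\r\nContent-Length: 0\r\nConnection: close\r\n\r\n"
                            .getBytes(StandardCharsets.ISO_8859_1));
                    out.flush();
                }
                return lines;
            });
            URL url = URI.create("http://127.0.0.1:" + server.getLocalPort() + "/").toURL();
            HttpURLConnection c = (HttpURLConnection) url.openConnection(Proxy.NO_PROXY);
            c.getResponseCode(); // forces the request to be sent
            c.disconnect();
            return captured.get();
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        captureRequest().forEach(System.out::println);
    }
}
```

Differences in headers such as User-Agent or Accept between the two dumps are reasonable first suspects, though this alone does not prove what the remote server objects to.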


These two settings will sometimes help:

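 client.getParams().setParameter("http.socket.timeout", Integer.valueOf(0));
 client.getParams().setParameter("http.connection.stalecheck", Boolean.TRUE);

The first sets the socket timeout to infinite (0 means no timeout). The second makes HttpClient check whether a reused connection has gone stale before sending a request on it.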
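For a crawler, it may be worth seeing what the socket timeout actually controls before setting it to infinite. The sketch below uses a plain java.net.Socket rather than HttpClient (an assumption made so it runs with the JDK alone; TimeoutDemo and readTimesOut are hypothetical names). With a non-zero SO_TIMEOUT, a read against a server that never answers is cut off with a SocketTimeoutException; with 0, the same read would block indefinitely.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

class TimeoutDemo {
    /** Returns true if a read against a silent server is cut off by SO_TIMEOUT. */
    static boolean readTimesOut(int timeoutMillis) throws IOException {
        try (ServerSocket server = new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
             Socket client = new Socket(server.getInetAddress(), server.getLocalPort());
             Socket accepted = server.accept()) {   // the server side never writes anything
            client.setSoTimeout(timeoutMillis);      // 0 would mean "wait forever"
            try {
                client.getInputStream().read();
                return false;                        // data or end-of-stream arrived
            } catch (SocketTimeoutException e) {
                return true;                         // the read gave up after timeoutMillis
            }
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(readTimesOut(200));
    }
}
```

An infinite timeout can hide stalls behind a thread that never returns, so a large but finite value may suit a crawler better.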


Try getting a network trace using Wireshark, and augment that with log4j logging of HttpClient. That should show why the connection is being reset.
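If setting up log4j is inconvenient, a rough alternative is to switch commons-logging to its built-in SimpleLog through system properties. This is a sketch that swaps SimpleLog in for log4j and assumes HttpClient 4.x with commons-logging on the classpath; WireLogging is a hypothetical helper name, and the properties have to be set before any HttpClient class is loaded.

```java
class WireLogging {
    /** Call once at startup, before the first HttpClient object is created. */
    static void enable() {
        // Route commons-logging to SimpleLog, which writes to stderr.
        System.setProperty("org.apache.commons.logging.Log",
                "org.apache.commons.logging.impl.SimpleLog");
        System.setProperty("org.apache.commons.logging.simplelog.showdatetime", "true");
        // Log the raw bytes sent and received, plus the parsed headers.
        System.setProperty("org.apache.commons.logging.simplelog.log.org.apache.http.wire", "DEBUG");
        System.setProperty("org.apache.commons.logging.simplelog.log.org.apache.http.headers", "DEBUG");
    }

    public static void main(String[] args) {
        enable();
        System.out.println(System.getProperty(
                "org.apache.commons.logging.simplelog.log.org.apache.http.wire"));
    }
}
```

The wire log shows the request as HttpClient built it, which can then be compared against the Wireshark capture up to the point where the reset arrives.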
