Crawler leaves lots of ESTABLISHED TCP sockets to some servers
I've got a Java web crawler. I've noticed that, for a small number of the servers I crawl, I am left with a large number of ESTABLISHED sockets:
joel@bohr:~/tmp/test$ lsof -p 6760 | grep TCP
java 6760 joel 105u IPv6 96546 0t0 TCP bohr:55602->174.143.223.193:www (ESTABLISHED)
java 6760 joel 109u IPv6 96574 0t0 TCP bohr:55623->174.143.223.193:www (ESTABLISHED)
java 6760 joel 110u IPv6 96622 0t0 TCP bohr:55644->174.143.223.193:www (ESTABLISHED)
java 6760 joel 111u IPv6 96674 0t0 TCP bohr:55665->174.143.223.193:www (ESTABLISHED)
There can be many tens of these to any one server, and I can't figure out why they are being left open.
I'm using HttpURLConnection to establish connections and read data. HTTP 1.1 with keep-alive is on (by default). It's my understanding that the underlying TCP socket to a remote server will be re-used by Java's HttpURLConnection, so long as I close the input/error stream and all data is read from the stream. It's also my understanding that if an exception is thrown, then so long as the input/error stream is closed (if not null), the socket, although not re-used again, will be closed. (Java handling of HTTP keep-alive.)
My abbreviated code looks like this:
HttpURLConnection conn = null; // declared outside try so the catch block can see it
InputStream is = null;
try {
    conn = (HttpURLConnection) uri.toURL().openConnection();
    conn.setReadTimeout(10000);
    conn.setConnectTimeout(10000);
    conn.setRequestProperty("User-Agent", userAgent);
    conn.setRequestProperty("Accept", "text/html,text/xml,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8");
    conn.setRequestProperty("Accept-Encoding", "gzip, deflate");
    conn.setRequestProperty("Accept-Language", "en-gb,en;q=0.5");
    conn.connect();

    try {
        int responseCode = conn.getResponseCode();
        is = conn.getInputStream();
    } catch (IOException e) {
        is = conn.getErrorStream();
        if (is != null) {
            // consume the error stream so the socket can be reused, per
            // http://download.oracle.com/javase/6/docs/technotes/guides/net/http-keepalive.html
            StreamUtils.readStreamToBytes(is, -1, MAX_LN);
        }
        throw e;
    }

    String type = conn.getContentType();
    byte[] response = StreamUtils.readStream(is);
    // do something with content
} catch (Exception e) {
    if (conn != null) {
        conn.disconnect(); // don't try to re-use the socket - just be done with it
    }
    throw e;
} finally {
    if (is != null) {
        is.close();
    }
}
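StreamUtils is my own utility class; for context, readStreamToBytes does roughly the following (this sketch is illustrative, not the real implementation):

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

// Illustrative stand-in for StreamUtils.readStreamToBytes: read to EOF so the
// keep-alive socket can be returned to the pool, keeping at most maxLen bytes.
static byte[] drainStream(InputStream in, int maxLen) throws IOException {
    ByteArrayOutputStream buf = new ByteArrayOutputStream();
    byte[] chunk = new byte[4096];
    int n;
    while ((n = in.read(chunk)) != -1) {
        int keep = Math.min(n, maxLen - buf.size());
        if (keep > 0) {
            buf.write(chunk, 0, keep);
        }
        // keep reading past maxLen so the stream is fully drained to EOF
    }
    return buf.toByteArray();
}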
I've noticed that, for one site where this is happening, I get a lot of IOExceptions when making GET requests, due to:
java.net.ProtocolException: Server redirected too many times (20)
I'm fairly sure I'm handling this case and closing the socket properly. Could it really be this, or is it something else I'm doing wrong? Could it be a result of mis-using keep-alive, and if so, how do I fix it? I'd rather not have to turn keep-alive off to solve the problem.
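For reference, the JDK's keep-alive behaviour can also be tuned process-wide via the standard http.keepAlive and http.maxConnections networking system properties. A minimal sketch (the values shown are just illustrative, and they must be set before the first HTTP connection is opened):

// Must be set before the JVM opens its first HTTP connection.
System.setProperty("http.keepAlive", "true");    // "false" disables connection pooling entirely
System.setProperty("http.maxConnections", "5");  // max idle connections kept alive per destination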
EDIT: I've tested setting the following property:
conn.setRequestProperty("Connection", "close"); // supposed to disable keep-alive
Sending the Connection: close header disables persistent TCP connections, and all sockets are eventually cleaned up. So the problem I am seeing does indeed seem to be related to keep-alive and sockets not being closed correctly, even after closing the input stream.
EDIT2: could it be that one socket is created every time the request is redirected? Where this problem is noticeable, the request is redirected 20 times before the exception above is thrown. If so, is there a way of limiting the number of redirects on a URLConnection?
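As far as I can tell the 20-hop limit is not configurable on HttpURLConnection itself, but one possible workaround is to turn off automatic redirect handling and follow Location headers by hand with a hard cap. A rough sketch of the idea (the helper name and the cap are mine):

import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

// Sketch: open a connection, following redirects manually, capped at maxHops.
static HttpURLConnection openWithRedirectCap(URL url, int maxHops) throws IOException {
    for (int hop = 0; hop < maxHops; hop++) {
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setInstanceFollowRedirects(false); // we handle Location ourselves
        int code = conn.getResponseCode();
        if (code / 100 != 3) {
            return conn; // not a redirect; caller reads and closes the stream
        }
        String location = conn.getHeaderField("Location");
        conn.getInputStream().close(); // drain/close so the socket can be reused
        if (location == null) {
            throw new IOException("Redirect response without Location header");
        }
        url = new URL(url, location); // resolve relative Location values
    }
    throw new IOException("Redirected more than " + maxHops + " times");
}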
You need to move conn.disconnect() into your finally block. As written, you only disconnect when an exception is thrown.
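Something along these lines (a sketch only; note that conn has to be declared outside the try for this to compile):

HttpURLConnection conn = null;
InputStream is = null;
try {
    conn = (HttpURLConnection) uri.toURL().openConnection();
    // ... configure the connection, read and process the stream as before ...
} finally {
    if (is != null) {
        is.close();
    }
    if (conn != null) {
        conn.disconnect(); // always release the connection, not just on error
    }
}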