Apache httpclient returning page before it loads?
I noticed a strange phenomenon when using the Apache HttpClient libraries, and I want to know why it occurs. I created some sample code to demonstrate. Consider the following code:
//Example URL
String url = "http://www.amazon.com/gp/offer-listing/05961580/ref=dp_olp_used?ie=UTF8";
GetMethod get = new GetMethod(url);
HttpMethodRetryHandler httpHandler = new DefaultHttpMethodRetryHandler(1, false);
get.getParams().setParameter(HttpMethodParams.RETRY_HANDLER, httpHandler );
get.getParams().setCookiePolicy(CookiePolicy.IGNORE_COOKIES);
HttpConnectionManager connectionManager = new SimpleHttpConnectionManager();
HttpClient client = new HttpClient( connectionManager );
client.getParams().setParameter("http.useragent", FIREFOX );
String line;
StringBuilder stringBuilder = new StringBuilder();
String toStreamBody = null;
String toStringBody = null;
try {
int statusCode = client.executeMethod(get);
if( statusCode != HttpStatus.SC_OK ){
System.err.println("Internet Status: " + HttpStatus.getStatusText(statusCode) );
System.err.println("While getting page: " + url );
}
//toString
toStringBody = get.getResponseBodyAsString();
//toStream
InputStreamReader isr = new InputStreamReader(get.getResponseBodyAsStream());
BufferedReader rd = new BufferedReader(isr);
while ((line = rd.readLine()) != null) {
stringBuilder.append(line);
}
} catch (java.io.IOException ex) {
System.out.println( "Failed to get page: " + url);
} finally {
get.releaseConnection();
}
toStreamBody = stringBuilder.toString();
This code prints nothing:
System.out.println(toStringBody); // ""
This code prints the web page:
System.out.println(toStreamBody); // "Whole Page"
But it gets even stranger... Replace:
get.getResponseBodyAsString();
With:
get.getResponseBodyAsString(150000);
Now we get the error:
Failed to get page: http://www.amazon.com/gp/offer-listing/0596158068/ref=dp_olp_used?ie=UTF8
I was unable to find any website other than Amazon that reproduces this behavior, but I assume there are others.
I am aware that the documentation at http://hc.apache.org/httpclient-3.x/performance.html discourages the use of getResponseBodyAsString(). However, it does not say that the page will not load, only that you may be at risk of an out-of-memory exception. Is it possible that getResponseBodyAsString() is returning the page before it loads? Why does this only happen with Amazon?
Did you test with any other URL?
The URL in the code you provided redirects with a 302 to http://www.amazon.com/dp/05961580/?tag=stackoverfl08-20, which then returns 404 (Not Found).
HttpClient 3.x does not handle every kind of redirect automatically: http://hc.apache.org/httpclient-3.x/redirects.html
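If you want to see a redirect chain like this for yourself, one option is to switch off automatic redirect following and log the status code of every hop. Below is a minimal sketch of that idea. Note that it uses the JDK's own HttpURLConnection rather than Commons HttpClient, and it runs against a throwaway local server that imitates the 302 -> 404 sequence instead of hitting Amazon. The names RedirectTrace, trace, demo, /old and /new are made up for the illustration.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

class RedirectTrace {

    /** Follows Location headers by hand and records every status code seen. */
    static List<Integer> trace(String startUrl, int maxHops) {
        List<Integer> statuses = new ArrayList<>();
        String current = startUrl;
        try {
            for (int hop = 0; hop <= maxHops && current != null; hop++) {
                HttpURLConnection conn =
                        (HttpURLConnection) URI.create(current).toURL().openConnection();
                // Disable automatic following so each hop stays visible.
                conn.setInstanceFollowRedirects(false);
                int status = conn.getResponseCode();
                statuses.add(status);
                String location = conn.getHeaderField("Location");
                conn.disconnect();
                boolean redirect = status >= 300 && status < 400 && location != null;
                // Resolve relative Location values against the current URL.
                current = redirect ? URI.create(current).resolve(location).toString() : null;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return statuses;
    }

    /** Starts a local stub server (hypothetical paths) that answers 302, then 404. */
    static List<Integer> demo() {
        try {
            HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
            server.createContext("/old", ex -> {
                ex.getResponseHeaders().add("Location", "/new");
                ex.sendResponseHeaders(302, -1); // -1 means no response body
                ex.close();
            });
            server.createContext("/new", ex -> {
                ex.sendResponseHeaders(404, -1);
                ex.close();
            });
            server.start();
            try {
                int port = server.getAddress().getPort();
                return trace("http://127.0.0.1:" + port + "/old", 5);
            } finally {
                server.stop(0);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Against the stub server, the trace records a 302 followed by a 404, which is the same pattern as the URL in the question. Pointing trace(...) at a real URL should show whether the final hop is actually a 200 before you start comparing response bodies.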