Apache HTTPClient returns an empty page
I am using the Apache HTTPClient for Java and I'm facing a really strange issue. Sometimes when I try to get a dynamically generated page it returns the actual content, but other times (with another parameter) all I get is a short sequence of \t, \r and \n characters.
How can I track what is going on in the different cases in order to find where the bug is?
My usage of the library is pretty straightforward; all I do is make these few calls on an initialized HTTPClient object:
String content = "/pageIwant.jsp?parameter=10101010";
HttpGet request = new HttpGet(content);
HttpResponse response = client.execute(targetHost, request);
HttpEntity entity = response.getEntity();
String page = EntityUtils.toString(entity);
The way I would approach this is to start by attempting to fetch the same page using a web browser. If you cannot get that to work, it is probably safe to conclude that the real problem is with the server, and you'll need to talk to the server's support staff.
If a browser works, try to repeat the process using the wget utility. If wget gives you problems, go back to your browser, find out exactly what headers the browser is sending in the HTTP request, and try to get wget to use the same headers. Once you've got wget to work, make a note of the headers.
Finally, return to your Java code and modify it so that the HTTP request headers it sends are the same as those that work for wget.
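As a rough sketch of that last step, the snippet below copies a set of captured headers onto a request. It uses the JDK's built-in java.net.http API rather than Apache HttpClient so that it runs without extra dependencies; with Apache HttpClient 4.x the corresponding step would be calling addHeader(name, value) on the HttpGet. The class name HeaderReplay, the example.invalid host and the header values are placeholders for illustration, not values taken from the question.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.util.LinkedHashMap;
import java.util.Map;

class HeaderReplay {
    // Builds a GET request that carries every header in the given map.
    static HttpRequest buildRequest(String url, Map<String, String> headers) {
        HttpRequest.Builder builder = HttpRequest.newBuilder(URI.create(url)).GET();
        for (Map.Entry<String, String> header : headers.entrySet()) {
            builder.header(header.getKey(), header.getValue());
        }
        return builder.build();
    }

    public static void main(String[] args) {
        // Placeholder values; substitute the headers noted down from the working wget call.
        Map<String, String> headers = new LinkedHashMap<>();
        headers.put("User-Agent", "Mozilla/5.0");
        headers.put("Accept", "text/html");
        headers.put("Accept-Language", "en-US,en;q=0.8");

        HttpRequest request = buildRequest(
                "http://example.invalid/pageIwant.jsp?parameter=10101010", headers);

        // Print what would be sent, so it can be compared with the wget request.
        request.headers().map().forEach(
                (name, values) -> System.out.println(name + ": " + values));
    }
}
```

Printing the headers before sending makes it easy to compare the Java request line by line with the one that worked in wget.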
Yes, I have to authenticate using the proxy of my university and then I am able to access all the data. The proxy authentication is working flawlessly for the 'journal page' and even for other sites, so I'd rule that out as the cause of the problem.
I think you may have excluded the real problem. @BalasC is not talking about proxy authentication. Rather he is talking about authentication at the IEEE site. And just because one part of the site appears to work without authentication does not mean it all will. (However, I'd have thought that the site would respond with a "FORBIDDEN" or "AUTHORIZATION REQUIRED" error rather than delivering strange content.)
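One way to tell these cases apart is to look at the status code together with the body, rather than the body alone. The helper below is only a sketch in plain Java: the class and method names are made up for illustration, and the status code and body would come from whatever HTTP client is in use (in Apache HttpClient 4.x, response.getStatusLine().getStatusCode() and EntityUtils.toString(entity)).

```java
class ResponseDiagnosis {
    // Makes tab, carriage return and newline visible so a "blank" body can be read in a log.
    static String escapeWhitespace(String body) {
        return body.replace("\t", "\\t").replace("\r", "\\r").replace("\n", "\\n");
    }

    // Gives a rough classification of a response from its status code and body.
    static String diagnose(int status, String body) {
        if (status == 401) {
            return "authorization required";
        }
        if (status == 403) {
            return "forbidden";
        }
        if (status >= 200 && status < 300) {
            return body.trim().isEmpty() ? "success status but blank body" : "ok";
        }
        return "unexpected status " + status;
    }

    public static void main(String[] args) {
        String body = "\t\r\n\t\r\n";
        System.out.println(diagnose(200, body));
        System.out.println(escapeWhitespace(body));
    }
}
```

A success status with a blank body suggests the server accepted the request but rendered the page differently, which points toward request headers or session state rather than an outright authentication failure.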
Another possibility is that the site is trying to prevent "screen scraping" of its content by automatic tools. Check the "Terms of Service" for the site to see if what you are trying to do is allowed. (You may choose to ignore the ToS and circumvent the technical measures, but then you might find yourself or your organization IP-blocked, or you might be on the receiving end of cease-and-desist letters talking about copyright violation.)
I found the solution to my problem: I was missing some header information that is apparently required only by part of the dynamic page.
To solve my issue I first used Wireshark to inspect the communication between the browser and the server, and then I added all the headers I was missing.
I found out that in my case I needed to specify the 'Accept-Language' header.
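The effect can be reproduced locally with a small sketch. Everything here is an assumption made for illustration: the stand-in server (built with the JDK's com.sun.net.httpserver) only imitates the behaviour described above by returning whitespace when 'Accept-Language' is absent, and the client is the JDK's java.net.http API rather than Apache HttpClient. With Apache HttpClient 4.x the fix itself would be a single call such as request.addHeader("Accept-Language", "en-US,en;q=0.8") before executing the request.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

class AcceptLanguageDemo {
    // Hypothetical stand-in for the real site: answers with whitespace only
    // unless the request carries an Accept-Language header.
    static HttpServer startServer() throws Exception {
        HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
        server.createContext("/pageIwant.jsp", exchange -> {
            boolean hasLanguage = exchange.getRequestHeaders().getFirst("Accept-Language") != null;
            String text = hasLanguage ? "<html>real content</html>" : "\t\r\n\t\r\n";
            byte[] body = text.getBytes(StandardCharsets.UTF_8);
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream out = exchange.getResponseBody()) {
                out.write(body);
            }
        });
        server.start();
        return server;
    }

    // Fetches the page, adding Accept-Language only when a value is given.
    static String fetch(int port, String acceptLanguage) throws Exception {
        HttpRequest.Builder builder = HttpRequest.newBuilder(
                URI.create("http://127.0.0.1:" + port + "/pageIwant.jsp?parameter=10101010"));
        if (acceptLanguage != null) {
            builder.header("Accept-Language", acceptLanguage);
        }
        HttpClient client = HttpClient.newBuilder()
                .version(HttpClient.Version.HTTP_1_1)
                .build();
        return client.send(builder.build(), HttpResponse.BodyHandlers.ofString()).body();
    }

    // Runs both requests and returns {bodyWithoutHeader, bodyWithHeader}.
    static String[] runDemo() {
        HttpServer server = null;
        try {
            server = startServer();
            int port = server.getAddress().getPort();
            return new String[] { fetch(port, null), fetch(port, "en-US,en;q=0.8") };
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            if (server != null) {
                server.stop(0);
            }
        }
    }

    public static void main(String[] args) {
        String[] bodies = runDemo();
        System.out.println("without header, blank body: " + bodies[0].trim().isEmpty());
        System.out.println("with header: " + bodies[1]);
    }
}
```

Both requests get a success status from the stand-in server, which is why the missing header is easy to overlook without inspecting the traffic.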