Need help in getting HTML of a website in Java
I got some code from java httpurlconnection cutting off html and I am using pretty much the same code to fetch HTML from websites in Java, except for one particular website that I am unable to make it work with:
I am trying to get HTML from this website:
http://www.geni.com/genealogy/people/William-Jefferson-Blythe-Clinton/6000000001961474289
But I keep getting junk characters, although it works very well with any other website, like http://www.google.com.
And this is the code that I am using:
public static String PrintHTML() {
    URL url = null;
    try {
        url = new URL("http://www.geni.com/genealogy/people/William-Jefferson-Blythe-Clinton/6000000001961474289");
    } catch (MalformedURLException e1) {
        // TODO Auto-generated catch block
        e1.printStackTrace();
    }
    HttpURLConnection connection = null;
    try {
        connection = (HttpURLConnection) url.openConnection();
    } catch (IOException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }
    connection.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.2.6) Gecko/20100625 Firefox/3.6.6");
    try {
        System.out.println(connection.getResponseCode());
    } catch (IOException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }
    String line;
    StringBuilder builder = new StringBuilder();
    BufferedReader reader = null;
    try {
        reader = new BufferedReader(new InputStreamReader(connection.getInputStream()));
    } catch (IOException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }
    try {
        while ((line = reader.readLine()) != null) {
            builder.append(line);
            builder.append("\n");
        }
    } catch (IOException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }
    String html = builder.toString();
    System.out.println("HTML " + html);
    return html;
}
I don't understand why it doesn't work with the URL that I mentioned above.
Any help will be appreciated.
That site is incorrectly gzipping the response regardless of the client's capabilities. Normally a server should only gzip the response when the client indicates support for it (via the Accept-Encoding: gzip request header). You need to ungzip it using GZIPInputStream.
reader = new BufferedReader(new InputStreamReader(new GZIPInputStream(connection.getInputStream()), "UTF-8"));
Note that I also passed the right charset to the InputStreamReader constructor. Normally you'd extract it from the Content-Type header of the response.
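Putting both points together, here is a minimal sketch. The method name fetch is just illustrative, and the Content-Encoding check assumes the server actually declares the header; if a server gzips without declaring it (as this one appears to), you'd fall back to wrapping the stream unconditionally as in the one-liner above.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.zip.GZIPInputStream;

public class FetchHtml {

    public static String fetch(String address) throws IOException {
        HttpURLConnection connection = (HttpURLConnection) new URL(address).openConnection();
        connection.setRequestProperty("User-Agent", "Mozilla/5.0");

        // Unwrap gzip when the server declares it in the response headers.
        InputStream body = connection.getInputStream();
        if ("gzip".equalsIgnoreCase(connection.getContentEncoding())) {
            body = new GZIPInputStream(body);
        }

        // Pull the charset out of the Content-Type header,
        // e.g. "text/html; charset=UTF-8"; fall back to UTF-8 if absent.
        String charset = "UTF-8";
        String contentType = connection.getContentType();
        if (contentType != null) {
            for (String param : contentType.split(";")) {
                param = param.trim();
                if (param.toLowerCase().startsWith("charset=")) {
                    charset = param.substring("charset=".length());
                }
            }
        }

        // Read the (now plain) stream with the detected charset.
        StringBuilder builder = new StringBuilder();
        BufferedReader reader = new BufferedReader(new InputStreamReader(body, charset));
        String line;
        while ((line = reader.readLine()) != null) {
            builder.append(line).append("\n");
        }
        reader.close();
        return builder.toString();
    }
}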
For more hints, see also How to use URLConnection to fire and handle HTTP requests? If all you ultimately want is to parse/extract information from the HTML, then I strongly recommend using an HTML parser like Jsoup instead.
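For completeness, a small Jsoup sketch; the title and link extraction are only examples of what you might pull out. Jsoup handles gzip and charset detection for you.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JsoupFetch {
    public static void main(String[] args) throws Exception {
        // Fetch and parse the page in one go.
        Document doc = Jsoup.connect("http://www.geni.com/genealogy/people/William-Jefferson-Blythe-Clinton/6000000001961474289")
                .userAgent("Mozilla/5.0")
                .get();
        System.out.println(doc.title());

        // Example: extract every absolute link on the page.
        for (Element link : doc.select("a[href]")) {
            System.out.println(link.attr("abs:href"));
        }
    }
}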