Downloading a web page: OK with wget, fails with Java
I'm trying to download the following page: http://structureddata.wikispaces.com/Test
wget without any options fails:
wget "http://structureddata.wikispaces.com/Test"
(...) connect to session.wikispaces.com insecurely, use `--no-check-certificate'
with --no-check-certificate, it works
wget --no-check-certificate "http://structureddata.wikispaces.com/Test"
grep Hello Test
Hello World
Now I would like to download the same URL with Java, but the following simple program:
import java.net.*;
import java.io.*;

public class Test {
    public static void main(String[] args) {
        int c;
        try {
            InputStream in = new URL("http://structureddata.wikispaces.com/Test").openStream();
            while ((c = in.read()) != -1) System.out.print((char) c);
            in.close();
        } catch (Throwable err) {
            err.printStackTrace();
        }
    }
}
prints nothing.
What should I do to download the page with Java?
Many thanks,
Ppierre
The Java URL API is fairly low-level; it will not automatically follow this redirect for you. Your code prints nothing because the 302 response itself has no body to read.
By running something like the code below, you'll see that what you are actually getting is an HTTP 302 response, a redirect.
URL url = new URL("http://structureddata.wikispaces.com/Test");
URLConnection urlConnection = url.openConnection();
Map<String, List<String>> headers = urlConnection.getHeaderFields();
for (Map.Entry<String, List<String>> entry : headers.entrySet()) {
    System.out.println("Header name: " + entry.getKey());
    for (String value : entry.getValue()) {
        System.out.print("Header value: " + value);
    }
    System.out.println();
    System.out.println();
}
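In the output you should see the 302 status and a Location header pointing at an https:// URL (on session.wikispaces.com, which matches the certificate warning wget printed above).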
I'd suggest using a library like Apache Commons HttpClient, which will handle more of the protocol for you.
(Credit where it's due: the code above is copied from here.)
You may want to look at Commons HttpClient; this code returns the page without a problem:
// Requires Commons HttpClient 3.x and Commons IO:
import java.io.IOException;
import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.HttpStatus;
import org.apache.commons.httpclient.methods.GetMethod;
import org.apache.commons.io.IOUtils;

final HttpClient client = new HttpClient();
final GetMethod method = new GetMethod("http://structureddata.wikispaces.com/Test");
try {
    if (HttpStatus.SC_OK == client.executeMethod(method)) {
        System.out.println(IOUtils.toString(method.getResponseBodyAsStream()));
    } else {
        throw new IOException("Unable to load page, error " + method.getStatusLine());
    }
} finally {
    method.releaseConnection();
}
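As far as I know, Commons HttpClient 3.x follows redirects on GET requests by default, which is why a single executeMethod call is enough here.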
The problem is that the server returns a 302 redirect response to an https URL. Since the initial request is http and the target is https, the URLConnection won't follow the redirect automatically (it will, however, when the target uses the same scheme). After some observation I concluded that it goes to https to obtain an authentication token, which in turn gets redirected to an http URL again with the authentication token as a request parameter. So, to get the actual page content, it has to follow redirects from http to https and then back to http.
The following works here.
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;

public static void main(String... args) throws Exception {
    // First request; the response is a 302 pointing to an https URL.
    URLConnection connection = new URL("http://structureddata.wikispaces.com/Test").openConnection();
    // Follow the redirect to the https page to obtain the authentication token.
    connection = new URL(connection.getHeaderField("location")).openConnection();
    // Follow the redirect back to the http page, now with the token as a request parameter.
    connection = new URL(connection.getHeaderField("location")).openConnection();
    // Show the page.
    BufferedReader reader = null;
    try {
        reader = new BufferedReader(new InputStreamReader(connection.getInputStream(), "UTF-8"));
        for (String line; (line = reader.readLine()) != null;) {
            System.out.println(line);
        }
    } finally {
        if (reader != null) try { reader.close(); } catch (IOException ignore) {}
    }
}
I do agree, however, that Apache HttpComponents Client is a better tool for the job.
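For what it's worth, on Java 11 or newer the built-in java.net.http.HttpClient can follow the whole chain by itself, so neither manual redirect-chasing nor an external library is needed. A minimal sketch, assuming the site still issues the http -> https -> http chain described above (the class name Fetch is mine):
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class Fetch {
    public static void main(String[] args) throws Exception {
        // Redirect.ALWAYS follows the whole http -> https -> http chain;
        // Redirect.NORMAL would refuse the final https -> http hop.
        HttpClient client = HttpClient.newBuilder()
                .followRedirects(HttpClient.Redirect.ALWAYS)
                .build();
        HttpRequest request = HttpRequest
                .newBuilder(URI.create("http://structureddata.wikispaces.com/Test"))
                .build();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());
    }
}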