Downloading a web page: OK with wget, fails with Java

I'm trying to download the following page: http://structureddata.wikispaces.com/Test

wget without any option fails:

wget "http://structureddata.wikispaces.com/Test"
(...) connect to session.wikispaces.com insecurely, use `--no-check-certificate'

With --no-check-certificate, it works:

wget --no-check-certificate "http://structureddata.wikispaces.com/Test"
grep Hello Test
 Hello World

Now I would like to download the same URL with Java, but the following simple program:

import java.net.*;
import java.io.*;
public class Test
        {
        public static void main(String args[])
                {
                int c;
                try
                        {
                        InputStream in=new URL("http://structureddata.wikispaces.com/Test").openStream();
                        while((c=in.read())!=-1) System.out.print((char)c);
                        in.close();
                        }
                catch(Throwable err)
                        {
                        err.printStackTrace();
                        }
                }
        }

prints nothing.

What should I do to download the page with Java?

Many thanks,

Ppierre


The Java URL interface is fairly low-level; it does not automatically do things like follow redirects. Your code above is getting no content to print out because there is none.

By doing something like the below, you'll see that what you are getting is an HTTP 302 response -- a redirect.

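  // Needs: import java.net.*; import java.util.*;
  URL url = new URL("http://structureddata.wikispaces.com/Test");

  URLConnection urlConnection = url.openConnection();
  // For HTTP connections, the status line is listed under a null header name.
  Map<String, List<String>> headers = urlConnection.getHeaderFields();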
  Set<Map.Entry<String, List<String>>> entrySet = headers.entrySet();
  for (Map.Entry<String, List<String>> entry : entrySet) {
    String headerName = entry.getKey();
    System.out.println("Header Name:" + headerName);
    List<String> headerValues = entry.getValue();
    for (String value : headerValues) {
      System.out.print("Header value:" + value);
    }
    System.out.println();
    System.out.println();
  }

I'd suggest using a library like HTTPClient which will handle more of the protocol for you.

(credit where it is due: Copied the above code from here.)


You may want to look at commons-httpclient; this code returns the page without any problem:

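// Needs commons-httpclient and commons-io on the classpath:
// import org.apache.commons.httpclient.*; import org.apache.commons.httpclient.methods.GetMethod;
// import org.apache.commons.io.IOUtils; import java.io.IOException;
final HttpClient client = new HttpClient();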
final GetMethod method = new GetMethod("http://structureddata.wikispaces.com/Test");
try {
    if (HttpStatus.SC_OK == client.executeMethod(method)) {
        System.out.println(IOUtils.toString(method.getResponseBodyAsStream()));
    } else {
        throw new IOException("Unable to load page, error " + method.getStatusLine());
    }
} finally {
    method.releaseConnection();
}


The problem is that the server returns a 302 redirect to an https URL. Because the initial request is http and the target is https, URLConnection won't follow the redirect automatically (it does follow redirects when the target uses the same scheme).

After some observation I concluded that it goes to https to request an authentication token, and that request in turn gets redirected back to an http URL with the authentication token as a request parameter. So the client has to follow a redirect from http to https, and then another one back to http, to reach the actual page content.

The following works here.

// Needs: import java.io.*; import java.net.*;
public static void main(String... args) throws Exception {
    // First request.
    URLConnection connection = new URL("http://structureddata.wikispaces.com/Test").openConnection();

    // Go to the redirected https page to obtain authentication token.
    connection = new URL(connection.getHeaderField("location")).openConnection();

    // Re-request the http page with the authentication token.
    connection = new URL(connection.getHeaderField("location")).openConnection();

    // Show page.
    BufferedReader reader = null;
    try {
        reader = new BufferedReader(new InputStreamReader(connection.getInputStream(), "UTF-8"));
        for (String line; ((line = reader.readLine()) != null);) {
            System.out.println(line);
        }
    } finally {
        if (reader != null) try { reader.close(); } catch (IOException ignore) {}
    }
}
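
The two hard-coded hops above can also be written as a loop that keeps following Location headers until a non-redirect response arrives. This is a minimal sketch, not a tested solution for this particular site: the class name RedirectFollower, the helpers isRedirect, resolve and open, and the cap of 5 hops are all assumptions made for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URI;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class RedirectFollower {

    // Assumed cap so that a redirect cycle cannot loop forever.
    static final int MAX_REDIRECTS = 5;

    // True for the status codes that carry a Location header to follow.
    static boolean isRedirect(int status) {
        return status == 301 || status == 302 || status == 303
                || status == 307 || status == 308;
    }

    // Resolves a Location value, which may be relative, against the current address.
    // URI.create throws IllegalArgumentException for values that are not valid URIs.
    static String resolve(String current, String location) {
        return URI.create(current).resolve(location).toString();
    }

    // Follows redirects by hand, so http -> https -> http hops are not skipped.
    static HttpURLConnection open(String start) throws IOException {
        String address = start;
        for (int hop = 0; hop <= MAX_REDIRECTS; hop++) {
            HttpURLConnection connection =
                    (HttpURLConnection) new URL(address).openConnection();
            connection.setInstanceFollowRedirects(false);
            int status = connection.getResponseCode();
            if (!isRedirect(status)) {
                return connection;
            }
            String location = connection.getHeaderField("Location");
            connection.disconnect();
            if (location == null) {
                throw new IOException("Redirect without Location header from " + address);
            }
            address = resolve(address, location);
        }
        throw new IOException("More than " + MAX_REDIRECTS + " redirects from " + start);
    }

    public static void main(String[] args) throws IOException {
        HttpURLConnection connection = open("http://structureddata.wikispaces.com/Test");
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(connection.getInputStream(), StandardCharsets.UTF_8))) {
            for (String line; (line = reader.readLine()) != null;) {
                System.out.println(line);
            }
        }
    }
}
```

Disabling the built-in redirect handling with setInstanceFollowRedirects(false) keeps every hop visible, which makes it easier to log or debug the chain.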

I do agree, however, that Apache HttpComponents Client is a better tool for the job.
