
Using Java to pull data from a webpage?

I'm attempting to make my first program in Java. The goal is to write a program that browses to a website and downloads a file for me. However, I don't know how to use Java to interact with the internet. Can anyone tell me what topics to look up/read about or recommend some good resources?


The simplest solution (without depending on any third-party library or platform) is to create a URL instance pointing to the web page / link you want to download, and read the content using streams.

For example:

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStream;
    import java.io.InputStreamReader;
    import java.net.URL;
    import java.net.URLConnection;
    
    
    public class DownloadPage {
    
        public static void main(String[] args) throws IOException {
            
            // Make a URL to the web page
            URL url = new URL("http://stackoverflow.com/questions/6159118/using-java-to-pull-data-from-a-webpage");
            
            // Get the input stream through URL Connection
            URLConnection con = url.openConnection();
            InputStream is = con.getInputStream();
            
            // Once you have the Input Stream, it's just plain old Java IO stuff.
            
            // In this case, since you are interested in getting a plain-text web page,
            // I'll use a reader and output the text content to System.out.
            
            // For binary content, it's better to read the bytes directly from the stream
            // and write them to the target file (see the sketch below this class).
            
            try(BufferedReader br = new BufferedReader(new InputStreamReader(is))) {
                String line = null;
            
                // read each line and write to System.out
                while ((line = br.readLine()) != null) {
                    System.out.println(line);
                }
            }
        }
    }
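
For the binary case mentioned in the comments above, here is a minimal sketch using the same URL/stream approach but copying raw bytes to a file (the class name, URL, and file name are just placeholders):

    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.io.OutputStream;
    import java.net.URL;
    
    public class DownloadFile {
    
        public static void main(String[] args) throws IOException {
            URL url = new URL("http://example.com/some-file.zip");
    
            // Copy raw bytes from the connection to a local file. No Reader is involved,
            // so the content is never (mis)interpreted as text.
            try (InputStream in = url.openConnection().getInputStream();
                 OutputStream out = new FileOutputStream("some-file.zip")) {
                byte[] buffer = new byte[8192];
                int bytesRead;
                while ((bytesRead = in.read(buffer)) != -1) {
                    out.write(buffer, 0, bytesRead);
                }
            }
        }
    }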

Hope this helps.


The Basics

Look at these to build a solution more or less from scratch:

  • Start from the basics: The Java Tutorial's chapter on Networking, including Working With URLs
  • Make things easier for yourself: Apache HttpComponents (including HttpClient); see the sketch just after this list
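
As a taste of the second option, here is a minimal sketch, assuming Apache HttpClient 4.x is on the classpath (the class name and URL are placeholders):

    import java.io.IOException;
    
    import org.apache.http.client.methods.CloseableHttpResponse;
    import org.apache.http.client.methods.HttpGet;
    import org.apache.http.impl.client.CloseableHttpClient;
    import org.apache.http.impl.client.HttpClients;
    import org.apache.http.util.EntityUtils;
    
    public class HttpClientFetch {
    
        public static void main(String[] args) throws IOException {
            // The client is Closeable, so try-with-resources cleans it up.
            try (CloseableHttpClient client = HttpClients.createDefault()) {
                HttpGet get = new HttpGet("http://example.com/");
                try (CloseableHttpResponse response = client.execute(get)) {
                    // EntityUtils consumes the response body and decodes it to a String.
                    System.out.println(EntityUtils.toString(response.getEntity()));
                }
            }
        }
    }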

The Easily Glued-Up and Stitched-Up Stuff

You always have the option of calling external tools from Java using Runtime.exec(), ProcessBuilder, and similar mechanisms. For instance, you could shell out to wget or cURL.
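
A minimal sketch of that approach, assuming a curl binary is on the PATH (the class name, URL, and output file name are placeholders):

    import java.io.IOException;
    
    public class CurlDownload {
    
        public static void main(String[] args) throws IOException, InterruptedException {
            // ProcessBuilder avoids shell-quoting issues; each argument is passed as-is.
            ProcessBuilder pb = new ProcessBuilder(
                    "curl", "-L", "-o", "page.html", "http://example.com/");
    
            // Show curl's progress and errors on our own console.
            pb.inheritIO();
    
            // Start the process and wait for it to finish.
            int exitCode = pb.start().waitFor();
            System.out.println("curl exited with code " + exitCode);
        }
    }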

The Hardcore Stuff

Then, if you want to go for more fully-fledged solutions, the need for automated web testing has thankfully given us very practical tools for this. Look at:

  • HtmlUnit (powerful and simple; see the sketch after this list)
  • Selenium, Selenium-RC
  • WebDriver/Selenium2 (still in the works)
  • JBehave with JBehave Web
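
A minimal HtmlUnit sketch, assuming the HtmlUnit 2.x jars are on the classpath (asText() was renamed asNormalizedText() in later versions; the class name and URL are placeholders):

    import com.gargoylesoftware.htmlunit.WebClient;
    import com.gargoylesoftware.htmlunit.html.HtmlPage;
    
    public class HtmlUnitFetch {
    
        public static void main(String[] args) throws Exception {
            // WebClient is a headless browser and is AutoCloseable in HtmlUnit 2.x.
            try (WebClient webClient = new WebClient()) {
                // JavaScript can be switched off if you only need the static HTML.
                webClient.getOptions().setJavaScriptEnabled(false);
    
                HtmlPage page = webClient.getPage("http://example.com/");
    
                // asText() returns the visible text of the rendered page.
                System.out.println(page.asText());
            }
        }
    }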

Some other libs are purposefully written with web-scraping in mind:

  • JSoup (see the sketch after this list)
  • Jaunt
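
A minimal jsoup sketch, assuming the jsoup jar is on the classpath (the class name, URL, and selector are placeholders):

    import java.io.IOException;
    
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    
    public class JsoupScrape {
    
        public static void main(String[] args) throws IOException {
            // Fetch and parse the page in one call.
            Document doc = Jsoup.connect("http://example.com/").get();
    
            System.out.println("Title: " + doc.title());
    
            // Select elements with CSS selectors; here, every link on the page.
            for (Element link : doc.select("a[href]")) {
                System.out.println(link.attr("abs:href") + " -> " + link.text());
            }
        }
    }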

Some Workarounds

Java is a language, but also a platform on which many other languages run. Some of them offer great syntactic sugar or libraries that make it easy to build scrapers.

Check out:

  • Groovy (and its XmlSlurper)
  • or Scala (with great XML support as presented here and here)

If you know of a great library for Ruby (JRuby, with an article on scraping with JRuby and HtmlUnit) or Python (Jython), or you simply prefer those languages, give their JVM ports a chance.

Some Supplements

Some other similar questions:

  • Scrape data from HTML using Java
  • Options for HTML Scraping


Here's my solution, using URL and a try-with-resources statement to handle the exceptions.

    /**
     * Created by mona on 5/27/16.
     */
    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.net.MalformedURLException;
    import java.net.URL;
    
    public class ReadFromWeb {
    
        public static void readFromWeb(String webURL) throws IOException {
            try {
                // new URL(...) throws MalformedURLException (a subclass of IOException)
                URL url = new URL(webURL);
    
                // try-with-resources closes the reader (and the underlying stream) for us
                try (BufferedReader br = new BufferedReader(new InputStreamReader(url.openStream()))) {
                    String line;
                    while ((line = br.readLine()) != null) {
                        System.out.println(line);
                    }
                }
            } catch (MalformedURLException e) {
                e.printStackTrace();
                throw new MalformedURLException("URL is malformed: " + webURL);
            } catch (IOException e) {
                e.printStackTrace();
                throw e;
            }
        }
    
        public static void main(String[] args) throws IOException {
            String url = "https://madison.craigslist.org/search/sub";
            readFromWeb(url);
        }
    }

You could additionally save it to a file, depending on your needs, or parse it using XML or HTML libraries.
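
If saving to a file is all you need, a minimal sketch using java.nio (the class name and target path are placeholders; the URL reuses the one above):

    import java.io.IOException;
    import java.io.InputStream;
    import java.net.URL;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.nio.file.StandardCopyOption;
    
    public class SaveToFile {
    
        public static void main(String[] args) throws IOException {
            URL url = new URL("https://madison.craigslist.org/search/sub");
            Path target = Paths.get("page.html");
    
            // Files.copy streams the bytes straight to disk, so this also works for binary content.
            try (InputStream in = url.openStream()) {
                Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
            }
        }
    }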


Since Java 11, the most convenient way is to use java.net.http.HttpClient from the standard library.

Example:

    import java.net.Authenticator;
    import java.net.InetSocketAddress;
    import java.net.ProxySelector;
    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpClient.Redirect;
    import java.net.http.HttpClient.Version;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.net.http.HttpResponse.BodyHandlers;
    import java.time.Duration;
    
    public class HttpClientExample {
    
        // send(...) throws IOException and InterruptedException, hence throws Exception here.
        public static void main(String[] args) throws Exception {
            HttpClient client = HttpClient.newBuilder()
                    .version(Version.HTTP_1_1)
                    .followRedirects(Redirect.NORMAL)
                    .connectTimeout(Duration.ofSeconds(20))
                    .proxy(ProxySelector.of(new InetSocketAddress("proxy.example.com", 80)))
                    .authenticator(Authenticator.getDefault())
                    .build();
    
            HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create("https://foo.com/"))
                    .timeout(Duration.ofMinutes(2))
                    .GET()
                    .build();
    
            // BodyHandlers.ofString() decodes the response body as text.
            HttpResponse<String> response = client.send(request, BodyHandlers.ofString());
    
            System.out.println(response.statusCode());
            System.out.println(response.body());
        }
    }


I use the following code for my API:

    try {
        URL url = new URL("https://stackoverflow.com/questions/6159118/using-java-to-pull-data-from-a-webpage");
        try (InputStream content = url.openStream()) {
            // Read one byte at a time; the (char) cast is only safe for
            // single-byte encodings such as ASCII or Latin-1.
            int c;
            while ((c = content.read()) != -1) {
                System.out.print((char) c);
            }
        }
    } catch (MalformedURLException e) {
        e.printStackTrace();
    } catch (IOException ie) {
        ie.printStackTrace();
    }

You can also collect the characters and build a String from them.
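
For example, a minimal variation of the loop above that accumulates the content into a String instead of printing it:

    StringBuilder sb = new StringBuilder();
    try (InputStream content = new URL("https://stackoverflow.com/questions/6159118/using-java-to-pull-data-from-a-webpage").openStream()) {
        int c;
        while ((c = content.read()) != -1) {
            // Same single-byte-encoding caveat as above.
            sb.append((char) c);
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
    String pageText = sb.toString();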
