Reading a website's contents into a string
Currently I'm working on a class that can be used to read the contents of the website specified by a URL. I'm just beginning my adventures with java.io and java.net, so I need to consult my design.
Usage:
TextURL url = new TextURL(urlString);
String contents = url.read();
My code:
package pl.maciejziarko.util;

import java.io.*;
import java.net.*;

public final class TextURL
{
    private static final int BUFFER_SIZE = 1024 * 10;
    private static final int ZERO = 0;

    private final byte[] dataBuffer = new byte[BUFFER_SIZE];
    private final URL urlObject;

    public TextURL(String urlString) throws MalformedURLException
    {
        this.urlObject = new URL(urlString);
    }

    public String read()
    {
        final StringBuilder sb = new StringBuilder();
        try
        {
            final BufferedInputStream in =
                    new BufferedInputStream(urlObject.openStream());
            try
            {
                int bytesRead = ZERO;
                while ((bytesRead = in.read(dataBuffer, ZERO, BUFFER_SIZE)) >= ZERO)
                {
                    sb.append(new String(dataBuffer, ZERO, bytesRead));
                }
            }
            finally
            {
                in.close();
            }
        }
        catch (UnknownHostException e)
        {
            return null;
        }
        catch (IOException e)
        {
            return null;
        }
        return sb.toString();
    }

    //Usage:
    public static void main(String[] args)
    {
        try
        {
            TextURL url = new TextURL("http://www.flickr.com/explore/interesting/7days/");
            String contents = url.read();
            if (contents != null)
                System.out.println(contents);
            else
                System.out.println("ERROR!");
        }
        catch (MalformedURLException e)
        {
            System.out.println("Check the url!");
        }
    }
}
My question is: Is it a good way to achieve what I want? Are there any better solutions?
I particularly didn't like sb.append(new String(dataBuffer, ZERO, bytesRead));
but I wasn't able to express it in a different way. Is it good to create a new String every iteration? I suppose not.
Any other weak points?
Thanks in advance!
Consider using URLConnection instead. Furthermore, you might want to leverage IOUtils from Apache Commons IO to make the string reading easier too. For example:
URL url = new URL("http://www.example.com/");
URLConnection con = url.openConnection();
InputStream in = con.getInputStream();
String encoding = con.getContentEncoding(); // ** WRONG: should use "con.getContentType()" instead but it returns something like "text/html; charset=UTF-8" so this value must be parsed to extract the actual encoding
encoding = encoding == null ? "UTF-8" : encoding;
String body = IOUtils.toString(in, encoding);
System.out.println(body);
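As the comment above notes, `getContentType()` returns something like "text/html; charset=UTF-8", so the charset parameter has to be parsed out. A minimal sketch of that parsing (the class and method names here are hypothetical; real-world headers can also carry quoted values and extra parameters):

```java
public class CharsetFromContentType {
    // Extracts the charset parameter from a Content-Type value such as
    // "text/html; charset=UTF-8"; falls back to UTF-8 when it is absent.
    public static String charsetOf(String contentType) {
        if (contentType != null) {
            for (String part : contentType.split(";")) {
                part = part.trim();
                if (part.toLowerCase().startsWith("charset=")) {
                    return part.substring("charset=".length()).replace("\"", "");
                }
            }
        }
        return "UTF-8";
    }

    public static void main(String[] args) {
        System.out.println(charsetOf("text/html; charset=ISO-8859-1")); // ISO-8859-1
        System.out.println(charsetOf("text/html"));                     // UTF-8
    }
}
```

The result can then be passed straight to IOUtils.toString(in, encoding) in place of the getContentEncoding() call above.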
If you don't want to use IOUtils, I'd probably rewrite that line above as something like:
ByteArrayOutputStream baos = new ByteArrayOutputStream();
byte[] buf = new byte[8192];
int len = 0;
while ((len = in.read(buf)) != -1) {
baos.write(buf, 0, len);
}
String body = new String(baos.toByteArray(), encoding);
I highly recommend using a dedicated library, like HtmlParser:
Parser parser = new Parser (url);
NodeList list = parser.parse (null);
System.out.println (list.toHtml ());
Writing your own HTML parser is a waste of time. The library is available as a Maven dependency; look at its JavaDoc to dig into its features.
Looking at the following sample should be convincing:
Parser parser = new Parser(url);
NodeList movies = parser.extractAllNodesThatMatch(
new AndFilter(new TagNameFilter("div"),
new HasAttributeFilter("class", "movie")));
Unless this is some sort of exercise that you want to code for the sake of learning, I would not reinvent the wheel and would use HttpURLConnection instead. HttpURLConnection provides good encapsulation mechanisms to deal with the HTTP protocol. For instance, your code doesn't work with HTTP redirections; HttpURLConnection would fix that for you.
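A minimal sketch of such a fetch (the HttpFetch class is hypothetical, UTF-8 is assumed for brevity; HttpURLConnection follows same-protocol redirects by default, which is made explicit here):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class HttpFetch {
    // Fetches the body of a URL over HTTP, letting HttpURLConnection
    // handle redirects, and cleans up the connection afterwards.
    public static String fetch(String urlString) throws IOException {
        HttpURLConnection con = (HttpURLConnection) new URL(urlString).openConnection();
        con.setInstanceFollowRedirects(true); // explicit, though true by default
        con.setConnectTimeout(10_000);
        con.setReadTimeout(10_000);
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(con.getInputStream(), StandardCharsets.UTF_8))) {
            StringBuilder sb = new StringBuilder();
            char[] buf = new char[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
            return sb.toString();
        } finally {
            con.disconnect();
        }
    }
}
```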
You can wrap your InputStream in an InputStreamReader and use its read() method to read character data directly (note that you should specify the encoding when creating the Reader, but finding out the encoding of arbitrary URLs is non-trivial). Then simply call sb.append() with the char[] you just read (and the correct offset and length).
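A sketch of that approach (the ReaderToString class is hypothetical; the point is that appending the char[] directly avoids creating a new String per iteration and cannot split multi-byte characters the way the byte-based version can):

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ReaderToString {
    // Reads character data through an InputStreamReader with an explicit
    // charset and appends each char[] chunk to the StringBuilder directly,
    // so no intermediate String objects are created per iteration.
    public static String readAll(InputStream in, Charset cs) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(in, cs)) {
            char[] buf = new char[4096];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n); // append chars with offset and length
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "zażółć gęślą jaźń".getBytes(StandardCharsets.UTF_8);
        System.out.println(readAll(new ByteArrayInputStream(data), StandardCharsets.UTF_8));
    }
}
```

In your class, the InputStream would come from urlObject.openStream() instead of the ByteArrayInputStream used here for demonstration.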
I know this is an old question, but I'm sure other people will find it too.
If you don't mind an additional dependency, here's a very simple way
Jsoup.connect("http://example.com/").get().toString()
You'll need the Jsoup library, but you can quickly add it with Maven/Gradle, and it also allows you to manipulate the contents of the page and find specific nodes.
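For instance, finding specific nodes with a CSS selector might look like this (parsing from a string here so it runs offline; Jsoup.connect(url).get() yields the same Document type for a live page, and the sample HTML is made up):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class JsoupExample {
    public static void main(String[] args) {
        // Parse HTML and pull out a specific node with a CSS selector.
        Document doc = Jsoup.parse("<div class=\"movie\"><h2>Alien</h2></div>");
        String title = doc.select("div.movie h2").first().text();
        System.out.println(title); // prints "Alien"
    }
}
```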
Please use these lines of code; they should help:
URL uri = new URL("Your url");
URLConnection ec = uri.openConnection();
BufferedReader in = new BufferedReader(new InputStreamReader(
        ec.getInputStream(), "UTF-8"));
String inputLine;
StringBuilder a = new StringBuilder();
while ((inputLine = in.readLine()) != null)
    a.append(inputLine);
in.close();
System.out.println(a.toString());
Note that readLine() strips the line separators, so they are lost in the resulting string.