
How do I get the source of a given URL from a servlet?

I want to read the source code (the HTML) of a given URL from my servlet.

For example, the URL is http://www.google.com and my servlet needs to read its HTML source code. I need this because my web application is going to read other web pages, extract useful content, and do something with it.

Let's say my application shows a list of shops of one category in a city. The list is generated by having my web application (servlet) go through a given web page that displays various shops and read its content. With the source code my servlet filters out the useful details and finally creates the list (because my servlet has no access to the given URL's web application database).

Does anyone know a solution? (I especially need to do this in servlets.) If you think there is a better way to get details from another site, please let me know.

Thank you


You don't need a servlet to read data from a remote server. You can just use the java.net.URL or java.net.URLConnection class to read remote content from an HTTP server. For example,

InputStream input = (InputStream) new URL("http://www.google.com").getContent();
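getContent() just hands back the connection's input stream; a slightly fuller sketch reads that stream into a String via java.net.URLConnection. (The UrlFetcher class name and the 5-second timeouts are illustrative assumptions, not part of the original answer.)

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;

public class UrlFetcher {
    // Reads an entire stream into a String, line by line.
    static String readAll(InputStream in) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                sb.append(line).append('\n');
            }
        }
        return sb.toString();
    }

    // Opens the connection with timeouts so a slow remote site cannot hang the caller.
    public static String fetch(String address) throws IOException {
        URLConnection conn = new URL(address).openConnection();
        conn.setConnectTimeout(5000);
        conn.setReadTimeout(5000);
        return readAll(conn.getInputStream());
    }
}
```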


Take a look at jsoup for fetching and parsing the HTML.

Document doc = Jsoup.connect("http://en.wikipedia.org/").get();
Elements newsHeadlines = doc.select("#mp-itn b a");


What you are trying to do is called web scraping. Kayak and similar websites do it; do search the web for the term. In Java you can do it like this:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;

URL url = new URL(<your URL>);

BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()));
String inputLine;
StringBuilder response = new StringBuilder();

while ((inputLine = in.readLine()) != null) {
    response.append(inputLine).append("\n");
}

in.close();

response will contain the complete HTML content returned by that URL.


As written above, you don't need a servlet for this purpose. The Servlet API is for responding to requests; a servlet container runs on the server side. If I understand you correctly, you don't need a server for this at all, just a simple HTTP client. I hope the following example helps:

import java.io.IOException;
import java.io.InputStream;
import java.io.UnsupportedEncodingException;

import org.apache.http.HttpResponse;
import org.apache.http.client.HttpClient;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.DefaultHttpClient;

public class SimpleHttpClient {

    public String execute() {
        HttpClient httpClient = new DefaultHttpClient();
        HttpGet httpGet = new HttpGet("http://www.google.com");
        StringBuilder content = new StringBuilder();

        try {
            HttpResponse response = httpClient.execute(httpGet);

            byte[] buffer = new byte[1024];
            InputStream is = response.getEntity().getContent();

            int bytesRead;
            while ((bytesRead = is.read(buffer)) != -1) {
                // Only convert the bytes actually read in this pass,
                // not the whole buffer.
                content.append(new String(buffer, 0, bytesRead, "UTF-8"));
            }
        } catch (UnsupportedEncodingException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
        return content.toString();
    }
}


There are several solutions.

The simplest one is using regular expressions. If you just want to extract links from tags like <a href="THE URL">, use a regular expression like <a\s+href\s*=\s*["']?([^"'>\s]+). group(1) contains the URL. Now just create a Matcher and iterate over your document while matcher.find() is true.
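The Matcher loop described above can be sketched like this. (The LinkExtractor class name and the exact pattern are illustrative; a pattern this simple handles common anchor tags only, and robust HTML needs a real parser.)

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LinkExtractor {
    // Collects href values from anchor tags in an HTML string.
    static List<String> extractLinks(String html) {
        Pattern p = Pattern.compile(
                "<a\\s+[^>]*href\\s*=\\s*[\"']?([^\"'>\\s]+)",
                Pattern.CASE_INSENSITIVE);
        Matcher m = p.matcher(html);
        List<String> links = new ArrayList<>();
        while (m.find()) {
            links.add(m.group(1)); // group(1) holds the URL
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<p><a href=\"http://example.com\">Example</a> and "
                    + "<a href='/about'>About</a></p>";
        System.out.println(extractLinks(html)); // [http://example.com, /about]
    }
}
```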

The next solution is using an XML parser to parse the HTML. This works fine if the sites are written in well-formed HTML (XHTML). Since that is not always true, this solution is applicable to selected sites only.
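For a well-formed XHTML page, the JDK's own DOM parser is enough; a minimal sketch (the XhtmlLinks class name is an illustrative assumption) that collects href attributes:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class XhtmlLinks {
    // Parses an XHTML string with the JDK DOM parser and collects href values.
    // This throws on malformed markup, which is why it only suits XHTML pages.
    static List<String> hrefs(String xhtml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
        NodeList anchors = doc.getElementsByTagName("a");
        List<String> out = new ArrayList<>();
        for (int i = 0; i < anchors.getLength(); i++) {
            out.add(((Element) anchors.item(i)).getAttribute("href"));
        }
        return out;
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><body><a href=\"http://example.com\">Example</a></body></html>";
        System.out.println(hrefs(page)); // [http://example.com]
    }
}
```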

The next solution is using the java built-in HTML parser: http://java.sun.com/products/jfc/tsc/articles/bookmarks/

The next and most flexible way is using a "real" HTML parser, or even better a Java-based HTML browser: Java HTML Parsing

Now it depends on the details of your task. If parsing static anchor tags is enough, use regular expressions. If not, choose one of the other ways.


As people said, you may use the core classes java.net.URL and java.net.URLConnection to fetch web pages. But Apache HttpClient is more convenient for that purpose. Look for docs & examples here: http://hc.apache.org/
