
Java - Parsing HTML - get text

I am trying to get text from a website. When you change the language, the page URL gets an "/en" segment, but the URL of the page that has the information I want does not.

http://www.wippro.at/module/gallery/index.php?limitstart=0&picno=0&gallery_key=92

HTML tag (its text contains the description of the photo):
<div id="redx_gallery_pic_title"> text text </div>

The problem is that the website is in German and I want the text in English, but my script only gets the German version.

Any ideas on how I can do this?

Java code:
...
URL oracle = new URL(x);
StringBuilder theText = new StringBuilder();
// try-with-resources closes the reader even if reading fails
try (BufferedReader in = new BufferedReader(new InputStreamReader(oracle.openStream()))) {
    String inputLine;
    while ((inputLine = in.readLine()) != null) {
        theText.append(inputLine).append("\n");
    }
}
String html = theText.toString();

// StringUtils comes from Apache Commons Lang
String[] name = StringUtils.substringsBetween(html, "redx_gallery_pic_title\">", "</div>");


That site is internationalized with German as default. You need to tell the server what language you're accepting by specifying the desired ISO 639-1 language code in the Accept-Language request header.

URLConnection connection = new URL(url).openConnection();
connection.setRequestProperty("Accept-Language", "en");
InputStream input = connection.getInputStream();
// ...
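
For completeness, here is a minimal sketch of how the elided part could be filled in using only the standard library. The class name GalleryTitleFetcher and the methods fetch and extractTitle are hypothetical names chosen for illustration, the UTF-8 charset is an assumption about the server's response, and the regular expression only handles a title div without nested div elements.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original question or answer.
final class GalleryTitleFetcher {

    // Captures the text inside <div id="redx_gallery_pic_title"> ... </div>.
    // Assumes the div contains no nested <div> elements.
    private static final Pattern TITLE = Pattern.compile(
            "<div\\s+id=\"redx_gallery_pic_title\"\\s*>(.*?)</div>", Pattern.DOTALL);

    // Downloads the page while asking the server for the given language.
    static String fetch(String url, String language) throws IOException {
        URLConnection connection = new URL(url).openConnection();
        connection.setRequestProperty("Accept-Language", language);
        StringBuilder html = new StringBuilder();
        // Assumption: the response is UTF-8 encoded.
        try (BufferedReader in = new BufferedReader(new InputStreamReader(
                connection.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                html.append(line).append('\n');
            }
        }
        return html.toString();
    }

    // Returns the trimmed title text, or null if the div is not present.
    static String extractTitle(String html) {
        Matcher matcher = TITLE.matcher(html);
        return matcher.find() ? matcher.group(1).trim() : null;
    }

    public static void main(String[] args) throws IOException {
        // With a URL argument the page is fetched in English;
        // without one, an inline sample is parsed so the sketch runs offline.
        String html = args.length > 0
                ? fetch(args[0], "en")
                : "<div id=\"redx_gallery_pic_title\"> sample title </div>";
        System.out.println(extractTitle(html));
    }
}
```

Keeping the download and the extraction in separate methods makes the extraction easy to check against a saved copy of the page without touching the network.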

Unrelated to the concrete problem, may I suggest having a look at Jsoup as an HTML parser? Its jQuery-like CSS selector syntax makes it much more convenient, and the resulting code is much less bloated than your attempt so far:

String url = "http://www.wippro.at/module/gallery/index.php?limitstart=0&picno=0&gallery_key=92";
Document document = Jsoup.connect(url).header("Accept-Language", "en").get();
String title = document.select("#redx_gallery_pic_title").text();
System.out.println(title); // Beech, glazing V3

That's all.
