Java - Parsing HTML - get text
I am tring to get text from a website; when you c开发者_Go百科hange the language the html url have an "/en" inside, but the page that have the information that i want don't have.
http://www.wippro.at/module/gallery/index.php?limitstart=0&picno=0&gallery_key=92
html tags: (the text contains the description of the photo)
<div id="redx_gallery_pic_title"> text text </div>
The problem is that the website is in german and i want the text in english, and my script gets only the german version
Any ideas how can i do it?
java code:
...
URL oracle = new URL(x);
BufferedReader in = new BufferedReader(new InputStreamReader(oracle.openStream()));
String inputLine=null;
StringBuffer theText = new StringBuffer();
while ((inputLine = in.readLine()) != null)
theText.append(inputLine+"\n");
String html = theText.toString();
in.close();
String[] name = StringUtils.substringsBetween(html, "redx_gallery_pic_title\">", "</div>");
That site is internationalized with German as default. You need to tell the server what language you're accepting by specifying the desired ISO 639-1 language code in the Accept-Language
request header.
URLConnection connection = new URL(url).openConnection();
connection.setRequestProperty("Accept-Language", "en");
InputStream input = connection.getInputStream();
// ...
Unrelated to the concrete problem, may I suggest you to have a look at Jsoup as a HTML parser? It's much more convenient with its jQuery-like CSS selector syntax and therefore much less bloated than your attempt as far:
String url = "http://www.wippro.at/module/gallery/index.php?limitstart=0&picno=0&gallery_key=92";
Document document = Jsoup.connect(url).header("Accept-Language", "en").get();
String title = document.select("#redx_gallery_pic_title").text();
System.out.println(title); // Beech, glazing V3
That's all.
精彩评论