
How to parse content with <pre>?

I am using jsoup to parse a number of things.

I am trying to parse this tag

<pre>Hello World</pre>

But I just can't get it to work.

How would I parse this using jsoup?

    Document jsDoc = Jsoup.connect(url).get();
    Elements titleElements = jsDoc.getElementsByTag("pre");

Here is what I have so far.


Works fine for me with latest Jsoup:

String html = "<p>lorem ipsum</p><pre>Hello World</pre><p>dolor sit amet</p>";
Document document = Jsoup.parse(html);
Elements pres = document.select("pre");

for (Element pre : pres) {
    System.out.println(pre.text());
}

Result:

Hello World

If you get nothing, then the HTML you're parsing simply doesn't contain any <pre> element. Check this yourself by printing the parsed HTML:

System.out.println(document.html());

Perhaps the URL is wrong. Perhaps there's some JavaScript which alters the HTML DOM with new elements (Jsoup neither interprets nor executes JS). Perhaps the site expects a real browser instead of a bot (change the user agent then). Perhaps the site requires a login (you'd need to maintain cookies). Who knows. You can figure this all out with a real web browser like Firefox or Chrome.
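
As a rough illustration of the user agent and cookie suggestions, here is a minimal sketch. The class name PreExtractor, the helper names, the user agent string and the JSESSIONID cookie name are all assumptions made up for this example; use whatever your browser actually sends to the site. The main method only parses an inline string, so it runs without network access.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class PreExtractor {

    // Collects the text of every <pre> element in an already parsed document.
    static List<String> preTexts(Document document) {
        List<String> texts = new ArrayList<>();
        for (Element pre : document.select("pre")) {
            texts.add(pre.text());
        }
        return texts;
    }

    // Fetches a page while presenting a browser-like user agent and one cookie.
    // The user agent string and the cookie name are placeholders (assumptions),
    // not values any particular site is known to require.
    static Document fetch(String url, String sessionId) throws IOException {
        return Jsoup.connect(url)
                .userAgent("Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0")
                .cookie("JSESSIONID", sessionId)
                .timeout(10_000) // milliseconds
                .get();
    }

    public static void main(String[] args) {
        // Offline check: parse a literal string instead of hitting the network.
        String html = "<p>lorem ipsum</p><pre>Hello World</pre><p>dolor sit amet</p>";
        Document document = Jsoup.parse(html);

        List<String> texts = preTexts(document);
        if (texts.isEmpty()) {
            // Nothing matched: dump what was actually parsed to see why.
            System.out.println(document.html());
        } else {
            texts.forEach(System.out::println);
        }
    }
}
```

To try it against the real page, you could call fetch(url, sessionId) and pass the result to preTexts. If the list is still empty, the <pre> is most likely added by JavaScript after the page loads, which Jsoup will not see.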
