Extract text between HTML tags parsed from XML

Can anyone help me extract the plain text from inside HTML tags?

I have parsed an XML document and get a body value that contains HTML tags. Now I want to remove the tags and use only the text.

thanks in advance!!!!


You can use an HTML parser such as jsoup.

For example, if the HTML is

<div style="height:240px;"><br>test: example<br>test1:example1</div>

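You can get the inner HTML, or the plain text without tags, using

// Needs org.jsoup.Jsoup, org.jsoup.nodes.Document and org.jsoup.nodes.Element
Document document = Jsoup.parse(html);
Element div = document.select("div[style=height:240px;]").first();
String innerHtml = div.html(); // inner HTML, tags still included
String text = div.text();      // text only, tags removed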


Try an HTML parser.

If the HTML is escaped, i.e. it contains &lt; instead of <, you might have to decode it first.
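
As a rough illustration of that decoding step, here is a minimal sketch using only the standard library. The class and method names (EscapedHtmlText, unescapeBasicEntities) are made up for this example, and it only handles the five most common entities; a real parser or a library unescape function covers far more.

```java
class EscapedHtmlText {
    // Hypothetical helper: decodes only &lt; &gt; &quot; &#39; and &amp;.
    static String unescapeBasicEntities(String s) {
        // &amp; is replaced last so that "&amp;lt;" decodes once to "&lt;"
        // instead of being decoded twice to "<".
        return s.replace("&lt;", "<")
                .replace("&gt;", ">")
                .replace("&quot;", "\"")
                .replace("&#39;", "'")
                .replace("&amp;", "&");
    }

    public static void main(String[] args) {
        String escaped = "&lt;div&gt;&lt;b&gt;O&lt;/b&gt;ne&lt;/div&gt;";
        // Prints: <div><b>O</b>ne</div>
        System.out.println(unescapeBasicEntities(escaped));
    }
}
```

The decoded string can then be handed to whichever parser you choose.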


Considering your requirements, you might try Jericho HTML Parser.

Take a look at its TextExtractor class:

Using the default settings, the source segment: "<div><b>O</b>ne</div><div title="Two"><b>Th</b><script>//a script </script>ree</div>" produces the text "One Two Three".


If all you want to do is remove HTML tags from a string, you can do this:

String output = input.replaceAll("(?s)\\<.*?\\>", " ");
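
To show what that one-liner does on the sample HTML from the jsoup answer, here is a small self-contained sketch. The class and method names (TagStripper, stripTags) are invented for this example, and the whitespace cleanup is an extra step that is not part of the line above.

```java
class TagStripper {
    // Hypothetical helper wrapping the regex from the answer.
    static String stripTags(String input) {
        return input.replaceAll("(?s)\\<.*?\\>", " ") // replace every <...> run with a space
                    .replaceAll("\\s+", " ")          // collapse the runs of whitespace this leaves
                    .trim();
    }

    public static void main(String[] args) {
        String html = "<div style=\"height:240px;\"><br>test: example<br>test1:example1</div>";
        // Prints: test: example test1:example1
        System.out.println(stripTags(html));
    }
}
```

Keep in mind this is only a regex: the contents of script or style elements stay in the output, entities such as &amp; are not decoded, and a literal > inside an attribute value ends the match early. For anything beyond simple markup, a parser is the safer choice.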