Extract text between html tags parsed from xml
Can anyone help me in extracting text from within the html tags to plain text?
I have parsed an xml and get some output as body wh开发者_高级运维ich has html tags now i want to remove the tags and use the text.
thanks in advance!!!!
You can use HTML Parser like JSoup
For example HTML is
<div style="height:240px;"><br>test: example<br>test1:example1</div>
You can get the html using
Document document = Jsoup.parse(html);
Element div = document.select("div[style=height:240px;]").first();
div.html();
Try a HTML Parser.
If the HTML is escaped, i.e. <
instead of <
you might have to decode first.
Considering your requirements you might try Jericho HTML Parser
Take a look at TextExtractor class:
Using the default settings, the source segment:
"<div><b>O</b>ne</div><div title="Two"><b>Th</b><script>//a script </script>ree</div>"
produces the text "One Two Three"
.
If all you want to do is remove HTML tags from a string, you can do this:
String output = input.replaceAll("(?s)\\<.*?\\>", " ");
精彩评论