Strip tags from text extracted from XML
I am parsing XML documents. I call getTextContent()
to get the text from the particular section I want. The text I get back contains tags like
<italic> </italic>
<sub> </sub>
...and some more. I want to strip off these tags and keep just the text, irrespective of what the tags are.
My document looks like this:
<article>
<sec>Section 1</sec>
<sec>Section 2
<title>Title1</title>
<sec>
<title>Subtitle1</title>
<p>........<italic> </italic>...</p>
</sec>
<sec>
<title>Subtitle2</title>
<p>........<sub> </sub>...</p>
</sec>
</sec>
</article>
I need all the text in <p>...</p>
without the tags in it.
How can I go about it? I was thinking of identifying all the tags and replacing them with "". But there has to be a better way.
Thanks
You could apply this regex to the result of getTextContent():
String noHTMLString = htmlString.replaceAll("\\<.*?\\>", "");
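A minimal sketch of how that one-liner might be wrapped, assuming the tag markup really is present in the string you get back. The class name TagStripper and the method stripTags are made up for illustration. The pattern <[^>]*> is a swapped-in variant of the regex above; unlike .*? it also matches a tag that is broken across lines.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of any library.
class TagStripper {
    // Matches "<", then any run of characters other than ">", then ">".
    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    // Returns the input with every tag-like substring removed.
    static String stripTags(String text) {
        return TAG.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        String raw = "H<sub>2</sub>O is <italic>water</italic>";
        System.out.println(stripTags(raw)); // prints: H2O is water
    }
}
```

Keep in mind that a regex does not understand XML: text such as "a < b and c > d" would lose everything between the two angle brackets, so this is only safe when the text contains no stray < or > characters.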
You could use a Perl script to go through the file and use s/ < [^>]* > //xg;
to get rid of all the tags.
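Whichever language you use, watch the quantifier between the angle brackets: a greedy .* runs to the last > on the line and takes the text between the tags with it. A small Java illustration of the difference (the class name GreedyDemo and both method names are made up for this sketch):

```java
// Hypothetical demo class for illustration only.
class GreedyDemo {
    // Greedy: ".*" extends to the LAST ">" on the line.
    static String stripGreedy(String s) {
        return s.replaceAll("<.*>", "");
    }

    // Reluctant: ".*?" stops at the FIRST ">" after each "<".
    static String stripReluctant(String s) {
        return s.replaceAll("<.*?>", "");
    }

    public static void main(String[] args) {
        String line = "a<italic>b</italic>c";
        System.out.println(stripGreedy(line));    // prints: ac
        System.out.println(stripReluctant(line)); // prints: abc
    }
}
```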