
Strip tags from text extracted from XML

I am parsing XML documents. I call getTextContent() to get the text from the particular section that I want. The text that I get back contains tags like

<italic> </italic>
<sub> </sub>

...and some more. I want to strip these tags and keep just the text, irrespective of what the tags are.

My document looks like this

<article>
   <sec>Section 1</sec>  
   <sec>Section 2
      <title>Title1</title>
      <sec>
         <title>Subtitle1</title>
         <p>........<italic> </italic>...</p>
      </sec>
      <sec>
         <title>Subtitle2</title>
         <p>........<sub> </sub>...</p>
      </sec>
   </sec>
</article>

I need all the text in <p>...</p> without the tags in it. How can I go about it? I was thinking of identifying all the tags and replacing them with "". But there has to be a better way.

Thanks


You could apply this regex to the result of getTextContent():

String noHTMLString = htmlString.replaceAll("\\<.*?\\>", "");
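
For reference, here is a minimal self-contained sketch of the same idea. The class name TagStripper and the sample string are made up for illustration, and the pattern uses a negated character class instead of the reluctant .*? so that a match can never run past the closing >. It assumes the text contains nothing that merely looks like a tag (for example a literal "a < b > c").

```java
import java.util.regex.Pattern;

class TagStripper {
    // Matches '<', then any run of characters other than '>', then '>'.
    // Compiled once so repeated calls do not recompile the pattern.
    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    // Returns the input with every tag-like substring removed.
    static String stripTags(String text) {
        return TAG.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        // Hypothetical sample resembling the question's <p> content.
        String raw = "H<sub>2</sub>O is <italic>water</italic>";
        System.out.println(stripTags(raw)); // prints "H2O is water"
    }
}
```

Only the markup is removed; the text between an opening and closing tag is kept, which is what the question asks for.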


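You could use a Perl script to go through the file, then use s/<[^>]*>//g; to get rid of all the tags. (A greedy .* between the angle brackets would delete everything from the first < to the last > on a line, text included.)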
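
If the file is parsed with the standard DOM API anyway, it may not need a regex or a separate script at all: calling getTextContent() on the <p> element itself returns the concatenated text of all its descendants, with child elements such as <italic> and <sub> contributing only their text. The sketch below assumes that setup; the class name ParagraphText and the sample XML are hypothetical. If tags still show up in the output, they are probably stored as escaped text or CDATA rather than as real elements, and a string-level cleanup like the ones above would still be needed.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class ParagraphText {
    // Returns the text of every <p> element, with nested markup dropped.
    static List<String> paragraphTexts(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            // getElementsByTagName searches the whole tree in document order.
            NodeList paragraphs = doc.getElementsByTagName("p");
            List<String> texts = new ArrayList<>();
            for (int i = 0; i < paragraphs.getLength(); i++) {
                // getTextContent() concatenates the text of all descendant nodes.
                texts.add(paragraphs.item(i).getTextContent());
            }
            return texts;
        } catch (Exception e) {
            // Wrap parser and I/O errors so callers need no checked exceptions.
            throw new IllegalArgumentException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<article><sec><title>Subtitle1</title>"
                + "<p>H<sub>2</sub>O is <italic>water</italic></p></sec></article>";
        System.out.println(paragraphTexts(xml)); // prints "[H2O is water]"
    }
}
```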
