Removing HTML tags except a few specific ones from a String in Java
My input is a plain-text string, and the requirement is to remove all HTML tags except a few specific ones, such as:
<p>
<li>
<u>
<li>
If these specific tags have attributes like class or id, I want to remove these attributes.
A few examples:
<a href = "#">Link</a> -> Link
<p>paragraph</p> -> <p>paragraph</p>
<p class="class1">paragraph</p> -> <p>paragraph</p>
I have gone through the question "Remove HTML tags from a String", but it does not answer my question completely.
Can this be handled by a set of regexes, or could I make use of some library?
I tried jsoup, and it seems to be able to handle all such cases. Here is some example code.
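// Requires jsoup (org.jsoup.Jsoup, org.jsoup.safety.Whitelist) and
// Apache Commons Lang (StringEscapeUtils).
public String clean(String unsafe) {
    // Start from an empty whitelist and allow only the tags to keep; attributes are dropped.
    Whitelist whitelist = Whitelist.none();
    whitelist.addTags("p", "br", "ul");
    // jsoup escapes leftover text (e.g. "<" becomes "&lt;"), so unescape it again.
    String safe = Jsoup.clean(unsafe, whitelist);
    return StringEscapeUtils.unescapeXml(safe);
}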
For the input string
String unsafe = "<p class='p1'>paragraph</p>< this is not html > <a link='#'>Link</a> <![CDATA[<sender>John Smith</sender>]]>";
I get the following output, which is pretty much what I require.
<p>paragraph</p>< this is not html > Link <sender>John Smith</sender>
For simple HTML, this may be sufficient:
// remove any <script> elements, including their content ((?s) lets . match line breaks)
html = html.replaceAll("(?is)<script.*?</script>", "");
// this removes any attributes
html = html.replaceAll("(?i)<([a-zA-Z0-9_-]*)(\\s[^>]*)>", "<$1>");
// this removes any tags except li and p (\\b stops <pre> or <link> from counting as p or li)
html = html.replaceAll("(?i)<(?!/?(li|p)\\b)[^>]*>", "");
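The three replacements above can be wrapped into one helper and tried against the examples from the question. This is only a minimal sketch using nothing but java.util.regex. The class name TagStripper and the method name stripTags are made up for illustration, and a word boundary is added after the tag names so that tags such as <pre> or <link> are not mistaken for <p> or <li>.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration; the names are not from any library.
final class TagStripper {
    // Drop <script> elements together with their content; (?s) lets '.' span line breaks.
    private static final Pattern SCRIPT =
            Pattern.compile("(?is)<script.*?</script>");
    // Capture the tag name and discard everything after the first whitespace (the attributes).
    private static final Pattern ATTRIBUTES =
            Pattern.compile("<([a-zA-Z0-9_-]+)\\s[^>]*>");
    // Match every tag, opening or closing, whose name is not exactly p or li.
    private static final Pattern OTHER_TAGS =
            Pattern.compile("(?i)<(?!/?(?:li|p)\\b)[^>]*>");

    static String stripTags(String html) {
        String result = SCRIPT.matcher(html).replaceAll("");
        result = ATTRIBUTES.matcher(result).replaceAll("<$1>");
        return OTHER_TAGS.matcher(result).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(stripTags("<a href = \"#\">Link</a>"));          // prints: Link
        System.out.println(stripTags("<p>paragraph</p>"));                  // prints: <p>paragraph</p>
        System.out.println(stripTags("<p class=\"class1\">paragraph</p>")); // prints: <p>paragraph</p>
    }
}
```

Note that, unlike the jsoup approach, this also deletes text that merely looks like a tag, such as "< this is not html >", because anything between < and > that is not a p or li tag is removed.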
Hope that helps.