Remove Some HTML tags with RegExp and Java
I want to remove HTML tags from a String. I know this is easy; I did it like this:
public String removerTags(String html)
{
return html.replaceAll("\\<(/?[^\\>]+)\\>", " ").replaceAll("\\s+", " ").trim();
}
The problem is that I do not want to remove all the tags. I want the tag
<span style="background-color: yellow"> (text) </span>
to stay intact in the string.
I'm using this as a kind of "highlight" in the search feature of a GWT web application I'm building.
I need to do this because if the search finds text that contains an HTML tag (the indexing is done by Lucene) and that tag is broken, the appendHTML from safeHTMLBuilder is unable to build the String.
Is there a reasonably good way to do this?
Hugs.
I strongly suggest you use jsoup for this task. Regular expressions simply aren't well suited to handling HTML, in my opinion, and with jsoup this is basically a simple, readable and easily maintainable one-liner.
Have a look at the Jsoup.clean method, and perhaps this article:
- Sanitize Untrusted HTML
I found a solution for this problem using only regular expressions:
public static String filterHTMLTags(String html) {
// save valid tags (the word boundary keeps "a", "b" and "i" from also matching <applet>, <body>, <img>, ...):
String striped = html.replaceAll("(?i)\\<(\\s*/?(a|h\\d|b|i|em|cite|code|strong|pre|br)\\b.*?/?)\\>", "{{$1}}");
// remove all tags:
striped = striped.replaceAll("\\<(/?[^\\>]+)\\>", " ");
// restore valid tags:
striped = striped.replaceAll("\\{\\{(.+?)\\}\\}", "<$1>");
return striped;
}
Make sure that you don't use "{{ ... }}" in your HTML content. You can easily change this "save sequence". The valid tags are defined in the list in the first replaceAll regular expression:
(a|h\d|b|i|em|cite|code|strong|pre|br)
The "h\d" in the above list means that "h1, h2, ..." are valid tags.
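As a rough sketch of how the "save sequence" and the tag list could be made configurable, the variant below builds the first pattern from a list of tag names. The class name TagWhitelistFilter, the method name filter, and the choice of Unicode private-use characters as the save sequence are all assumptions made for illustration, not part of the code above. It also assumes a word boundary after the tag name is wanted, so that "b" does not match <body>.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class TagWhitelistFilter {
    // Hypothetical save sequence: private-use characters, unlikely to occur in real content.
    private static final String OPEN = "\uE000";
    private static final String CLOSE = "\uE001";

    // allowedTags entries are used as regex fragments, e.g. "h\\d" for h1..h9.
    static String filter(String html, List<String> allowedTags) {
        String names = String.join("|", allowedTags);
        // The \b after the name stops "b" from matching <body>, "a" from matching <applet>, etc.
        Pattern allowed = Pattern.compile("(?i)<(\\s*/?\\s*(?:" + names + ")\\b[^>]*)>");
        // 1. save valid tags
        String s = allowed.matcher(html).replaceAll(OPEN + "$1" + CLOSE);
        // 2. remove all remaining tags
        s = s.replaceAll("<[^>]+>", " ");
        // 3. restore valid tags
        return s.replace(OPEN, "<").replace(CLOSE, ">");
    }

    public static void main(String[] args) {
        String html = "<b>x</b> <body>y</body> <script>z</script>";
        System.out.println(filter(html, Arrays.asList("b", "br")));
    }
}
```

Note that attributes on the saved tags (for example an onclick on an allowed <a>) pass through untouched, so this is not a sanitizer for untrusted input.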
I tested this with this code:
public static void main (String[] args) {
String teste = " <b>test bold chars</b> <BR/> <div>test div</div> \n" +
" link: <a href=\"test.html\">click here</a> <br />\n" +
" <script>bad script</script> <notpermitted/>\n";
System.out.println("teste: \n"+teste);
System.out.println("\n\n\nstriped: \n"+filterHTMLTags(teste));
}
Bye, Sergio Figueiredo - My blog
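For the specific case in the question (keep only the yellow highlight span, drop everything else), a single pass with a negative lookahead may be enough. This is a minimal sketch: the class and method names are made up for illustration, and it assumes the opening tag always appears exactly as written in the question.

```java
import java.util.regex.Pattern;

class HighlightTagFilter {
    // Matches every tag EXCEPT the exact highlight opening tag and any closing </span>.
    private static final Pattern NON_HIGHLIGHT_TAG = Pattern.compile(
            "(?i)<(?!span style=\"background-color: yellow\">|/\\s*span\\s*>)/?[^>]+>");

    static String keepHighlightOnly(String html) {
        return NON_HIGHLIGHT_TAG.matcher(html).replaceAll(" ")
                .replaceAll("\\s+", " ")
                .trim();
    }

    public static void main(String[] args) {
        String html = "<p>foo <span style=\"background-color: yellow\">bar</span> <b>baz</b></p>";
        // prints: foo <span style="background-color: yellow">bar</span> baz
        System.out.println(keepHighlightOnly(html));
    }
}
```

One limitation: every </span> is kept, so if the input contains other span tags, their opening tags are removed while their closing tags remain. A real HTML parser avoids that kind of mismatch.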
A library I've used to great effect in the past is OWASP AntiSamy.
It allows whitelisting / blacklisting of tags, so it may be worth a look.