Remove Some HTML tags with RegExp and Java

I want to remove HTML tags from a String. That part is easy; this is what I have:

public String removerTags(String html)  
    {  
        return html.replaceAll("\\<(/?[^\\>]+)\\>", " ").replaceAll("\\s+", " ").trim();  
    }  

The problem is that I do not want to remove all the tags. I want the tag

<span style="background-color: yellow">(text)</span>

to stay intact in the string.

I'm using this as a kind of "highlight" for search results in a GWT web application I'm building.

I need to do this because, if the search finds text that contains an HTML tag (the indexing is done by Lucene) and that tag ends up broken, the appendHTML from safeHTMLBuilder cannot build the String.

Is there a reasonably good way to do this?

Hugs.


I strongly suggest you use JSoup for this task. Regular expressions simply aren't well suited to parsing HTML, in my opinion. With JSoup this is basically a simple, readable and easily maintainable one-liner!

Have a look at the Jsoup.clean method, and perhaps this article:

  • Sanitize Untrusted HTML


I found a solution for this problem using only regular expressions:

public static String filterHTMLTags(String html) {

    // save valid tags (the word boundary keeps "b" or "i" from also matching <body> or <iframe>):
    String striped = html.replaceAll("(?i)\\<(\\s*/?(a|h\\d|b|i|em|cite|code|strong|pre|br)\\b.*?/?)\\>", "{{$1}}");
    // remove all tags:
    striped = striped.replaceAll("\\<(/?[^\\>]+)\\>", " ");
    // restore valid tags:
    striped = striped.replaceAll("\\{\\{(.+?)\\}\\}", "<$1>");

    return striped;
}

Be sure that you don't use "{{ ... }}" in your HTML content. You can easily change this "save sequence". The valid tags are defined by the list in the first replaceAll regular expression:

(a|h\d|b|i|em|cite|code|strong|pre|br)

The "h\d" in the list above means that "h1", "h2", ... are valid tags.

I tested it with this code:

public static void main (String[] args) {

    String teste = " <b>test bold chars</b> <BR/> <div>test div</div> \n" +
            " link: <a href=\"test.html\">click here</a> <br />\n" +
            " <script>bad script</script> <notpermitted/>\n";

    System.out.println("teste: \n"+teste);
    System.out.println("\n\n\nstriped: \n"+filterHTMLTags(teste));
}
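
A tag list like the one above only matches whole tag names if a word boundary follows it. Without one, "b" also matches <body> and "i" also matches <iframe>. The sketch below compares the two pattern shapes; the class and method names (TagListCheck, loose, strict) are made up for illustration and are not part of the answer's code.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration only.
final class TagListCheck {
    // Same tag list as in the answer above.
    static final String TAGS = "(a|h\\d|b|i|em|cite|code|strong|pre|br)";

    // No word boundary: the tag name is matched as a prefix.
    static final Pattern LOOSE = Pattern.compile("(?i)<\\s*/?" + TAGS + ".*?/?>");

    // Word boundary after the name: only whole tag names are accepted.
    static final Pattern STRICT = Pattern.compile("(?i)<\\s*/?" + TAGS + "\\b[^>]*>");

    static boolean loose(String tag) {
        return LOOSE.matcher(tag).matches();
    }

    static boolean strict(String tag) {
        return STRICT.matcher(tag).matches();
    }

    public static void main(String[] args) {
        String[] samples = {"<b>", "<br />", "<h2>", "</strong>", "<body>", "<iframe src=\"x\">"};
        for (String tag : samples) {
            System.out.println(tag + "  loose=" + loose(tag) + "  strict=" + strict(tag));
        }
    }
}
```

With these patterns, <body> and <iframe src="x"> are accepted by the loose form and rejected by the strict one, while the listed tags pass both.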

Bye, Sergio Figueiredo - My blog
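
For the specific case in the question (keep only the yellow highlight span), the same save / strip / restore idea could be sketched as follows. This is a rough sketch under assumptions, not a general HTML sanitizer: it assumes the highlighter emits exactly the opening tag shown in the question, that a highlight never contains another span, and that the control characters U+0001 and U+0002 do not carry meaning in the text. The class and method names are hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration only.
final class HighlightFilter {
    // Assumption: the highlighter writes exactly this opening tag.
    private static final String OPEN = "<span style=\"background-color: yellow\">";
    private static final String CLOSE = "</span>";

    // A whole highlight: opening tag, shortest possible content, closing tag.
    private static final Pattern HIGHLIGHT = Pattern.compile(
            Pattern.quote(OPEN) + "(.*?)" + Pattern.quote(CLOSE), Pattern.DOTALL);

    // Any remaining tag, opening or closing.
    private static final Pattern ANY_TAG = Pattern.compile("<[^>]+>");

    static String keepHighlight(String html) {
        // Drop stray marker characters so the input cannot forge a placeholder.
        String s = html.replace("\u0001", "").replace("\u0002", "");
        // Save: swap each complete highlight pair for marker characters.
        s = HIGHLIGHT.matcher(s).replaceAll("\u0001$1\u0002");
        // Strip: remove every other tag, including spans that are not highlights.
        s = ANY_TAG.matcher(s).replaceAll(" ");
        s = s.replaceAll("\\s+", " ").trim();
        // Restore: turn the markers back into the highlight tags.
        return s.replace("\u0001", OPEN).replace("\u0002", CLOSE);
    }

    public static void main(String[] args) {
        String html = "<div>foo " + OPEN + "bar" + CLOSE + " <b>baz</b></div>";
        System.out.println(keepHighlight(html));
    }
}
```

Matching the highlight as a pair means a closing </span> that belongs to some other span is stripped rather than left dangling. Like any regex approach, this still misreads a literal "<" in the text as the start of a tag, which is one reason the parser-based suggestions in the other answers may be the safer route.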


A library I've used to great effect in the past is OWASP AntiSamy.

This definitely allows whitelisting / blacklisting of tags. It may be worth a look.
