
Java: replace all non-HTML tags in a String

I'd like to replace all the tag-looking parts of a String if they are not valid HTML tags. A tag-looking part is anything enclosed in <> brackets, e.g. <myemail@email.com> or <hello>, but <br>, <div>, and so on have to be kept.

Do you have any idea how to achieve this?

Any help is appreciated!

cheers,

balázs


You can use JSoup to clean HTML.

String cleaned = Jsoup.clean(html, Whitelist.relaxed());

You can either use one of the predefined Whitelists or create your own custom one, in which you specify which HTML elements you want to allow through the cleaner. Everything else is removed. (In jsoup 1.14.2 and later the class is named Safelist; Whitelist was deprecated and later removed.)


Your specific example would be:

String html = "one two three <blabla> four <text> five <div class=\"bold\">six</div>";
String cleaned = Jsoup.clean(html, Whitelist.relaxed().addAttributes("div", "class"));
System.out.println(cleaned);

Output:

one two three  four  five 
<div class="bold">
 six
</div>


Have a look at the java.util.Scanner class: you can set a delimiter and then check whether each token matches an HTML tag or not. You will have to build an array of the tag names that should be ignored, i.e. left in place.


You may also want to handle end tags in your comparison algorithm: look for a leading forward slash (which marks an HTML end tag) and strip it before comparing the name against your list.
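
The two suggestions above (compare each tag-looking part against a list of tags to keep, and strip the leading slash of an end tag first) could be sketched roughly as follows. Note that this sketch uses java.util.regex rather than Scanner, and that the class name TagFilter, the method stripUnknownTags and the contents of the ALLOWED set are made up for illustration; adjust the set to the tags you actually want to keep.

```java
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TagFilter {
    // Hypothetical allow-list of tag names to keep; not a complete list of HTML tags.
    private static final Set<String> ALLOWED = Set.of("br", "div", "p", "span", "b", "i", "a");

    // Anything enclosed in <...> that contains no further angle brackets.
    private static final Pattern TAG_LIKE = Pattern.compile("<([^<>]*)>");

    // Optional leading slash (end tag), then a name that must be followed
    // by whitespace, a slash, or the end of the tag content.
    private static final Pattern TAG_NAME =
            Pattern.compile("^/?\\s*([A-Za-z][A-Za-z0-9]*)(?=[\\s/]|$)");

    // Removes every tag-looking part whose name is not in ALLOWED.
    static String stripUnknownTags(String input) {
        Matcher m = TAG_LIKE.matcher(input);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            String keep = isAllowed(m.group(1)) ? m.group() : "";
            // quoteReplacement guards against '$' and '\' in the kept tag.
            m.appendReplacement(out, Matcher.quoteReplacement(keep));
        }
        m.appendTail(out);
        return out.toString();
    }

    private static boolean isAllowed(String inner) {
        Matcher name = TAG_NAME.matcher(inner.trim());
        return name.find() && ALLOWED.contains(name.group(1).toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        String html = "one <blabla> two <div class=\"bold\">six</div> <myemail@email.com><br>";
        System.out.println(stripUnknownTags(html));
    }
}
```

This only looks at the tag name. It does not check attributes or nesting, so it is not a sanitizer for untrusted input; it needs Java 9 or later for Set.of and the StringBuilder overload of appendReplacement.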


If you are doing this in order to display untrusted data on a web page, simply removing invalid tags is not enough. Take a look at OWASP AntiSamy.
