Regular expression for removing HTML tags from a string
I am looking for a regular expression to remove all HTML tags from a string in JSP.
Example 1
sampleString = "test string <i>in italics</i> continues";
Example 2
sampleString = "test string <i>in italics";
Example 3
sampleString = "test string <i";
The HTML tag might be complete, partial (without a closing tag), or malformed (missing its closing angle bracket, as in the third example).
Thanks in advance
Case 3 cannot be handled reliably by a regex or by a parser: a fragment such as <i with no closing angle bracket might be legitimate content rather than a broken tag. So forget it.
As for the concrete question, which covers cases 1 and 2, just use an HTML parser. My favourite is Jsoup.
String text = Jsoup.parse(html).text();
That's it. By the way, it also has an HTML cleaner, if that is what you're actually after.
Since you're using JSP, you could also just use JSTL <c:out> or fn:escapeXml() to prevent user-controlled HTML input from being inlined among your HTML (which would open XSS holes).
<c:out value="${bean.property}" />
<input type="text" name="foo" value="${fn:escapeXml(param.foo)}" />
HTML tags will then not be interpreted, but just displayed as plain text.
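To make the escaping idea concrete, here is a minimal sketch in plain Java of what such an escaping step does. The class name XmlEscaper is hypothetical and this is not the JSTL source; it only assumes that the five XML-significant characters are replaced by entities, and the exact entity spellings emitted by a given JSTL implementation may differ.

```java
// Hypothetical helper for illustration only; not part of JSTL.
final class XmlEscaper {
    private XmlEscaper() {}

    /** Replaces the five XML-significant characters with entities. */
    static String escapeXml(String input) {
        if (input == null) {
            return "";
        }
        StringBuilder out = new StringBuilder(input.length() + 16);
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            switch (c) {
                case '&':  out.append("&amp;");  break;
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '"':  out.append("&#034;"); break;
                case '\'': out.append("&#039;"); break;
                default:   out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // prints: test string &lt;i&gt;in italics&lt;/i&gt; continues
        System.out.println(escapeXml("test string <i>in italics</i> continues"));
    }
}
```

Note that this keeps the tags visible as text instead of removing them, which is why it protects against XSS but does not answer the "strip the tags" part of the question.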
<\/?font(\s\w+(\=\".*\")?)*\>
I used this little gem about a week ago to strip a variety of 12-year-old HTML tags, and it worked pretty well. Just replace 'font' with whatever tag you're looking for, or with \w* to get rid of all of them.
Edit: removed the '?' from the end of my regex after realizing that it could remove non-tag data from a file. Basically, this will consistently find cases 1 and 2; if it is used for case 3 (with the '?' appended to the end of the regex), take care to ensure that what is removed really is a tag.
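As a rough sketch of how this regex could be applied in Java to the three sample strings from the question: the class name TagStripper and the method stripTags are hypothetical, 'font' has been replaced by \w* as suggested above, and the backslashes that Java does not need (before /, =, " and >) have been dropped.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration only.
final class TagStripper {
    // The regex from this answer, with 'font' generalized to \w*.
    private static final Pattern TAG =
            Pattern.compile("</?\\w*(\\s\\w+(=\".*\")?)*>");

    private TagStripper() {}

    /** Removes every substring that the tag pattern matches. */
    static String stripTags(String input) {
        return TAG.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        // Case 1, prints: test string in italics continues
        System.out.println(stripTags("test string <i>in italics</i> continues"));
        // Case 2, prints: test string in italics
        System.out.println(stripTags("test string <i>in italics"));
        // Case 3, no closing '>' so nothing matches, prints: test string <i
        System.out.println(stripTags("test string <i"));
    }
}
```

One caveat with this pattern: the .* inside the quoted attribute value is greedy, so on a line containing several tags with quoted attributes it can match from the first quote to the last one and swallow the text between the tags. It also only recognizes double-quoted attribute values.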