
Regular expression for removing HTML tags from a string

I am looking for a regular expression to remove all HTML tags from a string in JSP.

Example 1

sampleString = "test string <i>in italics</i> continues";

Example 2

sampleString = "test string <i>in italics";

Example 3

sampleString = "test string <i";

The HTML tag might be complete, partial (missing its closing tag, as in Example 2), or malformed (missing the closing angle bracket of the opening tag, as in Example 3).

Thanks in advance


Case 3 cannot be handled reliably with either a regex or a parser, because text like <i with no closing angle bracket might just as well be legitimate content. So forget it.

As to the concrete question, which covers cases 1 and 2, just use an HTML parser. My favourite is Jsoup.

String text = Jsoup.parse(html).text();

That's it. By the way, it also has an HTML cleaner, if that is what you're actually after.

Since you're using JSP, you could also just use JSTL <c:out> or fn:escapeXml() to prevent user-controlled HTML input from being inlined into your own HTML (which could open XSS holes).

<c:out value="${bean.property}" />
<input type="text" name="foo" value="${fn:escapeXml(param.foo)}" />

HTML tags will then not be interpreted, but just displayed as plain text.
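To illustrate what that escaping amounts to, here is a rough Java sketch. The class XmlEscaper and its escapeXml method are hypothetical names made up for this example, not the JSTL implementation, and the entity choices are my understanding of the five characters JSTL replaces. In a real JSP you would simply use <c:out> or fn:escapeXml() as shown above.

```java
// Hypothetical helper for illustration only; not the JSTL source.
final class XmlEscaper {
    private XmlEscaper() {}

    // Replaces the five XML-special characters with entities so that a
    // browser displays them as text instead of interpreting them as markup.
    static String escapeXml(String input) {
        if (input == null) {
            return ""; // assumption of this sketch: null becomes an empty string
        }
        StringBuilder out = new StringBuilder(input.length() + 16);
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            switch (c) {
                case '&':  out.append("&amp;");  break;
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '"':  out.append("&#034;"); break;
                case '\'': out.append("&#039;"); break;
                default:   out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escapeXml("test string <i>in italics</i> continues"));
    }
}
```

The tags are not removed here; they are only made harmless, which is often what you actually want when echoing user input.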


<\/?font(\s\w+(\=\".*\")?)*\>

I used this little gem about a week ago to strip a variety of 12-year-old HTML tags, and it worked pretty well. Just replace 'font' with whatever tag you're looking for, or with \w* to get rid of all of them.

Edit: I removed the '?' from the end of my regex after realizing that it could remove non-tag data from a file. Basically, this will consistently find cases 1 and 2. If you use it for case 3 (with the '?' appended to the end of the regex), take care to ensure that what is removed is actually a tag.
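For what it's worth, here is a minimal Java sketch of applying a pattern like this with a plain regex replacement. The class TagStripper and the method stripTags are hypothetical names, and the pattern is a variant rather than the exact regex above: it uses \w+ for the tag name so that any tag matches, and [^"]* instead of .* inside attribute values so that a single match cannot run across several tags on the same line.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
final class TagStripper {
    // Variant of the regex above: \w+ matches any tag name, and [^"]* keeps
    // a quoted attribute value from swallowing text up to a later quote.
    private static final Pattern TAG =
            Pattern.compile("</?\\w+(\\s\\w+(=\"[^\"]*\")?)*>");

    private TagStripper() {}

    // Removes every substring that matches the tag pattern.
    static String stripTags(String input) {
        return TAG.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(stripTags("test string <i>in italics</i> continues"));
        System.out.println(stripTags("test string <i>in italics"));
        System.out.println(stripTags("test string <i"));
    }
}
```

Examples 1 and 2 come out without the <i> tags, and Example 3 is left untouched because there is no closing angle bracket to match. It can still eat legitimate text that happens to look like a tag (something like <b and c>), so for anything beyond a quick cleanup a parser is the safer bet.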
