What's the best way to remove HTML from a string?

I recently started using the following RegEx in a ReReplace() function to strip HTML tags from a string using ColdFusion. Please note: I am not using this as protection from XSS or SQL injection; this is only to remove existing and safe HTML from a string before it's displayed in an HTML title attribute.

REReplaceNoCase(str,"<[^>]*>","","ALL")

In a semi-related question I asked how to modify my RegEx to include spaces and line breaks. I was told that using RegEx for this purpose is not appropriate and this post was referenced as an explanation.

I strongly suspect though that the regular expressions you have posted don't in fact work correctly. I'd advise you not to use regular expressions to parse HTML as HTML is not a regular language. Use an HTML parser instead. (Mark Byers)

If this is true, what is the appropriate tool for removing HTML from a string before it's displayed? (Bearing in mind the HTML is already safe; it's sanitized before entry to the DB.)

I am aware of HTMLEditFormat() and HTMLCodeFormat(), but those two functions do not provide what I need; the former replaces special characters with their HTML-escaped equivalents, while the latter does exactly the same but also wraps the string in a <pre> tag.

What I would like to do is clean a string of HTML and line breaks before I display it in an HTML title attribute: <a title="My string without HTML goes here">...</a>

There are times when the HTML is not necessary. Say you wanted to display an excerpt from a post without the HTML stored along with it, for instance.


I disagree with the reasoning you quote. While HTML should not be parsed with regexen, stripping tags is perfect for them.

But you will want to be more careful than just <[^>]*>, since that would turn

<span title=">">...</span>

into the ill-formed

">...</span>

So you need something like <([^'">]|"[^"]*"|'[^']*')*> instead, which treats each quoted attribute value as a single unit so that a > inside quotes does not end the tag. You can strip out line breaks with plain character replacement instead of a regex, but if you prefer a regex you can use something like \n (or even combine it with the above using alternation, but that's even less efficient).
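To make the difference concrete, here is a minimal sketch in Java (ColdFusion runs on the JVM, so this is an assumption of convenience rather than the CFML you would actually write). It compares the naive pattern from the question with a quote-aware pattern along the lines suggested above, with the single quote also excluded from the unquoted character class. The class and method names (TagStripper, stripNaive, stripTags, forTitle) are hypothetical, and collapsing line breaks into a single space is my own choice for the title-attribute case.

```java
import java.util.regex.Pattern;

class TagStripper {
    // Naive pattern from the question: the tag "ends" at the first '>' seen,
    // even if that '>' sits inside a quoted attribute value.
    private static final Pattern NAIVE_TAG = Pattern.compile("<[^>]*>");

    // Quote-aware pattern: inside a tag, consume either a plain character
    // or a whole double- or single-quoted attribute value.
    private static final Pattern TAG =
        Pattern.compile("<(?:[^'\">]|\"[^\"]*\"|'[^']*')*>");

    // Any run of carriage returns and line feeds.
    private static final Pattern LINE_BREAKS = Pattern.compile("[\\r\\n]+");

    static String stripNaive(String s) {
        return NAIVE_TAG.matcher(s).replaceAll("");
    }

    static String stripTags(String s) {
        return TAG.matcher(s).replaceAll("");
    }

    // Strip tags, then turn each run of line breaks into one space
    // (an assumption about what reads well in a title attribute).
    static String forTitle(String s) {
        return LINE_BREAKS.matcher(stripTags(s)).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        String html = "<span title=\">\">Hello</span>\n<b>world</b>";
        System.out.println(stripNaive(html)); // leaves a stray "> in front of Hello
        System.out.println(forTitle(html));   // Hello world
    }
}
```

This only removes tags; it does not decode entities such as &amp;, and it assumes the input is already safe, as the question states.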


Use the Chilkat HTML parser. We used it in my academic project to fetch all the content and hyperlinks from HTML pages to build a basic search engine.


If the HTML snippet is to be included in a title, you can probably cover all bases with regexes and enough testing.

Still, as a general hint, if you have to handle a larger snippet, I'd go the XML/DOM way with Java, either by parsing with dom4j and grabbing the text or more likely by Stringbuilding the result with a SAX parser.
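As a rough illustration of the SAX route, the sketch below uses only the JDK's built-in parser and appends every text node to a StringBuilder. The names SaxTextExtractor and extractText are hypothetical, and so is the wrapping <root> element, which is only there so that a fragment with several top-level elements still parses. It assumes the snippet is well-formed XML; an unclosed tag or an HTML-only entity like &nbsp; will make the parser reject it.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class SaxTextExtractor {
    // Returns the concatenated text content of a well-formed markup fragment.
    // Throws IllegalArgumentException if the fragment cannot be parsed as XML.
    static String extractText(String fragment) {
        final StringBuilder text = new StringBuilder();

        // The handler ignores elements and attributes and keeps character data only.
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        };

        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            // Hypothetical wrapper element so multi-element fragments have one root.
            String wrapped = "<root>" + fragment + "</root>";
            parser.parse(new InputSource(new StringReader(wrapped)), handler);
        } catch (Exception e) {
            throw new IllegalArgumentException("Fragment is not well-formed XML", e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        System.out.println(extractText("<p>Hello <b>world</b></p>")); // Hello world
    }
}
```

Unlike the regex approach, this decodes the standard XML entities for you (&amp; becomes &), at the cost of failing outright on sloppy markup.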

[EDIT] When I first answered, I was about to write that the HTML must be reasonably well-formed, but I assumed you had at least a bit of control over the source. If you don't, I'll just link quickly to JTidy and TagSoup. I haven't tested either, but they are definitely the first things I would try for consuming real-world HTML with CF.
