Removing all tags except a few whitelisted ones with a regular expression
I have a text with some HTML-like tags, which I would like to remove. I only want to allow about a dozen whitelisted tags, like <b> or <i>. I can't use PHP's strip_tags(), as I need a more general solution using regular expressions (some of my other tags use different conventions, for example [tag] instead of <tag>). How do I achieve this effect?
The regular expression I use right now is:
return preg_replace('/ \<[^\>]+\>/', '', $text);
How should I change it to exclude the tags I mentioned? I looked through similar questions but they don't provide a solution to the specific problem I mentioned here.
If you can't use PHP's strip_tags(), use HTMLPurifier, which will allow you to implement all sorts of rules, safely.
To answer your question anyway, you could use a negative lookahead assertion (?!...) to exclude things from matching:
preg_replace('#<(?!/?(a|b|i|div)\b)[^>]+>#', '', $text);
But keep in mind that this is not a very reliable approach. Filtering tag names is the easy part. For complete sanitization you'd also have to clean up attributes, and that is where it becomes complicated. Try HTMLPurifier, which already contains heaps of regular expressions to do so.
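As a rough sketch of the lookahead idea above, the whitelist can be built from an array and the bracket characters passed in, which also covers the [tag] convention from the question. The function name strip_unlisted_tags() and its parameters are hypothetical, and the case-insensitive i flag is an added assumption that the answer's pattern does not use.

```php
<?php
// Hypothetical helper: name, signature and the "i" flag are assumptions,
// not part of the answer above.
function strip_unlisted_tags($text, array $allowed, $open = '<', $close = '>')
{
    // Escape the brackets so that "[" and "]" are matched literally.
    $o = preg_quote($open, '#');
    $c = preg_quote($close, '#');

    // Build "a|b|i|div" from the whitelist, escaping each name.
    $names = implode('|', array_map(function ($tag) {
        return preg_quote($tag, '#');
    }, $allowed));

    // With an empty whitelist there is nothing to exclude, so drop the lookahead.
    $guard = ($names === '') ? '' : '(?!/?(?:' . $names . ')\b)';

    // Opening bracket, NOT followed by an optional slash plus a whitelisted
    // name, then everything up to the closing bracket.
    $pattern = '#' . $o . $guard . '[^' . $c . ']+' . $c . '#i';

    return preg_replace($pattern, '', $text);
}

$html   = strip_unlisted_tags('<p>Hello <b>bold</b> and <i>it</i><br/></p>', array('b', 'i'));
$bbcode = strip_unlisted_tags('[quote]Hi [b]x[/b][/quote]', array('b'), '[', ']');

echo $html, "\n";   // Hello <b>bold</b> and <i>it</i>
echo $bbcode, "\n"; // Hi [b]x[/b]
```

Note that whitelisted tags are kept together with whatever attributes they carry (onclick and so on), which is exactly the part this approach does not sanitize.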
$wl = '(?!(?:b|tr|td)\b)'; // whitelist in group
$rxtags = '
  <
  (?:
      (?:
          (?:
              (?:' ."$wl". 'script|' ."$wl". 'style) \s*
            | (?:' ."$wl". 'script|' ."$wl". 'style) \s+ (?:".*?"|\'.*?\'|[^>]*?)+\s*
          )> .*? </(?:' ."$wl". 'script|' ."$wl". 'style)\s*
      )
    |
      (?:
          /?' ."$wl". '\w+\s*/?
        | ' ."$wl". '\w+\s+ (?:".*?"|\'.*?\'|[^>]*?)+\s*/?
        | !(?:DOCTYPE.*?|--.*?--)
      )
  )
  >';
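In Perl this is applied as:
s/$rxtags//xsg
In PHP, pass it to preg_replace() as "~$rxtags~xs". The pattern contains literal / characters, so use a delimiter other than /.
Modifiers: x = expanded (whitespace in the pattern is ignored), s = span (the dot also matches newlines), g = globally (Perl only; preg_replace() already replaces every match).
PHP concatenates strings with . just like Perl, so ' . "$wl" . ' can stay as written.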
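To show how such a pattern would be called from PHP, here is a deliberately simplified sketch. It keeps the $wl whitelist group but replaces the attribute handling with a plain [^>]*, so quoted attribute values containing > are not handled, and it ignores script, style, comments and DOCTYPE. The ~ delimiter is used because the pattern contains literal / characters; the variable names $rx, $input and $clean are illustrative.

```php
<?php
$wl = '(?!(?:b|tr|td)\b)'; // whitelist in group, as in the answer above

// Simplified stand-in for $rxtags (an assumption, not the full pattern).
// The x modifier lets the pattern span lines and carry # comments.
$rx = '
  <
  /?                  # optional slash of a closing tag
  ' . $wl . '         # fail here if a whitelisted name follows
  \w+                 # the tag name
  [^>]*               # simplified: everything up to the closing bracket
  >';

$input = '<table><tr><td><b>x</b><span class="a">y</span></td></tr></table>';

// x = expanded, s = dot matches newlines; preg_replace() replaces all matches.
$clean = preg_replace("~$rx~xs", '', $input);

echo $clean, "\n"; // <tr><td><b>x</b>y</td></tr>
```

The whitelisted tr, td and b tags survive, while table and span are removed along with their closing tags.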