PHP Regex: How to strip all HTML tags but not non-HTML tags? [duplicate]
Using PHP regex, how can I remove HTML tags (both opening and closing, including ones with attributes like <hr class="myclass" />) without removing non-HTML tags like <dog> or <dog class="cat">?
The non-HTML tags are dynamic and cannot be hard-coded.
Input:
<b><> <<> <dog> <123> <" !> <!--...--> <!doctype> <hr class="myclass" /> </b>
Output should be:
<> <<> <dog> <123> <" !>
I'm considering using HTML Purifier, but first I need to know whether this is possible with regex.
HTML Tag reference: http://www.quackit.com/html/tags/
Thanks in advance =)
To match (and remove) start and end tags for HTML 4.01 elements only, the regex in this tested PHP function will do a pretty darn good job:
function strip_HTML_tags($text)
{   // Strips HTML 4.01 start and end tags. Preserves contents.
    return preg_replace('%
        # Match an opening or closing HTML 4.01 tag.
        </?                  # Tag opening "<" delimiter.
        (?:                  # Group for HTML 4.01 tags.
          ABBR|ACRONYM|ADDRESS|APPLET|AREA|A|BASE|BASEFONT|BDO|BIG|
          BLOCKQUOTE|BODY|BR|BUTTON|B|CAPTION|CENTER|CITE|CODE|COL|
          COLGROUP|DD|DEL|DFN|DIR|DIV|DL|DT|EM|FIELDSET|FONT|FORM|
          FRAME|FRAMESET|H[1-6]|HEAD|HR|HTML|IFRAME|IMG|INPUT|INS|
          ISINDEX|I|KBD|LABEL|LEGEND|LI|LINK|MAP|MENU|META|NOFRAMES|
          NOSCRIPT|OBJECT|OL|OPTGROUP|OPTION|PARAM|PRE|P|Q|SAMP|
          SCRIPT|SELECT|SMALL|SPAN|STRIKE|STRONG|STYLE|SUB|SUP|S|
          TABLE|TD|TBODY|TEXTAREA|TFOOT|TH|THEAD|TITLE|TR|TT|U|UL|VAR
        )\b                  # End group of tag name alternatives.
        (?:                  # Non-capture group for optional attribute(s).
          \s+                # Attributes must be separated by whitespace.
          [\w\-.:]+          # Attribute name is required for attr=value pair.
          (?:                # Non-capture group for optional attribute value.
            \s*=\s*          # Name and value separated by "=" and optional ws.
            (?:              # Non-capture group for attrib value alternatives.
              "[^"]*"        # Double quoted string.
            | \'[^\']*\'     # Single quoted string.
            | [\w\-.:]+      # Non-quoted attrib value can be A-Z0-9-._:
            )                # End of attribute value alternatives.
          )?                 # Attribute value is optional.
        )*                   # Allow zero or more attribute=value pairs.
        \s*                  # Whitespace is allowed before closing delimiter.
        /?                   # Tag may be empty (with self-closing "/>" sequence).
        >                    # Opening tag closing ">" delimiter.
        | <!--.*?-->         # Or a (non-SGML compliant) HTML comment.
        | <!DOCTYPE[^>]*>    # Or a DOCTYPE.
        %six', '', $text);
}
CAVEATS: Does not remove scripts (<? ... ?>). Will remove any start or end tags occurring within these structures. Does not correctly parse generalized SGML-compliant comments. Does not handle shorttags.
EDIT: Added matching for DOCTYPE and (non-SGML-strict) HTML comments. It now correctly passes the test data in the OP.
EDIT2: The previous version was missing the 's' single-line modifier. Also added shorttags to the caveats list.
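To make the behaviour on the question's test data easier to check, here is a minimal sketch of the same approach with the tag list passed in as a parameter. The helper name strip_listed_tags() and its $htmlTags argument are hypothetical and only keep the example short; the attribute, comment and DOCTYPE patterns are the ones from the function above.

```php
<?php
// Hypothetical helper (not part of the answer above): same regex idea as
// strip_HTML_tags(), but the list of tag names to strip is an argument.
function strip_listed_tags(string $text, array $htmlTags): string
{
    // Quote each name so it is treated literally inside the pattern.
    $names = implode('|', array_map(
        function ($tag) { return preg_quote($tag, '%'); },
        $htmlTags
    ));
    $re = '%</?(?:' . $names . ')\b'                 // known tag name only
        . '(?:\s+[\w\-.:]+(?:\s*=\s*(?:"[^"]*"|\'[^\']*\'|[\w\-.:]+))?)*'
        . '\s*/?>'                                   // optional "/" then ">"
        . '|<!--.*?-->'                              // simple HTML comment
        . '|<!DOCTYPE[^>]*>%si';                     // DOCTYPE declaration
    // preg_replace() returns null on error; fall back to the input then.
    return preg_replace($re, '', $text) ?? $text;
}

$input = '<b><> <<> <dog> <123> <" !> <!--...--> <!doctype> <hr class="myclass" /> </b>';
echo trim(strip_listed_tags($input, ['b', 'hr'])), "\n";
// prints: <> <<> <dog> <123> <" !>
```

The trim() call is only there because removing the comment, DOCTYPE, hr and closing b tags leaves the spaces that separated them. The \b after the name group is what stops a listed name such as b from matching the start of an unlisted tag such as <bold>.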
Consider using HTML Purifier and turning on the HTML.Proprietary option, then using the HTML.Allowed option to expressly whitelist the specific tags and attributes you wish to keep.
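A rough sketch of what that configuration could look like follows. It assumes HTML Purifier is installed (for example via Composer) and already autoloaded; the wrapper name purify_with_whitelist() and the example whitelist are made up for illustration. Keep in mind that HTML.Allowed is a whitelist, so elements that are not listed are generally not kept.

```php
<?php
// Assumption: HTML Purifier is installed and its autoloader has been
// loaded elsewhere (e.g. require 'vendor/autoload.php';).
// purify_with_whitelist() is a hypothetical wrapper, not a library function.
function purify_with_whitelist(string $dirty, string $allowed): string
{
    $config = HTMLPurifier_Config::createDefault();
    $config->set('HTML.Proprietary', true);  // allow proprietary elements/attributes
    $config->set('HTML.Allowed', $allowed);  // e.g. 'b,a[href|title],hr[class]'
    $purifier = new HTMLPurifier($config);
    return $purifier->purify($dirty);
}

// Only run the demo when the library is actually available.
if (class_exists('HTMLPurifier')) {
    echo purify_with_whitelist('<b>bold</b> <i>italic</i>', 'b'), "\n";
}
```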
Remember, using regular expressions to parse HTML can easily invoke the wrath of Zalgo. Do not taunt Zalgo.
Use a function called strip_tags(). Note that by default it removes every tag, including "custom" ones such as <dog>; any tags you do not wish to remove must be specified in its second argument.
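A small sketch of both calls, using a made-up input string. strip_tags() treats anything tag-shaped the same way, so a custom tag survives only when it is listed in the second argument, which means the tags to keep have to be known in advance.

```php
<?php
$html = '<b>bold</b> <dog class="cat">woof</dog>';

// No whitelist: every tag is removed, custom ones included.
$plain = strip_tags($html);

// Listing <dog> in the second argument keeps it, attributes and all.
$kept = strip_tags($html, '<dog>');

echo $plain, "\n";  // bold woof
echo $kept, "\n";   // bold <dog class="cat">woof</dog>
```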
Another alternative working solution by Dhon:
<?php
$exemption_array = array('<a href="http://www.autopartswarehouse.com/search/?searchType=global&N=0&Ntt=A1327630">');

function strip_HTML_tags_withExemptions($str, $arrayExemption = array()) {
    // Note: $arrayExemption holds literal tag strings, for example
    // <a href="http://www.autopartswarehouse.com/search/?searchType=global&N=0&Ntt=A1327630">
    // Each listed string is replaced with a space before the tags are stripped.
    foreach ($arrayExemption as $exemption) {
        $str = str_replace($exemption, " ", $str);
    }
    // Replace known HTML 4/5 tags and comments with a space.
    // The comment pattern is non-greedy and may span several lines.
    $str = preg_replace("/<\/?(!DOCTYPE|a|abbr|acronym|address|applet|area|article|aside|audio|b|base|basefont|bdo|big|blockquote|body|br|button|canvas|caption|center|cite|code|col|colgroup|command|datalist|dd|del|details|dfn|dir|div|dl|dt|em|embed|fieldset|figcaption|figure|font|footer|form|frame|frameset|h[1-6]|head|header|hgroup|hr|html|i|iframe|img|input|ins|keygen|kbd|label|legend|li|link|map|mark|menu|meta|meter|nav|noframes|noscript|object|ol|optgroup|option|output|p|param|pre|progress|q|rp|rt|ruby|s|samp|script|section|select|small|source|span|strike|strong|style|sub|summary|sup|table|tbody|td|textarea|tfoot|th|thead|time|title|tr|tt|u|ul|var|video|wbr|xmp)((\s+\w+(\s*=\s*(?:\".*?\"|'.*?'|[^'\">\s]+))?)+\s*|\s*)\/?>|<!--[\s\S]*?-->/i", " ", $str);
    // Collapse runs of whitespace, then runs of dots.
    $str = preg_replace('/\s\s+/', ' ', $str);
    $str = preg_replace('/[\.]+/', '.', $str);
    return $str;
}
?>