PHP Regex: How to strip all HTML tags but not non HTML tags? [duplicate]

2023-02-18 08:39

This question already has answers here: What to do Regular expression pattern doesn't match anywhere in string? (8 answers) Closed 8 years ago.

Using PHP regex, how can I remove HTML tags (both opening and closing, including ones with attributes such as <hr class="myclass" />) without removing non-HTML tags like <dog> or <dog class="cat">?

The non-HTML tags are dynamic and cannot be hard-coded.

Input:

<b><> <<> <dog> <123> <" !> <!--...--> <!doctype> <hr class="myclass" /> </b>

Output should be:

<> <<> <dog> <123> <" !>

I'm considering using HTML Purifier, but first I need to know whether this is possible with regex.

HTML Tag reference: http://www.quackit.com/html/tags/

Thanks in advance =)

To match (and remove) start and end tags for HTML 4.01 elements only, the regex in this tested PHP function will do a pretty darn good job:

function strip_HTML_tags($text)
{ // Strips HTML 4.01 start and end tags. Preserves contents.
    return preg_replace('%
        # Match an opening or closing HTML 4.01 tag.
        </?                  # Tag opening "<" delimiter.
        (?:                  # Group for HTML 4.01 tags.
          ABBR|ACRONYM|ADDRESS|APPLET|AREA|A|BASE|BASEFONT|BDO|BIG|
          BLOCKQUOTE|BODY|BR|BUTTON|B|CAPTION|CENTER|CITE|CODE|COL|
          COLGROUP|DD|DEL|DFN|DIR|DIV|DL|DT|EM|FIELDSET|FONT|FORM|
          FRAME|FRAMESET|H\d|HEAD|HR|HTML|IFRAME|IMG|INPUT|INS|
          ISINDEX|I|KBD|LABEL|LEGEND|LI|LINK|MAP|MENU|META|NOFRAMES|
          NOSCRIPT|OBJECT|OL|OPTGROUP|OPTION|PARAM|PRE|P|Q|SAMP|
          SCRIPT|SELECT|SMALL|SPAN|STRIKE|STRONG|STYLE|SUB|SUP|S|
          TABLE|TD|TBODY|TEXTAREA|TFOOT|TH|THEAD|TITLE|TR|TT|U|UL|VAR
        )\b                  # End group of tag name alternative.
        (?:                  # Non-capture group for optional attribute(s).
          \s+                # Attributes must be separated by whitespace.
          [\w\-.:]+          # Attribute name is required for attr=value pair.
          (?:                # Non-capture group for optional attribute value.
            \s*=\s*          # Name and value separated by "=" and optional ws.
            (?:              # Non-capture group for attrib value alternatives.
              "[^"]*"        # Double quoted string.
            | \'[^\']*\'     # Single quoted string.
            | [\w\-.:]+      # Non-quoted attrib value can be A-Z0-9-._:
            )                # End of attribute value alternatives.
          )?                 # Attribute value is optional.
        )*                   # Allow zero or more attribute=value pairs
        \s*                  # Whitespace is allowed before closing delimiter.
        /?                   # Tag may be empty (with self-closing "/>" sequence).
        >                    # Opening tag closing ">" delimiter.
        | <!--.*?-->         # Or a (non-SGML compliant) HTML comment.
        | <!DOCTYPE[^>]*>    # Or a DOCTYPE.
        %six', '', $text);
}

CAVEATS: Does not remove scripts <? ... ?>. Will remove any start or end tags occurring in these structures. Does not correctly parse generalized SGML compliant comments. Does not handle shorttags.

EDIT: Added matching for DOCTYPE and (non-SGML-strict) HTML comments. It now correctly passes the test data in the OP.

EDIT2: The previous version was missing the 's' single-line modifier. Also added shorttags to the caveats list.
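If the long alternation is awkward to maintain, the same idea can be sketched with the tag names held in an array. This is only an illustration, not part of the answer above: the function name strip_known_tags and its $tagNames parameter are hypothetical, and the attribute sub-pattern is copied from the regex above. Names are passed through preg_quote(), so a shorthand such as H\d would have to be listed as h1 ... h6.

```php
<?php
// Hypothetical helper: builds the tag-name alternation from an array.
function strip_known_tags(string $text, array $tagNames): string
{
    // Longest names first, so a short name such as "b" is tried after "body".
    usort($tagNames, function ($a, $b) { return strlen($b) - strlen($a); });
    $names = implode('|', array_map(function ($t) { return preg_quote($t, '%'); }, $tagNames));

    $pattern = '%</?(?:' . $names . ')\b'                                   // known tag name
        . '(?:\s+[\w\-.:]+(?:\s*=\s*(?:"[^"]*"|\'[^\']*\'|[\w\-.:]+))?)*'   // optional attributes
        . '\s*/?>'                                                          // closing delimiter
        . '|<!--.*?-->'                                                     // or an HTML comment
        . '|<!DOCTYPE[^>]*>%si';                                            // or a DOCTYPE

    return preg_replace($pattern, '', $text);
}

// Test data from the question; only "b" and "hr" need to be known here.
$input  = '<b><> <<> <dog> <123> <" !> <!--...--> <!doctype> <hr class="myclass" /> </b>';
$output = strip_known_tags($input, ['b', 'hr']);

echo trim($output), "\n";
```

The spaces that separated the removed tags are left in place, which is why the result is trimmed before printing.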

Consider using HTML Purifier and turning on the HTML.Proprietary option, then using the HTML.Allowed option to expressly whitelist the specific tags and attributes you wish to keep.

Remember, using regular expressions to parse HTML can easily invoke the wrath of Zalgo. Do not taunt Zalgo.

Use the built-in strip_tags() function. Be aware that it removes every tag-like construct, not only known HTML elements, so "custom" tags such as <dog> are stripped as well unless you pass them in the second argument, which lists the tags to keep.
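A quick check of how strip_tags() treats a tag it has never heard of, with and without the allow-list argument. The sample string is made up for illustration.

```php
<?php
$html = '<b>bold</b> and <dog class="cat">custom</dog>';

// No allow-list: every tag-like construct is removed, known HTML or not.
$all = strip_tags($html);

// Allow-list: <dog> survives (with its attributes); <b> is still removed.
$keepDog = strip_tags($html, '<dog>');

echo $all, "\n";      // bold and custom
echo $keepDog, "\n";  // bold and <dog class="cat">custom</dog>
```

Because the allow-list has to name each tag to keep, this only fits the question if the custom tag names can be collected before the call.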

Another alternative working solution by Dhon:

<?php
// Example list of literal strings (in tag form) that are blanked out before the tag regex runs.
$exemption_array = array('<a href="http://www.autopartswarehouse.com/search/?searchType=global&N=0&Ntt=A1327630">');
function strip_HTML_tags_withExemptions( $str , $arrayExemption = array() ){
    // Note: $arrayExemption holds literal strings in the form of tags, for example <a href="http://www.autopartswarehouse.com/search/?searchType=global&N=0&Ntt=A1327630">
    foreach( $arrayExemption as $k => $exemptions ) {
        $str = str_replace($exemptions, " " , $str);
    }
    // Replace known HTML tags and comments with a space. The lazy ".*?" plus the "s" modifier stops a comment match at the first "-->", even across line breaks.
    $str = preg_replace("/<\/?(!DOCTYPE|a|abbr|acronym|address|applet|area|article|aside|audio|b|base|basefont|bdo|big|blockquote|body|br|button|canvas|caption|center|cite|code|col|colgroup|command|datalist|dd|del|details|dfn|dir|div|dl|dt|em|embed|fieldset|figcaption|figure|font|footer|form|frame|frameset|h\d|head|header|hgroup|hr|html|i|iframe|img|input|ins|keygen|kbd|label|legend|li|link|map|mark|menu|meta|meter|nav|noframes|noscript|object|ol|optgroup|option|output|p|param|pre|progress|q|rp|rt|ruby|s|samp|script|section|select|small|source|span|strike|strong|style|sub|summary|sup|table|tbody|td|textarea|tfoot|th|thead|time|title|tr|tt|u|ul|var|video|wbr|xmp)((\s+\w+(\s*=\s*(?:\".*?\"|'.*?'|[^'\">\s]+))?)+\s*|\s*)\/?>|<!--.*?-->/is" , " ", $str);
    // Collapse runs of whitespace into one space and runs of dots into one dot.
    $str = preg_replace('/\s\s+/', ' ', $str );
    $str = preg_replace('/[\.]+/', '.', $str );
    return $str;
}
?>
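One detail worth checking in any comment-stripping pattern is whether the quantifier is greedy or lazy. The minimal sketch below, using a made-up sample string, shows the difference when two comments share a line.

```php
<?php
$s = 'a <!-- one --> keep <!-- two --> b';

// Greedy: ".*" runs to the LAST "-->", so the text between the comments is lost.
$greedy = preg_replace('/<!--.*-->/', '', $s);

// Lazy with the "s" modifier: stops at the first "-->" and also spans newlines.
$lazy = preg_replace('/<!--.*?-->/s', '', $s);

var_dump($greedy); // "a  b"
var_dump($lazy);   // "a  keep  b"
```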
