Remove all attributes from html tags

2023-01-03 08:30 问答作者：

<p style="padding:0px;">
  <strong style="padding:0;margin:0;">hello</strong>
</p>

How can I remove attributes from all tags? I'd like it to look like this:

<p>
  <strong>hello</strong>
</p>

Adapted from my answer on a similar question

$text = '<p style="padding:0px;"><strong style="padding:0;margin:0;">hello</strong></p>';

echo preg_replace("/<([a-z][a-z0-9]*)[^>]*?(\/?)>/si",'<$1$2>', $text);

// <p><strong>hello</strong></p>

The RegExp broken down:

/              # Start Pattern
 <             # Match '<' at beginning of tags
 (             # Start Capture Group $1 - Tag Name
  [a-z]        # Match 'a' through 'z'
  [a-z0-9]*    # Match 'a' through 'z' or '0' through '9' zero or more times
 )             # End Capture Group
 [^>]*?        # Match anything other than '>', Zero or More times, not-greedy (wont eat the /)
 (\/?)         # Capture Group $2 - '/' if it is there
 >             # Match '>'
/is            # End Pattern - Case Insensitive & Multi-line ability

Add some quoting, and use the replacement text <$1$2> it should strip any text after the tagname until the end of tag /> or just >.

Please Note This isn't necessarily going to work on ALL input, as the Anti-HTML + RegExp will tell you. There are a few fallbacks, most notably "> would end up "> and a few other broken issues... I would recommend looking at Zend_Filter_StripTags as a more full proof tags/attributes filter in PHP

Here is how to do it with native DOM:

$dom = new DOMDocument;                 // init new DOMDocument
$dom->loadHTML($html);                  // load HTML into it
$xpath = new DOMXPath($dom);            // create a new XPath
$nodes = $xpath->query('//*[@style]');  // Find elements with a style attribute
foreach ($nodes as $node) {              // Iterate over found elements
    $node->removeAttribute('style');    // Remove style attribute
}
echo $dom->saveHTML();                  // output cleaned HTML

If you want to remove all possible attributes from all possible tags, do

$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$nodes = $xpath->query('//@*');
foreach ($nodes as $node) {
    $node->parentNode->removeAttribute($node->nodeName);
}
echo $dom->saveHTML();

I would avoid using regex as HTML is not a regular language and instead use a html parser like Simple HTML DOM

You can get a list of attributes that the object has by using attr. For example:

$html = str_get_html('<div id="hello">World</div>');
var_dump($html->find("div", 0)->attr); /
/*
array(1) {
  ["id"]=>
  string(5) "hello"
}
*/

foreach ( $html->find("div", 0)->attr as &$value ){
    $value = null;
}

print $html
//<div>World</div>

$html_text = '<p>Hello <b onclick="alert(123)" style="color: red">world</b>. <i>Its beautiful day.</i></p>';
$strip_text = strip_tags($html_text, '<b>');
$result = preg_replace('/<(\w+)[^>]*>/', '<$1>', $strip_text);
echo $result;

// Result
string 'Hello <b>world</b>. Its beautiful day.'

Another way to do it using php's DOMDocument class (without xpath) is to iterate over the attributes on a given node. Please note, due to the way php handles the DOMNamedNodeMap class, you must iterate backward over the collection if you plan on altering it. This behaviour has been discussed elsewhere and is also noted in the documentation comments. The same applies to the DOMNodeList class when it comes to removing or adding elements. To be on the safe side, I always iterate backwards with these objects.

Here is a simple example:

function scrubAttributes($html) {
    $dom = new DOMDocument();
    $dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
    for ($els = $dom->getElementsByTagname('*'), $i = $els->length - 1; $i >= 0; $i--) {
        for ($attrs = $els->item($i)->attributes, $ii = $attrs->length - 1; $ii >= 0; $ii--) {
            $els->item($i)->removeAttribute($attrs->item($ii)->name);
        }
    }
    return $dom->saveHTML();
}

Here's a demo: https://3v4l.org/M2ing

Optimized regular expression from the top rated answer on this issue:

$text = '<div width="5px">a is less than b: a<b, ya know?</div>';

echo preg_replace("/<([a-z][a-z0-9]*)[^<|>]*?(\/?)>/si",'<$1$2>', $text);

// <div>a is less than b: a<b, ya know?</div>

UPDATE:

It works better when allow only some tags with PHP strip_tags() function. Let's say we want to allow only  ,  and  tags, then:

$text = '<i style=">">Italic</i>';

$text = strip_tags($text, '<br><b><i>');
echo preg_replace("/<([a-z][a-z0-9]*)[^<|>]*?(\/?)>/si",'<$1$2>', $text);

//<i>Italic</i>

As we can see it fixes flaws connected with tag symbols in attribute values.

Regex's are too fragile for HTML parsing. In your example, the following would strip out your attributes:

echo preg_replace(
    "|<(\w+)([^>/]+)?|",
    "<$1",
    "<p style=\"padding:0px;\">\n<strong style=\"padding:0;margin:0;\">hello</strong>\n</p>\n"
);

Update

Make to second capture optional and do not strip '/' from closing tags:

|<(\w+)([^>]+)| to |<(\w+)([^>/]+)?|

Demonstrate this regular expression works:

$ phpsh
Starting php
type 'h' or 'help' to see instructions & features
php> $html = '<p style="padding:0px;"><strong style="padding:0;margin:0;">hello<br/></strong></p>';
php> echo preg_replace("|<(\w+)([^>/]+)?|", "<$1", $html);
<p><strong>hello</strong><br/></p>
php> $html = '<strong>hello</strong>';
php> echo preg_replace("|<(\w+)([^>/]+)?|", "<$1", $html);
<strong>hello</strong>

Hope this helps. It may not be the fastest way to do it, especially for large blocks of html. If anyone has any suggestions as to make this faster, let me know.

function StringEx($str, $start, $end)
{ 
    $str_low = strtolower($str);
    $pos_start = strpos($str_low, $start);
    $pos_end = strpos($str_low, $end, ($pos_start + strlen($start)));
    if($pos_end==0) return false;
    if ( ($pos_start !== false) && ($pos_end !== false) )
    {  
        $pos1 = $pos_start + strlen($start);
        $pos2 = $pos_end - $pos1;
        $RData = substr($str, $pos1, $pos2);
        if($RData=='') { return true; }
        return $RData;
    } 
    return false;
}

$S = '<'; $E = '>'; while($RData=StringEx($DATA, $S, $E)) { if($RData==true) {$RData='';} $DATA = str_ireplace($S.$RData.$E, '||||||', $DATA); } $DATA = str_ireplace('||||||', $S.$E, $DATA);

To do SPECIFICALLY what andufo wants, it's simply:

$html = preg_replace( "#(<[a-zA-Z0-9]+)[^\>]+>#", "\\1>", $html );

That is, he wants to strip anything but the tag name out of the opening tag. It won't work for self-closing tags of course.

Here's an easy way to get rid of attributes. It handles malformed html pretty well.

<?php
  $string = '<p style="padding:0px;">
    <strong style="padding:0;margin:0;">hello</strong>
    </p>';

  //get all html elements on a line by themselves
  $string_html_on_lines = str_replace (array("<",">"),array("\n<",">\n"),$string); 

  //find lines starting with a '<' and any letters or numbers upto the first space. throw everything after the space away.
  $string_attribute_free = preg_replace("/\n(<[\w123456]+)\s.+/i","\n$1>",$string_html_on_lines);

  echo $string_attribute_free;
?>

继续阅读：php

Remove all attributes from html tags

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？