Strip all HTML attributes except for src

2023-01-02 11:39

I'm trying to remove all tag attributes except for the src attribute. For example:

<p id="paragraph" class="green">This is a paragraph with an image <img src="/path/to/image.jpg" width="50" height="75"/></p>

Would be returned as:

<p>This is a paragraph with an image <img src="/path/to/image.jpg" /></p>

I have a regular expression to strip all attributes, but I'm trying to tweak it to leave in src. Here's what I have so far:

<?php preg_replace('/<([A-Z][A-Z0-9]*)(\b[^>]*)>/i', '<$1>', '<html><goes><here>');

This might work for your needs:

$text = '<p id="paragraph" class="green">This is a paragraph with an image <img src="/path/to/image.jpg" width="50" height="75"/></p>';

echo preg_replace("/<([a-z][a-z0-9]*)(?:[^>]*(\ssrc=['\"][^'\"]*['\"]))?[^>]*?(\/?)>/i",'<$1$2$3>', $text);

// <p>This is a paragraph with an image <img src="/path/to/image.jpg"/></p>

The RegExp broken down:

/              # Start Pattern
 <             # Match '<' at beginning of tags
 (             # Start Capture Group $1 - Tag Name
  [a-z]         # Match 'a' through 'z'
  [a-z0-9]*     # Match 'a' through 'z' or '0' through '9' zero or more times
 )             # End Capture Group
 (?:           # Start Non-Capture Group
  [^>]*         # Match anything other than '>', Zero or More Times
  (             # Start Capture Group $2 - ' src="...."'
   \s            # Match one whitespace
   src=          # Match 'src='
   ['"]          # Match ' or "
   [^'"]*        # Match anything other than ' or " 
   ['"]          # Match ' or "
  )             # End Capture Group 2
 )?            # End Non-Capture Group, match group zero or one time
 [^>]*?        # Match anything other than '>', Zero or More times, not-greedy (won't eat the /)
 (\/?)         # Capture Group $3 - '/' if it is there
 >             # Match '>'
/i            # End Pattern - Case Insensitive

With appropriate quoting, and with <$1$2$3> as the replacement text, it should strip every attribute other than src= from well-formed HTML tags.
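As a rough sketch of what "add some quoting" could look like, the breakdown above can be fed to preg_replace almost verbatim by adding the x (extended) modifier, which lets PCRE ignore whitespace and # comments inside the pattern. The nowdoc wrapper and the variable names are my own choices, not part of the original answer. Note that the comments must not contain the / delimiter.

```php
<?php
// Same pattern as the one-liner above, written in extended (x) mode.
// Comments inside the pattern must avoid the delimiter character.
$pattern = <<<'REGEX'
/
 <                          # opening angle bracket
 ([a-z][a-z0-9]*)           # $1: tag name
 (?:
   [^>]*                    # skip any attributes before src
   (\ssrc=['"][^'"]*['"])   # $2: the src attribute with its leading whitespace
 )?
 [^>]*?                     # remaining attributes, non-greedy
 (\/?)                      # $3: optional self-closing slash
 >
/ix
REGEX;

$text = '<p id="paragraph" class="green">This is a paragraph with an image <img src="/path/to/image.jpg" width="50" height="75"/></p>';

$result = preg_replace($pattern, '<$1$2$3>', $text);
echo $result, PHP_EOL;
```

This should behave the same as the single-line version; it is only easier to read and annotate.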

Please note: this isn't necessarily going to work on ALL input, as the anti-HTML-with-RegExp people are so cleverly noting below. There are a few pitfalls, most notably that <p style=">"> would end up as <p>">, along with a few other breakages. I would recommend looking at Zend_Filter_StripTags as a foolproof tag/attribute filter in PHP.
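To make the caveat concrete, here is a small sketch that runs the same pattern against the <p style=">"> case mentioned above. The second input (an apostrophe inside a double-quoted src value) is an extra illustration of my own, not something from the original answer.

```php
<?php
// Pattern and replacement copied from the answer above.
$pattern = "/<([a-z][a-z0-9]*)(?:[^>]*(\ssrc=['\"][^'\"]*['\"]))?[^>]*?(\/?)>/i";

// A '>' inside an attribute value ends the match too early.
$broken = preg_replace($pattern, '<$1$2$3>', '<p style=">">text</p>');
echo $broken, PHP_EOL; // <p>">text</p>

// Illustrative extra case: the pattern treats ' and " as interchangeable,
// so an apostrophe inside a double-quoted src value truncates the value.
$truncated = preg_replace($pattern, '<$1$2$3>', '<img src="it\'s.jpg" width="5">');
echo $truncated, PHP_EOL; // <img src="it'>
```

Both results are malformed HTML, which is the kind of silent damage a real parser avoids.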

You usually should not parse HTML using regular expressions.

Instead, you should call DOMDocument::loadHTML.
You can then recurse through the elements in the document and call removeAttribute.
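A minimal sketch of that recursion, assuming the input is a single fragment with one root element. The function name stripAttributesExcept and its $keep parameter are hypothetical helpers made up for illustration; loadHTML, removeAttribute, and the LIBXML_* flags are standard PHP DOM features.

```php
<?php
// Hypothetical helper: walk the tree and drop every attribute whose
// name is not listed in $keep.
function stripAttributesExcept(DOMNode $node, array $keep): void
{
    if ($node instanceof DOMElement) {
        // Collect names first; removing while iterating the live
        // attribute map can skip entries.
        $names = [];
        foreach ($node->attributes as $attr) {
            $names[] = $attr->nodeName;
        }
        foreach ($names as $name) {
            if (!in_array(strtolower($name), $keep, true)) {
                $node->removeAttribute($name);
            }
        }
    }
    if ($node->hasChildNodes()) {
        foreach ($node->childNodes as $child) {
            stripAttributesExcept($child, $keep);
        }
    }
}

$html = '<p id="paragraph" class="green">This is a paragraph with an image <img src="/path/to/image.jpg" width="50" height="75"/></p>';

$dom = new DOMDocument();
// These flags keep libxml from wrapping the fragment in <html><body> and a doctype.
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

stripAttributesExcept($dom, ['src']);
$result = $dom->saveHTML();
echo $result;
```

saveHTML serializes as HTML rather than XML, so the img tag is not expected to keep its self-closing slash.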

Alright, here's what I used that seems to be working well:

<([A-Z][A-Z0-9]*)(\b[^>src]*)(src\=[\'|"|\s]?[^\'][^"][^\s]*[\'|"|\s]?)?(\b[^>]*)>

Feel free to poke any holes in it.

Unfortunately I'm not sure how to answer this question for PHP. If I were using Perl I would do the following:

use strict;
my $data = q^<p id="paragraph" class="green">This is a paragraph with an image <img src="/path/to/image.jpg" width="50" height="75"/></p>^;

$data =~ s{
    <([^/> ]+)([^>]+)> # split into tagtype, attribs
}{
    my ( $tag, $attribs ) = ( $1, $2 );   # save captures before running other regexes
    my @parts = split( /\s+/, $attribs ); # separate by whitespace
    @parts = grep { m/^src=/i } @parts;   # retain just src attributes
    if ( @parts ) {
        "<" . join( " ", $tag, @parts ) . ">";
    } else {
        "<" . $tag . ">";
    }
}xseg;

print( $data );

which returns

<p>This is a paragraph with an image <img src="/path/to/image.jpg"></p>

Do not use regex to parse valid html. Use regex to parse an html document ONLY if all available DOM parsers are failing you. I super-love regex, but regex is "DOM-ignorant" and it will quietly fail and/or mutate your document.

I generally prefer a mix of DOMDocument and XPath to concisely, directly, and intuitively target document entities.

With only a couple of minor exceptions, the XPath expression closely resembles its logic in plain English.

//@*[not(name()="src")]

at any level in the document (//)
find any attribute (@*)
satisfying these requirements ([])
that is not (not())
named "src" (name()="src")

This is far more readable, attractive, and maintainable.

Code:

$html = <<<HTML
<p id="paragraph" class="green">
    This is a paragraph with an image <img src="/path/to/image.jpg" width="50" height="75"/>
</p>
HTML;

$dom = new DOMDocument;
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$xpath = new DOMXPath($dom);
foreach ($xpath->query('//@*[not(name()="src")]') as $attr) {
    $attr->parentNode->removeAttribute($attr->nodeName);
}
echo $dom->saveHTML();

Output:

<p>
    This is a paragraph with an image <img src="/path/to/image.jpg">
</p>

If you want to exempt another attribute as well, combine the conditions with or:

//@*[not(name()="src" or name()="href")]
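As a quick sketch of that variant in use, the snippet below keeps both src and href. The sample markup is invented for illustration, and ownerElement is used here in place of parentNode to reach the element that carries each attribute.

```php
<?php
// Sample markup made up for this illustration.
$html = '<p class="intro"><a href="/home" title="Home" target="_blank">Home</a> <img src="/logo.png" width="50" alt="Logo"></p>';

$dom = new DOMDocument();
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

$xpath = new DOMXPath($dom);
// Select every attribute that is neither src nor href, then remove it
// from the element that owns it.
foreach ($xpath->query('//@*[not(name()="src" or name()="href")]') as $attr) {
    $attr->ownerElement->removeAttribute($attr->nodeName);
}

$result = $dom->saveHTML();
echo $result;
```

Further exempt attributes can be added by extending the or chain in the same way.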

As mentioned above, you shouldn't use regex to parse HTML or XML.

I would handle your example with str_replace() if the input is always the same.

$str = '<p id="paragraph" class="green">This is a paragraph with an image <img src="/path/to/image.jpg" width="50" height="75"/></p>';

$str = str_replace(' id="paragraph" class="green"', "", $str); // leading space included so the tag becomes <p>, not <p >

$str = str_replace('width="50" height="75"',"",$str);
