Strip all HTML attributes except for src
I'm trying to remove all tag attributes except for the src attribute. For example:
<p id="paragraph" class="green">This is a paragraph with an image <img src="/path/to/image.jpg" width="50" height="75"/></p>
Would be returned as:
<p>This is a paragraph with an image <img src="/path/to/image.jpg" /></p>
I have a regular expression to strip all attributes, but I'm trying to tweak it to leave in src. Here's what I have so far:
<?php preg_replace('/<([A-Z][A-Z0-9]*)(\b[^>]*)>/i', '<$1>', '<html><goes><here>');
This might work for your needs:
$text = '<p id="paragraph" class="green">This is a paragraph with an image <img src="/path/to/image.jpg" width="50" height="75"/></p>';
echo preg_replace("/<([a-z][a-z0-9]*)(?:[^>]*(\ssrc=['\"][^'\"]*['\"]))?[^>]*?(\/?)>/i",'<$1$2$3>', $text);
// <p>This is a paragraph with an image <img src="/path/to/image.jpg"/></p>
The RegExp broken down:
/ # Start Pattern
< # Match '<' at beginning of tags
( # Start Capture Group $1 - Tag Name
[a-z] # Match 'a' through 'z'
[a-z0-9]* # Match 'a' through 'z' or '0' through '9' zero or more times
) # End Capture Group
(?: # Start Non-Capture Group
[^>]* # Match anything other than '>', Zero or More Times
( # Start Capture Group $2 - ' src="...."'
\s # Match one whitespace
src= # Match 'src='
['"] # Match ' or "
[^'"]* # Match anything other than ' or "
['"] # Match ' or "
) # End Capture Group 2
)? # End Non-Capture Group, match group zero or one time
[^>]*? # Match anything other than '>', Zero or More times, not-greedy (won't eat the /)
(\/?) # Capture Group $3 - '/' if it is there
> # Match '>'
/i # End Pattern - Case Insensitive
With some quoting added and the replacement text <$1$2$3>, it should strip every attribute other than src from well-formed HTML tags.
Please note: this isn't necessarily going to work on ALL input, as the Anti-HTML + RegExp people are so cleverly noting below. There are a few pitfalls, most notably that <p style=">"> would end up as <p>">, plus a few other broken cases. I would recommend looking at Zend_Filter_StripTags as a foolproof tags/attributes filter in PHP.
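The <p style=">"> pitfall mentioned above can be checked directly. This is a minimal sketch that reuses the pattern from this answer; the variable names and the sample input are only for illustration.

```php
<?php
// Same pattern as in the answer above; $pattern and $input are illustrative names.
$pattern = "/<([a-z][a-z0-9]*)(?:[^>]*(\ssrc=['\"][^'\"]*['\"]))?[^>]*?(\/?)>/i";

// A '>' inside an attribute value ends the match too early,
// so the rest of the attribute leaks into the text.
$input  = '<p style=">">text</p>';
$result = preg_replace($pattern, '<$1$2$3>', $input);

echo $result, PHP_EOL; // <p>">text</p>
```

The regex has no notion of quoted attribute values, which is why a DOM parser is the safer choice for untrusted input.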
You usually should not parse HTML using regular expressions.
Instead, you should call DOMDocument::loadHTML.
You can then recurse through the elements in the document and call removeAttribute.
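A rough sketch of that approach might look like the following. The helper name stripAttributes() and its $keep parameter are made up for this example, and the LIBXML_* flags are only there so that loadHTML does not wrap the fragment in html/body tags (they assume a reasonably recent libxml).

```php
<?php
// stripAttributes() is a hypothetical helper, not a built-in function.
// It walks the tree and removes every attribute whose name is not in $keep.
function stripAttributes(DOMNode $node, array $keep = ['src']): void
{
    if ($node instanceof DOMElement) {
        // Collect the names first: removing attributes while iterating
        // over the live attribute map can skip entries.
        $names = [];
        foreach ($node->attributes as $attr) {
            $names[] = $attr->nodeName;
        }
        foreach ($names as $name) {
            if (!in_array(strtolower($name), $keep, true)) {
                $node->removeAttribute($name);
            }
        }
    }
    if ($node->hasChildNodes()) {
        foreach ($node->childNodes as $child) {
            stripAttributes($child, $keep);
        }
    }
}

$html = '<p id="paragraph" class="green">This is a paragraph with an image '
      . '<img src="/path/to/image.jpg" width="50" height="75"/></p>';

$dom = new DOMDocument();
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
stripAttributes($dom);

$result = trim($dom->saveHTML());
echo $result, PHP_EOL;
```

Passing a longer list such as ['src', 'href'] as the second argument keeps those attributes as well.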
Alright, here's what I used that seems to be working well:
<([A-Z][A-Z0-9]*)(\b[^>src]*)(src\=[\'|"|\s]?[^\'][^"][^\s]*[\'|"|\s]?)?(\b[^>]*)>
Feel free to poke any holes in it.
Unfortunately I'm not sure how to answer this question for PHP. If I were using Perl I would do the following:
use strict;
my $data = q^<p id="paragraph" class="green">This is a paragraph with an image <img src="/path/to/image.jpg" width="50" height="75"/></p>^;
$data =~ s{
<([^/> ]+)([^>]+)> # split into tagtype, attribs
}{
my $attribs = $2;
my @parts = split( /\s+/, $attribs ); # separate by whitespace
@parts = grep { m/^src=/i } @parts; # retain just src attributes
if ( @parts ) {
"<" . join( " ", $1, @parts ) . ">";
} else {
"<" . $1 . ">";
}
}xseg;
print( $data );
which returns
<p>This is a paragraph with an image <img src="/path/to/image.jpg"></p>
Do not use regex to parse valid HTML. Use regex to parse an HTML document ONLY if all available DOM parsers are failing you. I super-love regex, but regex is "DOM-ignorant" and it will quietly fail and/or mutate your document.
I generally prefer a mix of DOMDocument and XPath to concisely, directly, and intuitively target document entities.
With only a couple of minor exceptions, the XPath expression closely resembles its logic in plain English.
//@*[not(name()="src")]
- at any level in the document (//)
- find any attribute (@*)
- satisfying these requirements ([])
- that is not (not())
- named "src" (name()="src")
This is far more readable, attractive, and maintainable.
Code:
$html = <<<HTML
<p id="paragraph" class="green">
This is a paragraph with an image <img src="/path/to/image.jpg" width="50" height="75"/>
</p>
HTML;
$dom = new DOMDocument;
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$xpath = new DOMXPath($dom);
foreach ($xpath->query('//@*[not(name()="src")]') as $attr) {
$attr->parentNode->removeAttribute($attr->nodeName);
}
echo $dom->saveHTML();
Output:
<p>
This is a paragraph with an image <img src="/path/to/image.jpg">
</p>
If you want to add another exempt attribute, you can use or:
//@*[not(name()="src" or name()="href")]
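If the list of exempt attributes grows, the expression can be generated instead of written by hand. The following is only a sketch: buildAttributeFilter() is a hypothetical helper, it assumes the attribute names contain no double quotes, and the arrow function requires PHP 7.4 or later.

```php
<?php
// buildAttributeFilter() is a hypothetical helper, not part of DOMXPath.
// Assumption: attribute names in $allowed contain no double quotes.
function buildAttributeFilter(array $allowed): string
{
    if ($allowed === []) {
        return '//@*'; // nothing is exempt, so select every attribute
    }
    $tests = array_map(
        fn(string $name): string => 'name()="' . $name . '"',
        $allowed
    );
    return '//@*[not(' . implode(' or ', $tests) . ')]';
}

$query = buildAttributeFilter(['src', 'href']);
echo $query, PHP_EOL; // //@*[not(name()="src" or name()="href")]

$dom = new DOMDocument();
$dom->loadHTML(
    '<a href="/page" class="link"><img src="/path/to/image.jpg" width="50"></a>',
    LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD
);
$xpath = new DOMXPath($dom);
foreach ($xpath->query($query) as $attr) {
    // ownerElement is the element that carries the attribute
    $attr->ownerElement->removeAttribute($attr->nodeName);
}

$result = trim($dom->saveHTML());
echo $result, PHP_EOL;
```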
As mentioned above, you shouldn't use regex to parse HTML or XML.
If the input is always the same, I would handle your example with str_replace():
$str = '<p id="paragraph" class="green">This is a paragraph with an image <img src="/path/to/image.jpg" width="50" height="75"/></p>';
// Include the leading space so no stray whitespace is left inside the tags.
$str = str_replace(' id="paragraph" class="green"', "", $str);
$str = str_replace(' width="50" height="75"', "", $str);