Pattern matching html tags
I'm new to pattern matching and have only just figured out the basics. I am stuck trying to find an approach to the following problem.
I need to return a match (with PHP preg_match) if any of a number of HTML tags are present:
<p></p>
<br>
<h1></h1>
<h2></h2>
And return no match otherwise. So anything not in the above list fails, e.g.:
<script></script>
<table></table>
etc.
...And ideally I want to operate a white list of safe tags if possible.
Anyone know a pattern that I can use/adapt?
Even though this is not the usual "I want to parse HTML with regular expressions" situation, I would recommend using a DOM parser nevertheless, walk through each element, and abort if it is not in the list of allowed elements.
See e.g. this question to get started.
It could become almost a one-liner using a DOM parser extension like phpQuery if it supports the :not selector and multiple tag names. I don't know, I have never worked with it myself, but it will be easy to find out. Basic examples are here.
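A minimal sketch of the "walk through each element and abort" idea, using PHP's built-in DOMDocument rather than phpQuery. The function name onlyAllowedTags and the sample inputs are made up for illustration. It checks element names only, and it always accepts html, head and body because loadHTML() adds those itself; the parser may also add other wrapper elements of its own around bare text.

```php
<?php
// Hypothetical helper (not from the answer above): true when every element
// found in $html is on the $allowed whitelist.
function onlyAllowedTags(string $html, array $allowed): bool
{
    if (trim($html) === '') {
        return true; // nothing to check; loadHTML() rejects empty input
    }

    $doc = new DOMDocument();
    // Collect libxml parse warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // loadHTML() adds html/head/body itself, so this sketch always accepts them.
    $structural = array('html', 'head', 'body');

    foreach ($doc->getElementsByTagName('*') as $element) {
        $tag = strtolower($element->nodeName);
        if (in_array($tag, $structural, true)) {
            continue;
        }
        if (!in_array($tag, $allowed, true)) {
            return false; // abort at the first element that is not whitelisted
        }
    }
    return true;
}

$allowed = array('p', 'br', 'h1', 'h2');
var_dump(onlyAllowedTags('<p>Hello</p><br><h1>Title</h1>', $allowed));
var_dump(onlyAllowedTags('<p>Hello</p><script>alert(1)</script>', $allowed));
```

The first call contains only whitelisted elements; the second contains a script element and is rejected.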
preg_match_all('/<([a-z]*)\b[^>]*>(.*?)<\/\1>/i', $html, $matches);
Breaking down the expression
The first / is the delimiter.
The < is the start of the tag, the very first <.
The ([a-z]*) captures the tag name, so for instance the strong in <strong. Because it only matches letters, tag names that contain digits, such as h1, would need something like [a-z][a-z0-9]* instead.
The \b marks the end of the tag name (a word boundary), and [^>]* then skips over anything that is not a >, such as attributes.
The > matches the first > that follows, which closes the opening tag.
The (.*?) captures ( .. ) the string inside the tag. Because of the ?, the match is lazy: it stops at the first closing tag that fits instead of running on to the last one.
The </\1> matches the closing tag, but only if its name is the same as the very first capture. This is done by \1, a backreference whose value is whatever was found by ([a-z]*). Inside the pattern the slash has to be written as \/, because / is the delimiter.
You can then use preg_match_all to find all of the tags along with their contents. The output array would be something like:
array(
    0 => array(THE WHOLE TAGS),
    1 => array(TAG NAMES),
    2 => array(TAG VALUES)
)
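To make the shape of that array concrete, here is a small sketch run against a made-up input string. The variable names $byGroup and $byTag are only for illustration. It shows the default ordering (one sub-array per capture group) next to PREG_SET_ORDER (one sub-array per matched tag), which can be more convenient when looping over tags.

```php
<?php
$html = '<b>bold</b> and <i class="x">italic</i>';
$pattern = '/<([a-z]*)\b[^>]*>(.*?)<\/\1>/i';

// Default order: one sub-array per capture group.
// $byGroup[1] lists the tag names, $byGroup[2] the tag contents.
preg_match_all($pattern, $html, $byGroup);
print_r($byGroup);

// PREG_SET_ORDER: one sub-array per matched tag.
preg_match_all($pattern, $html, $byTag, PREG_SET_ORDER);
foreach ($byTag as $match) {
    printf("%s => %s\n", $match[1], $match[2]);
}
// The loop prints:
// b => bold
// i => italic
```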
Hope it helps :)
Example
$allowed = array('b','strong','i','pre','code'); // WHITELIST, never blacklist
foreach($matches[1] as $tag) // $matches[1] holds the captured tag names
{
    if(!in_array(strtolower($tag),$allowed))
    {
        echo sprintf('The tag %s is disallowed!',$tag);
    }
}
Regex is utterly unsuited to checking HTML for ‘safe’ tags. Not only that, but there are no safe tags in HTML. Any element can be given attributes that permit script injection (e.g. onclick, style-with-IE-expression()...). You must check every attribute as well as every element.
When your security is at stake, you absolutely need a real HTML parser for this (then you filter elements/attributes and serialise the results). There are so many ways to evade regex-based checks it's not even funny.
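A few concrete cases, as a sketch: the helper rejectedByRegex below is hypothetical and simply wraps the pattern and whitelist from the earlier answer. Every input is something you would not want to let through, yet the regex check reports nothing for any of them.

```php
<?php
$pattern = '/<([a-z]*)\b[^>]*>(.*?)<\/\1>/i';
$allowed = array('b', 'strong', 'i', 'pre', 'code');

// Hypothetical helper: tag names the regex-based whitelist check would reject.
function rejectedByRegex(string $html, string $pattern, array $allowed): array
{
    preg_match_all($pattern, $html, $matches);
    $names = array_map('strtolower', $matches[1]);
    return array_values(array_diff($names, $allowed));
}

$inputs = array(
    '<img src="x" onerror="alert(1)">',   // no closing tag, so the pattern never matches
    '<script>alert(1)',                   // unclosed tag, also never matched
    '<b onclick="alert(1)">click me</b>', // whitelisted name, attributes are ignored
    '<b><script>alert(1)</script></b>',   // the script sits inside the match for b
);

foreach ($inputs as $input) {
    // Each call returns an empty array: nothing is flagged.
    var_dump(rejectedByRegex($input, $pattern, $allowed));
}
```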
You can use DOMDocument::loadHTML followed by a DOM walk to do this, or you could use an existing library such as htmlpurifier.
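A rough sketch of such a DOM walk, checking attributes as well as elements. The function name findViolations and the shape of the $allowed map (tag name to permitted attribute names) are assumptions for illustration, not an existing API. It only reports problems and does not look at attribute values, so something like a javascript: URL in an allowed href would still get through; for real filtering a maintained library such as htmlpurifier is the safer choice.

```php
<?php
// Hypothetical helper: lists every element or attribute in $html that is not
// whitelisted. $allowed maps a tag name to its permitted attribute names.
function findViolations(string $html, array $allowed): array
{
    $violations = array();
    if (trim($html) === '') {
        return $violations; // loadHTML() rejects empty input
    }

    $doc = new DOMDocument();
    // Collect libxml parse warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($doc->getElementsByTagName('*') as $element) {
        $tag = strtolower($element->nodeName);
        if (in_array($tag, array('html', 'head', 'body'), true)) {
            continue; // added by loadHTML() itself
        }
        if (!array_key_exists($tag, $allowed)) {
            $violations[] = "element <$tag>";
            continue;
        }
        // Attribute names only; attribute values are NOT validated here.
        foreach ($element->attributes as $attribute) {
            $name = strtolower($attribute->nodeName);
            if (!in_array($name, $allowed[$tag], true)) {
                $violations[] = "attribute $name on <$tag>";
            }
        }
    }
    return $violations;
}

$allowed = array('p' => array(), 'a' => array('href'));
print_r(findViolations('<p onclick="alert(1)">hi <a href="/x">link</a></p>', $allowed));
```

Here the only entry reported is the onclick attribute on the p element; the a element and its href are on the whitelist.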