Pattern matching HTML tags

2023-01-13 22:57

I'm new to pattern matching and have only just started to get the hang of it. I am stuck trying to find an approach to the following problem.

I need to return a match (with PHP's preg_match) if any of a number of HTML tags are present.

<p></p>
<br>
<h1></h1>
<h2></h2>

And return no match otherwise. So anything not in the above list fails, e.g.:

<script></script>
<table></table>

etc.

Ideally I want to operate a whitelist of safe tags, if possible.

Anyone know a pattern that I can use/adapt?

Even though this is not the usual "I want to parse HTML with regular expressions" situation, I would recommend using a DOM parser nevertheless, walk through each element, and abort if it is not in the list of allowed elements.

See e.g. this question to get started.

It could become almost a one-liner using a DOM parser extension like phpQuery if it supports the :not selector and multiple tag names - I don't know, have never worked with it myself, but it will be easy to find out. Basic examples are here.
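For illustration, here is a minimal sketch of that idea using only PHP's bundled DOM extension instead of phpQuery, with an XPath not() expression standing in for the :not selector. The function name disallowedElements is made up for this example, and it assumes the whitelist contains plain lowercase tag names. Because loadHTML() adds html, head and body wrappers on its own, the sketch always tolerates those three.

```php
<?php
// Hypothetical helper (not from the answer): returns the names of all
// elements in $html that are not on the $allowed whitelist.
function disallowedElements(string $html, array $allowed): array
{
    if (trim($html) === '') {
        return []; // loadHTML() refuses an empty string
    }

    $doc = new DOMDocument();
    // User-supplied HTML is rarely valid, so collect parser warnings quietly.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // The parser wraps fragments in html/head/body, so tolerate those.
    // Assumption: whitelist entries are plain tag names such as 'p' or 'h1'.
    $tolerated = array_merge(['html', 'head', 'body'], $allowed);
    $conditions = [];
    foreach ($tolerated as $name) {
        $conditions[] = 'self::' . $name;
    }
    // Select every element that is none of the tolerated ones.
    $query = '//*[not(' . implode(' or ', $conditions) . ')]';

    $found = [];
    foreach ((new DOMXPath($doc))->query($query) as $node) {
        $found[] = $node->nodeName;
    }
    return array_values(array_unique($found));
}
```

An empty result means every element was on the whitelist, so the check "abort if it is not in the list" becomes a test for a non-empty array. This only looks at element names, not attributes.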

preg_match_all('/<([a-z]*)\b[^>]*>(.*?)<\/\1>/i', $html, $matches, PREG_SET_ORDER);

Breaking down the expression

The first / is the delimiter

the < is the start of the tag, the very first <

the ([a-z]*) captures the tag name, so for instance strong in <strong>

the \b[^>]* says the tag name must end at a word boundary, then skips over anything that is not a >, such as attributes

the > matches the very first > after that, which closes the opening tag

the (.*?) says keep on looking and COLLECT ( .. ) the string inside, but because we have a ? it is lazy: it stops as soon as the rest of the pattern, the closing tag, can match.

the </\1> says I want to match a closing tag, but only if its name is the same as the very first captured group. This is done by \1, a backreference whose value is whatever was found with ([a-z]*). Since / is also the delimiter, it has to be escaped as \/ inside the pattern.

Then you can use preg_match_all to find all of them along with their contents; the array output would be something like

array(
    0 => THE WHOLE TAG
    1 => TAG NAME
    2 => TAG VALUE
)
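To make the array shape concrete, here is a small sketch that runs the pattern on a sample string. It assumes the / in the closing tag is escaped and that the PREG_SET_ORDER flag is passed, so that each entry of the result describes one tag. The function name matchTags and the sample input are made up for the example.

```php
<?php
// Hypothetical helper: returns one entry per matched tag, each shaped as
// [0 => whole tag, 1 => tag name, 2 => tag contents].
function matchTags(string $html): array
{
    preg_match_all('/<([a-z]*)\b[^>]*>(.*?)<\/\1>/i', $html, $matches, PREG_SET_ORDER);
    return $matches;
}

foreach (matchTags('<b>bold</b> and <i class="x">italic</i>') as $match) {
    printf("whole: %s | tag: %s | content: %s\n", $match[0], $match[1], $match[2]);
}
// whole: <b>bold</b> | tag: b | content: bold
// whole: <i class="x">italic</i> | tag: i | content: italic
```

Without PREG_SET_ORDER, preg_match_all groups the results the other way round: index 0 holds all whole tags, index 1 all tag names, and index 2 all contents.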

Hope it helps :)

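Example

$allowed = array('b','strong','i','pre','code'); // WHITELIST, never blacklist
// Assumes $matches was filled by preg_match_all() with the PREG_SET_ORDER flag,
// so that each $match holds one tag: [0] whole tag, [1] tag name, [2] contents.
foreach($matches as $match)
{
    // The pattern is case-insensitive, so normalise the name before comparing.
    if(!in_array(strtolower($match[1]),$allowed,true))
    {
        echo sprintf('The tag %s is disallowed!',$match[1]);
    }
}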

Regex is utterly unsuited to checking HTML for ‘safe’ tags. Not only that, but there are no safe tags in HTML. Any element can be given attributes that permit script injection (e.g. onclick, or style with IE's expression()...). You must check every attribute as well as every element.

When your security is at stake, you absolutely need a real HTML parser for this (then you filter elements/attributes and serialise the results). There are so many ways to evade regex-based checks it's not even funny.
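As a rough illustration of such evasions, the sketch below wraps a regex whitelist check of the kind discussed in this thread in one function and feeds it three inputs that it lets through. The function name passesRegexCheck and the sample inputs are made up for the example; they are three cases among many.

```php
<?php
// Hypothetical helper: a regex-based whitelist check, returning true when
// no matched tag name falls outside $allowed.
function passesRegexCheck(string $html, array $allowed): bool
{
    preg_match_all('/<([a-z]*)\b[^>]*>(.*?)<\/\1>/i', $html, $matches, PREG_SET_ORDER);
    foreach ($matches as $match) {
        if (!in_array(strtolower($match[1]), $allowed, true)) {
            return false;
        }
    }
    return true;
}

$allowed = ['b', 'strong', 'i', 'pre', 'code'];
$inputs = [
    '<img src=x onerror=alert(1)>',     // no closing tag, so the pattern never matches it
    '<b onclick="alert(1)">hi</b>',     // whitelisted tag; attributes are never inspected
    '<b><script>alert(1)</script></b>', // only the outer <b> is matched; the nested tag is skipped
];
foreach ($inputs as $input) {
    echo $input, ' => ', passesRegexCheck($input, $allowed) ? 'passes' : 'rejected', "\n";
}
```

All three inputs pass this check, even though none of them should be accepted.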

You can use DOMDocument::loadHTML followed by a DOM walk to do this, or you could use an existing library such as HTML Purifier.
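A minimal sketch of the loadHTML-plus-walk approach might look like the following. It only reports violations rather than filtering and serialising, and it is no substitute for a maintained library. The function name findViolations and the shape of the $allowed map (tag name to list of permitted attribute names) are assumptions made for this example.

```php
<?php
// Hypothetical helper: walks every element and reports anything that is
// not whitelisted. $allowed maps a tag name to its permitted attributes,
// e.g. ['p' => [], 'a' => ['href']].
function findViolations(string $html, array $allowed): array
{
    if (trim($html) === '') {
        return []; // loadHTML() refuses an empty string
    }

    $doc = new DOMDocument();
    // User-supplied HTML is rarely valid, so collect parser warnings quietly.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // The parser adds html/head/body itself: allow them, but with no
    // attributes, so something like <body onload=...> is still reported.
    $allowed += ['html' => [], 'head' => [], 'body' => []];

    $violations = [];
    foreach ($doc->getElementsByTagName('*') as $element) {
        $tag = $element->nodeName;
        if (!array_key_exists($tag, $allowed)) {
            $violations[] = sprintf('element <%s>', $tag);
            continue;
        }
        foreach ($element->attributes as $attribute) {
            if (!in_array($attribute->nodeName, $allowed[$tag], true)) {
                $violations[] = sprintf('attribute %s on <%s>', $attribute->nodeName, $tag);
            }
        }
    }
    return $violations;
}
```

Called as findViolations($html, ['p' => [], 'br' => [], 'h1' => [], 'h2' => []]), an empty array means nothing outside the whitelist was found. Even a whitelisted attribute still needs its value checked: an href, for example, can carry a javascript: URL.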
