Regular expression to match a certain HTML element

2023-01-18 19:27

I'm trying to write a regular expression for matching the following HTML.

<span class="hidden_text">Some text here.</span>

I'm struggling to write out the condition to match it and have tried the following, but in some cases it selects everything after the span as well.

$condition = "/<span class=\"hidden_text\">(.*)<\/span>/";

If anyone could highlight what I'm doing wrong that would be great.

You need a non-greedy (lazy) match. Add ? after .* so that it stops at the first </span> instead of running on to the last one:

$condition = "/<span class=\"hidden_text\">(.*?)<\/span>/";
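To make the difference concrete, here is a minimal sketch that runs both patterns over the same input. The sample string and the variable names are made up for illustration; only the two patterns come from the posts above.

```php
<?php
// Hypothetical input: two hidden spans on the same line.
$html = '<span class="hidden_text">first</span> visible <span class="hidden_text">second</span>';

// Single-quoted strings, so the double quotes need no escaping.
$greedy = '/<span class="hidden_text">(.*)<\/span>/';
$lazy   = '/<span class="hidden_text">(.*?)<\/span>/';

preg_match_all($greedy, $html, $greedyMatches);
preg_match_all($lazy, $html, $lazyMatches);

// Greedy: a single capture that runs from the first span's text
// all the way to the last </span> in the input.
print_r($greedyMatches[1]);

// Lazy: one capture per span.
print_r($lazyMatches[1]);
```

Keep in mind that . does not match a newline unless the s modifier is set, so both patterns behave differently again once the span content wraps across lines.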

Note: If you need to match generic HTML, you should use an XML parser like DOM.

You shouldn’t try to use regular expressions on a non-regular language like HTML. It is better to use a proper HTML parser to parse the document.

See the following questions for further information on how to do that with PHP:

How to parse HTML with PHP?
Best methods to parse HTML
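
As a rough sketch of the parser route, assuming PHP's bundled DOM extension is available, something like the following pulls the text out of the span. The sample markup and variable names are invented for the example, and the XPath test is an exact match on the class attribute, so an element carrying several classes would need a contains() test instead.

```php
<?php
// Hypothetical sample markup; in practice this would come from a file or an HTTP response.
$html = '<div><span class="hidden_text">Some text here.</span><span class="other">skip</span></div>';

$doc = new DOMDocument();
// Collect libxml warnings internally instead of printing them for imperfect markup.
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
// Exact attribute match on class="hidden_text".
$nodes = $xpath->query('//span[@class="hidden_text"]');

$texts = [];
foreach ($nodes as $node) {
    // textContent also flattens any nested tags inside the span.
    $texts[] = $node->textContent;
}
print_r($texts);
```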

$condition = "/<span class=\"hidden_text\">(?<=^|>)[^><]+?(?=<|$)<\/span>/";

I got it. ;)

Chances are that you have multiple spans, and the regexp you're using defaults to greedy mode, so it matches through to the last </span>.

It's a lot easier to use PHP's DOM parser to extract content from HTML.

I think this is what they call a teachable moment. :P Let us now compare and contrast the regex in your self-answer:

"/<span class=\"hidden_text\">(?<=^|>)[^><]+?(?=<|$)<\/span>/"

...and this one:

'~<span class="hidden_text">[^><]++</span>~'

PHP's double-quoted strings are subject to interpolation of embedded variables ($my_var) and of variable expressions wrapped in braces ({$my_array['key']}). If you aren't using those features, it's best to use single-quoted strings to avoid surprises. As a bonus, you don't have to escape those double quotes any more.
PHP allows you to use almost any ASCII punctuation character for the regex delimiters. By replacing your slashes with ~ I eliminated the need to escape the slash in the closing tag.
The lookbehind - (?<=^|>) - was not doing anything useful. It would only ever be evaluated immediately after the opening tag had been matched, so the previous character was always >.
[^><]+? is good (assuming you don't want to allow other tags in the content), but the quantifier doesn't need to be reluctant. [^><]+ can't possibly overrun the closing </span> tag, so there's no point sneaking up on it. In fact, go ahead and kick the door in with a possessive quantifier: [^><]++.
Like the lookbehind before it, (?=<|$) was only taking up space. If [^><]+ consumes everything it can and the next character is not <, you don't need a lookahead to tell you the match is going to fail.
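
If you want to check the points above for yourself, a small sketch along these lines compares the two patterns on a handful of inputs. The sample strings are hypothetical and chosen only to cover one span, two spans, an empty span, and a span with a nested tag.

```php
<?php
// The self-answer's pattern, written single-quoted so nothing gets interpolated.
$original   = '/<span class="hidden_text">(?<=^|>)[^><]+?(?=<|$)<\/span>/';
// The trimmed-down pattern with ~ delimiters and a possessive quantifier.
$simplified = '~<span class="hidden_text">[^><]++</span>~';

// Hypothetical test inputs.
$samples = [
    '<span class="hidden_text">Some text here.</span>',
    '<span class="hidden_text">a</span><span class="hidden_text">b</span>',
    '<span class="hidden_text"></span>',
    '<span class="hidden_text">a <b>bold</b> word</span>',
];

$sameResults = true;
foreach ($samples as $sample) {
    preg_match_all($original, $sample, $a);
    preg_match_all($simplified, $sample, $b);
    // Compare the full-match lists produced by the two patterns.
    if ($a[0] !== $b[0]) {
        $sameResults = false;
    }
}
var_dump($sameResults);
```

Agreement on four samples is not a proof, of course; it only illustrates that the lookarounds and the reluctant quantifier add nothing on inputs like these.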

Note that I'm just critiquing your regex, not fixing it; your regex and mine would probably yield the same results every time. There are many ways both of them can go wrong, even if the HTML you're working with is perfectly valid. Matching HTML with regexes is like trying to catch a greased pig.
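
To give a feel for how slippery the pig is, here is a hedged sketch with a few made-up snippets of valid HTML that the pattern silently misses, plus one that it matches even though the span is commented out. The inputs are assumptions for illustration, not cases from the question.

```php
<?php
$pattern = '~<span class="hidden_text">[^><]++</span>~';

// Valid HTML that the pattern fails to match.
$missed = [
    "<span class='hidden_text'>text</span>",               // single-quoted attribute
    '<span id="note" class="hidden_text">text</span>',     // extra attribute first
    '<span class="hidden_text">some <em>text</em></span>', // nested tag in the content
    '<SPAN class="hidden_text">text</SPAN>',               // uppercase tag name
];

$missCount = 0;
foreach ($missed as $html) {
    if (preg_match($pattern, $html) === 0) {
        $missCount++;
    }
}

// A span inside an HTML comment is not part of the rendered page,
// yet the pattern still matches it.
$commentedOut  = '<!-- <span class="hidden_text">text</span> -->';
$falsePositive = preg_match($pattern, $commentedOut) === 1;

var_dump($missCount, $falsePositive);
```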
