Regular expression: find a word in a string, but not when it is inside a tag

This code finds the first occurrence of $word in $text and replaces it with something:

<?php
  $text = preg_replace("/\b($word)\b/i", 'something', $text, 1);
?>

But I want to ignore the word when it is inside an "a" tag. For example, the search should match only the second "word" here:

<a href="something">text text word text</a>. text2 text2 word text2...


I think doing this with just a regular expression is possible, but cumbersome. So here's a programmatic way, which is, however, dirty.

I would first replace every occurrence of word with an auxiliary string that doesn't occur in the original string (e.g. @jska_x). Then I would do a regular expression replacement for @jska_x inside an a-tag in order to restore the words you do not want to replace.

Finally, I would replace @jska_x with target_word.
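
The three steps above can be sketched roughly as follows. This is only an illustration: the function name replaceOutsideLinks is made up, it assumes the placeholder @jska_x never occurs in the input, and it replaces every occurrence outside a-tags rather than just the first one. Words restored inside a link also get the casing of $word, not their original casing.

```php
<?php
// Hypothetical helper, not part of the original answer.
function replaceOutsideLinks(string $text, string $word, string $replacement): string
{
    $placeholder = '@jska_x'; // assumed not to occur in $text
    $quoted = preg_quote($word, '/');

    // Step 1: mark every whole-word occurrence with the placeholder.
    $text = preg_replace('/\b' . $quoted . '\b/i', $placeholder, $text);

    // Step 2: inside each <a>...</a>, turn the placeholder back into the word.
    $text = preg_replace_callback(
        '/<a\b[^>]*>.*?<\/a>/is',
        function (array $m) use ($placeholder, $word) {
            return str_replace($placeholder, $word, $m[0]);
        },
        $text
    );

    // Step 3: whatever is still marked lies outside a link; replace it.
    return str_replace($placeholder, $replacement, $text);
}

echo replaceOutsideLinks(
    '<a href="something">text text word text</a>. text2 text2 word text2...',
    'word',
    'something'
), "\n";
```

Like any regex-based approach, this does not understand nested or malformed markup, which is part of why it is "dirty".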


@\b(word\d+)\b(?![^<>]*</|[^><]*>)@i

<a href="something">text text word1 text</a>. text2 \ (cont. on next line)
<a asdasd> text2 word2 text2... fwefw fwe few fw <a>word3</a> \
<a href="/word5.html">asdada</a>

// Don't mind the numbers after "word"; I used them to see which word matches.

Something like this could do the trick, but I advise you not to use regular expressions for this task. Maybe you could use DOM instead: check that the word is not inside one of the tags you want to skip, then replace it.
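
To see what the pattern above does, here is a small sketch that runs it against the sample text (joined into one string). The variable names are only for illustration. The last check shows one limitation: the lookahead rejects a word that is followed by any closing tag, not only by a closing a-tag.

```php
<?php
// The pattern from the answer; the lookahead rejects a match when the
// text after it runs into "</" (inside an element) or ">" (inside a tag).
$pattern = '@\b(word\d+)\b(?![^<>]*</|[^><]*>)@i';

$sample = '<a href="something">text text word1 text</a>. text2 '
    . '<a asdasd> text2 word2 text2... fwefw fwe few fw <a>word3</a> '
    . '<a href="/word5.html">asdada</a>';

// Collect every word the pattern accepts.
preg_match_all($pattern, $sample, $matches);
print_r($matches[1]); // only word2 is accepted

// Replace just the first accepted word, as in the question.
$replaced = preg_replace($pattern, 'something', $sample, 1);
echo $replaced, "\n";

// Limitation: a word inside a non-link element is skipped as well.
$insideParagraph = preg_match($pattern, '<p>text word1 text</p>');
var_dump($insideParagraph); // int(0)
```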


Use a DOM parser to find all text nodes that contain the needle and do not have a parent element named "a":

$html = <<< HTML
<p>
    . text2 text2 word text2...
    <a href="something">text text word <span> word </span> text</a>
    . text2 text2 word text2...
<p>
HTML;

Code:

$dom = new DOMDocument;
$dom->loadHTML($html);
$xp = new DOMXPath($dom);
$nodes = $xp->query('//*[name() != "a"]/text()[contains(.,"word")]');
foreach($nodes as $node) {
    // can use a Regex in here too if you are after word boundaries
    $node->nodeValue = str_replace('word', 'something', $node->nodeValue);
}
echo $dom->saveXML($dom->documentElement);

Outputs:

<html><body><p>
    . text2 text2 something text2...
    <a href="something">text text word <span> something </span> text</a>
    . text2 text2 something text2...
</p><p/></body></html>

Note how this will also replace word inside the span inside the a. If you want to exclude those too, you have to adjust the XPath to:

'//text()[not(ancestor::a) and contains(., "word")]'

to find all text nodes containing the needle that are not nested anywhere inside an a element.
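
Putting this together with the original requirement (first occurrence only, whole words, case-insensitive), a minimal sketch could look like this. The function name replaceFirstOutsideAnchors and the wrapping div are assumptions made for the example, and it assumes ASCII or otherwise correctly declared input, since loadHTML guesses the encoding.

```php
<?php
// Hypothetical helper, not part of the original answer.
function replaceFirstOutsideAnchors(string $html, string $word, string $replacement): string
{
    libxml_use_internal_errors(true); // keep parser warnings out of the output
    $dom = new DOMDocument;
    // Wrap the fragment so it can be found again after parsing.
    $dom->loadHTML('<div>' . $html . '</div>');
    libxml_clear_errors();

    $xp = new DOMXPath($dom);
    $pattern = '/\b' . preg_quote($word, '/') . '\b/i';

    // Visit text nodes that are not nested anywhere inside an <a>.
    foreach ($xp->query('//text()[not(ancestor::a)]') as $node) {
        if (preg_match($pattern, $node->nodeValue)) {
            // Limit 1: only the first whole-word match is replaced.
            $node->nodeValue = preg_replace($pattern, $replacement, $node->nodeValue, 1);
            break; // stop after the first text node that matched
        }
    }

    // Serialize only the children of the wrapper div.
    $wrapper = $dom->getElementsByTagName('div')->item(0);
    $out = '';
    foreach ($wrapper->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }
    return $out;
}

echo replaceFirstOutsideAnchors(
    '<a href="something">text text word <span> word </span> text</a>. text2 word text2 word',
    'word',
    'something'
), "\n";
```

The XPath here drops the contains() test and lets the regex decide, so word boundaries and case-insensitivity are handled in one place.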

There are a number of third-party parsers worth mentioning that aim to enhance DOM: phpQuery, Zend_Dom, QueryPath and FluentDom.
