Matching a string only if it is not in <script> or <a> tags

2023-02-04 16:12

I'm working on a browser plugin that replaces all instances of "someString" (as defined by a complicated regex) with <a href="http://domain.com/$1">$1</a>. This generally works ok just doing a global replace on the body's innerHTML. However it breaks the page when it finds (and replaces) the "someString" inside <script> tags (i.e. as a JS variable or other JS reference). It also breaks if "someString" is already part of an anchor.

So basically I want to do a global replace on all instances of "someString" unless it falls inside a <script></script> or <a></a> tag set.

Essentially what I have now is:

var body = document.getElementsByTagName('body')[0].innerHTML;
body = body.replace(/(someString)/gi, '<a href="http://domain.com/$1">$1</a>');
document.getElementsByTagName('body')[0].innerHTML = body;

But obviously that's not good enough. I've been struggling for a couple hours now and reading all of the answers here (including the many adamant ones that insist regex should not be used with HTML), so I'm open to suggestions on how to do this. I'd prefer using straight JS, but can use jQuery if necessary.

Edit - Sample HTML:

<body>
  someString
  <script type="text/javascript">
  var someString = 'blah';
  console.log(someString);
  </script>
  <a href="someString.html">someString</a>
</body>

In that case, only the very first instance of "someString" should be replaced.

Try this and see if it meets your needs (tested in IE 8 and Chrome).

<script src="jquery-1.4.4.js" type="text/javascript"></script>
<script>
  var pattern = /(someString)/gi;
  var replacement = "<a href=\"http://domain.com/$1\">$1</a>";

  $(function() {
    $("body, body :not(a,script)") // include <body> itself: contents() only returns children of matched elements
      .contents()
      .filter(function() { 
        return this.nodeType == 3 && this.nodeValue.search(pattern) != -1;
      })
      .each(function() {
        var span = document.createElement("span");
        span.innerHTML = "&nbsp;" + $.trim(this.nodeValue.replace(pattern, replacement));
        this.parentNode.insertBefore(span, this);
        this.parentNode.removeChild(this);
      });
  });
</script>

The code uses jQuery to find all the text nodes within the document's <body> that are not in <a> or <script> blocks and that contain the search pattern. Once those are found, a span containing the target node's modified content is injected, and the old text node is removed.

The only issue I saw was that IE 8 handles text nodes containing only whitespace differently than Chrome, so sometimes a replacement would lose a leading space, hence the insertion of the non-breaking space before the text containing the regex replacements.
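For comparison, the same idea can be sketched without jQuery. This is a minimal sketch rather than the tested code above: splitMatches and linkify are hypothetical names, the use of encodeURIComponent on the link target is an added assumption, and it assumes a browser that supports document.createTreeWalker and Element.closest. It builds the links from DOM nodes instead of innerHTML, so the page text is never re-parsed as markup.

```javascript
// Pure helper (hypothetical name): split text into plain and matched segments.
function splitMatches(text, pattern) {
  // Work on a copy that always has the global flag, so exec() advances.
  var flags = pattern.flags.indexOf("g") === -1 ? pattern.flags + "g" : pattern.flags;
  var re = new RegExp(pattern.source, flags);
  var segments = [];
  var last = 0;
  var m;
  while ((m = re.exec(text)) !== null) {
    if (m[0] === "") { re.lastIndex++; continue; } // guard against empty matches
    if (m.index > last) {
      segments.push({ text: text.slice(last, m.index), match: false });
    }
    segments.push({ text: m[0], match: true });
    last = m.index + m[0].length;
  }
  if (last < text.length) {
    segments.push({ text: text.slice(last), match: false });
  }
  return segments;
}

// DOM part (hypothetical name): wrap matches in links, skipping <a> and <script>.
function linkify(root, pattern, baseUrl) {
  var walker = document.createTreeWalker(root, NodeFilter.SHOW_TEXT, null);
  var targets = [];
  var node;
  // Collect first, modify afterwards, so the walk is not disturbed.
  while ((node = walker.nextNode())) {
    var parent = node.parentElement;
    if (parent && !parent.closest("a, script")) {
      targets.push(node);
    }
  }
  targets.forEach(function (textNode) {
    var segments = splitMatches(textNode.nodeValue, pattern);
    if (!segments.some(function (s) { return s.match; })) {
      return; // nothing to replace in this text node
    }
    var fragment = document.createDocumentFragment();
    segments.forEach(function (s) {
      if (s.match) {
        var a = document.createElement("a");
        a.href = baseUrl + encodeURIComponent(s.text); // assumption: encode the match
        a.textContent = s.text;
        fragment.appendChild(a);
      } else {
        fragment.appendChild(document.createTextNode(s.text));
      }
    });
    textNode.parentNode.replaceChild(fragment, textNode);
  });
}

// Only touch the page when a DOM is actually available.
if (typeof document !== "undefined" && document.body) {
  linkify(document.body, /someString/gi, "http://domain.com/");
}
```

Because closest() checks every ancestor, text inside something like <a><b>...</b></a> is skipped as well, and no extra span or non-breaking space is needed.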

Well, you can use XPath with Mozilla (assuming you're writing the plugin for Firefox). The call is document.evaluate. Or you can use an XPath library to do it (there are a few out there)...

var matches = document.evaluate(
    '//*[not(name() = "a") and not(name() = "script") and contains(., "string")]',
    document,
    null,
    XPathResult.UNORDERED_NODE_ITERATOR_TYPE,
    null
);

Then replace using a callback function:

var callback = function(node) {
    var text = node.nodeValue;
    text = text.replace(/(someString)/gi, '<a href="http://domain.com/$1">$1</a>');
    var div = document.createElement('div');
    div.innerHTML = text;
    // childNodes is live: moving a node out of div shrinks the list,
    // so drain it from the front instead of indexing into it
    while (div.firstChild) {
        node.parentNode.insertBefore(div.firstChild, node);
    }
    node.parentNode.removeChild(node);
};
var nodes = [];
//cache the tree since we want to modify it as we iterate
var node = matches.iterateNext();
while (node) {
    nodes.push(node);
    node = matches.iterateNext();
}
for (var key = 0, length = nodes.length; key < length; key++) {
    node = nodes[key];
    // Check for a Text node
    if (node.nodeType == Node.TEXT_NODE) {
        callback(node);
    } else {
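        // Snapshot the live childNodes list, since callback() changes it
        var children = Array.prototype.slice.call(node.childNodes);
        for (var i = 0, l = children.length; i < l; i++) {
            var child = children[i];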
            if (child.nodeType == Node.TEXT_NODE) {
                callback(child);
            }
        }
    }
}
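A variation on the same approach is to have XPath select the text nodes directly and to ask for a snapshot result, which stays valid while the document is modified. This is a minimal sketch under assumptions: textNodeXPath and findTextNodes are hypothetical names, and it assumes an HTML document in a browser that supports document.evaluate. Note that contains() is case-sensitive, so this only prefilters nodes; the regex still does the real matching.

```javascript
// Hypothetical helper: build an XPath that selects text nodes which contain
// `needle` and have none of `excludedTags` among their ancestors.
function textNodeXPath(needle, excludedTags) {
  if (needle.indexOf('"') !== -1) {
    // XPath 1.0 string literals have no escape syntax
    throw new Error("needle must not contain a double quote");
  }
  var guards = excludedTags.map(function (tag) {
    return "not(ancestor::" + tag + ")";
  });
  guards.push('contains(., "' + needle + '")');
  return "//text()[" + guards.join(" and ") + "]";
}

// Hypothetical helper: evaluate the expression and copy the snapshot to an array.
function findTextNodes(needle) {
  var result = document.evaluate(
    textNodeXPath(needle, ["a", "script"]),
    document,
    null,
    XPathResult.ORDERED_NODE_SNAPSHOT_TYPE,
    null
  );
  var nodes = [];
  for (var i = 0; i < result.snapshotLength; i++) {
    nodes.push(result.snapshotItem(i));
  }
  return nodes;
}

// Only query the page when a DOM is actually available.
if (typeof document !== "undefined" && document.evaluate) {
  var found = findTextNodes("someString");
  console.log(found.length + " candidate text node(s)");
}
```

Each node returned this way can be passed to a replacement callback like the one above without the extra nodeType check, because the expression only ever yields text nodes.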

I know you don't want to hear this, but this doesn't sound like a job for a regex. Regular expressions handle negative matches poorly, and the patterns quickly become complicated and unreadable.

Perhaps this regex might be close enough though:

/>[^<]*(someString)[^<]*</

It captures any instance of someString that is in between a > and a <.
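To see what this pattern picks up, it can be run against the question's sample HTML held in a string. This is a minimal sketch; sampleHtml, betweenTags and precedingTags are names chosen for illustration. The pattern only requires the text to sit between a > and a <, so it is also satisfied by text inside <script> and <a> elements, and because it consumes the whole text run it reports at most one occurrence per run.

```javascript
// The sample HTML from the question, as a plain string.
var sampleHtml = [
  "<body>",
  "  someString",
  '  <script type="text/javascript">',
  "  var someString = 'blah';",
  "  console.log(someString);",
  "  </script>",
  '  <a href="someString.html">someString</a>',
  "</body>"
].join("\n");

var betweenTags = />[^<]*(someString)[^<]*</g;
var precedingTags = [];
var m;
while ((m = betweenTags.exec(sampleHtml)) !== null) {
  // m.index points at the ">" that opens the match, so the text up to and
  // including it ends with the tag that precedes the matched text run.
  var prefix = sampleHtml.slice(0, m.index + 1);
  var tag = /<(\/?[a-z]+)[^<>]*>$/i.exec(prefix);
  precedingTags.push(tag ? tag[1] : "?");
}

console.log(precedingTags); // ["body", "script", "a"]
```

So on this input the pattern finds the wanted occurrence after <body>, but the hits after <script> and <a> would still have to be filtered out separately.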

Another idea is if you do use jQuery, you can use the :contains pseudo-selector.

$('*:contains(someString)').each(function(i)
{
    var markup = $(this).html();
    // modify markup to insert anchor tag
    $(this).html(markup)
});

This will grab any DOM element that contains 'someString' in its text, including the ancestors of the element that directly holds it. I don't think it will traverse <script> tags, but check that before relying on it.

You could try the following:

/(someString)(?![^<]*?(<\/a>|<\/script>))/

I didn't test every scenario, but it basically uses a negative lookahead to find the next opening bracket following someString; if that bracket begins a closing anchor or script tag, it does not match.

Your example seems to work in this fiddle, although it certainly doesn't cover all possibilities. In cases where the innerHTML in your <a></a> contains tags (like <b> or <span>), or the code in your script tags generates HTML (contains strings with tags in them), you would need something more complex.
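Both the working case and the nested-tag limitation can be checked on plain strings. This is a minimal sketch; sampleHtml, notBeforeClose, linked and nested are names chosen for illustration, and the nested markup is a made-up example of the failure case described above.

```javascript
// The sample HTML from the question, as a plain string.
var sampleHtml = [
  "<body>",
  "  someString",
  '  <script type="text/javascript">',
  "  var someString = 'blah';",
  "  console.log(someString);",
  "  </script>",
  '  <a href="someString.html">someString</a>',
  "</body>"
].join("\n");

var notBeforeClose = /(someString)(?![^<]*?(<\/a>|<\/script>))/g;

// Working case: only the bare occurrence in <body> gets wrapped in a link.
var linked = sampleHtml.replace(
  notBeforeClose,
  '<a href="http://domain.com/$1">$1</a>'
);
var linkCount = (linked.match(/http:\/\/domain\.com\//g) || []).length;
console.log(linkCount); // 1

// Limitation: with a tag nested inside the anchor, the next "<" after the
// text belongs to </b>, so the lookahead no longer sees </a> and the text
// inside the existing link is matched anyway.
var nested = '<a href="page.html"><b>someString</b></a>';
var nestedHits = nested.match(notBeforeClose) || [];
console.log(nestedHits.length); // 1
```

The second result is the false positive: a replace on that string would put a new anchor inside the existing one.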
