How to match a keyword on a web page that is NOT within an <a> and its href, using JavaScript?

I'm searching a page to find a specific keyword. That itself is easy enough. The added complication is that I don't want to match this keyword if it is part of an <a> tag.

E.g.

<p>Here is some example content that has a keyword in it. 
I want to match this keyword here but, i don't want to match 
the <a href="http://www.keyword.com">keyword</a> here.</p>

If you look at the above example content, the word 'keyword' appears 4 times. I want to match the first two occurrences, which are plain paragraph text, but I do not want to match it when it appears as part of the href or as part of the <a> content.

So far I've managed to use this below:

var tester = new RegExp("((?!<a.*?>)("+keyword+")(?!</a>))", 'ig');

The problem with the above is that it still matches the keyword when it is part of the href.

Any ideas? Thanks


You can't reliably do this with JavaScript regexes. It's hard enough to do with the .NET regex engine, which is one of the few to support infinite-length lookbehind assertions. JavaScript engines that predate ES2018 don't support lookbehind assertions at all, so there you can't look back to see what came before the text you do want to match.

So you should either use a DOM parser (I'm sure someone fluent in JavaScript can suggest a practical approach here), or read the text, remove all the <a> tags (which you sort of could do with a regex, if you're the brave type), and then search for your keyword in the rest of the text.
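For the DOM route, a minimal sketch might look like the following. It assumes the browser has already parsed the page, and the helper names findKeywordOutsideLinks and escapeRegExp are hypothetical, not anything from the question. It relies only on the standard nodeType, nodeName, nodeValue and childNodes properties, and it never descends into an <a> element, so both the link text and the href are skipped.

```javascript
// Hypothetical helper: escape regex metacharacters so the keyword is matched literally.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical helper: collect every occurrence of `keyword` in the text nodes
// under `root`, skipping any subtree rooted at an <a> element.
function findKeywordOutsideLinks(root, keyword) {
  const pattern = new RegExp(escapeRegExp(keyword), "gi");
  const hits = [];

  (function walk(node) {
    if (node.nodeType === 3) {
      // 3 is Node.TEXT_NODE: search the text and record each match offset.
      for (const match of node.nodeValue.matchAll(pattern)) {
        hits.push({ node: node, index: match.index });
      }
      return;
    }
    // 1 is Node.ELEMENT_NODE: do not look inside links at all.
    if (node.nodeType === 1 && node.nodeName.toUpperCase() === "A") {
      return;
    }
    const children = node.childNodes || [];
    for (let i = 0; i < children.length; i++) {
      walk(children[i]);
    }
  })(root);

  return hits;
}
```

In a browser you would call something like findKeywordOutsideLinks(document.body, "keyword"). Text inside <script> or <style> elements is still searched in this sketch; add those names to the skip check if that matters for your page.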

EDIT:

Well, there is a dirty hack that you could use. It's not pretty, and if you look at Alan Moore's comment to your question, you'll be able to imagine a multitude of ways in which this regex will fail, but it does work on your example:

/keyword(?!(?:(?!<a).)*<\/a)/

How does it "work"?

keyword    # Match "keyword"
(?!        # but only if it is not possible to match the following regex in the text ahead:
 (?:       # - Match...
  (?!<a)   # -- unless it's the start of an <a> tag...
  .        # -- any character (except a line break)
 )*        # - any number of times
 <\/a      # then match the start of a closing </a> tag (the slash is escaped for the regex literal).
)          # End of lookahead assertion.

This is quite cryptic, even with the explanation. What it essentially does is:

  • Match "keyword"
  • Look ahead to check that no closing </a> follows in the remaining text
  • unless an opening <a> tag comes first.

So if all your <a> tags are correctly balanced, not nested, not found inside comments or script blocks, you might just get away with it.
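To see both the success and one of the failure modes, here is a small sketch that runs the lookahead hack on the example from the question. The helper name findOutsideAnchors and the sample strings are made up for illustration. It also adds the g, i and s flags, which are not part of the pattern above; the s flag (ES2018) lets "." cross line breaks, which the bare pattern does not do.

```javascript
// Hypothetical helper: returns the offsets of `keyword` that the lookahead
// hack considers to be outside any <a>...</a> pair.
function findOutsideAnchors(html, keyword) {
  // Escape regex metacharacters so the keyword is matched literally.
  const escaped = keyword.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  // g: all matches, i: ignore case, s: let "." cross line breaks.
  const pattern = new RegExp(escaped + "(?!(?:(?!<a).)*</a)", "gis");
  return Array.from(html.matchAll(pattern), (m) => m.index);
}

const sample =
  "<p>Here is some example content that has a keyword in it. \n" +
  "I want to match this keyword here but, i don't want to match \n" +
  'the <a href="http://www.keyword.com">keyword</a> here.</p>';

// Only the two occurrences in the plain paragraph text are found.
console.log(findOutsideAnchors(sample, "keyword").length); // 2

// A failure case: any tag starting with "<a" inside the link (here <abbr>)
// stops the scan before it reaches </a>, so the linked keyword is matched.
const broken = '<a href="#">keyword <abbr>etc.</abbr></a>';
console.log(findOutsideAnchors(broken, "keyword").length); // 1
```

The second call returns a match even though the keyword sits inside a link, which is exactly the kind of breakage to expect once the markup stops being simple.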
