How to match a keyword on a web page that is NOT within an <a> and its href, using JavaScript?

2023-02-06 10:15

I'm searching a page to find a specific keyword. That itself is easy enough. The added complication is that I don't want to match this keyword if it is part of an <a> tag.

E.g.

<p>Here is some example content that has a keyword in it. 
I want to match this keyword here but, i don't want to match 
the <a href="http://www.keyword.com">keyword</a> here.</p>

If you look at the example content above, the word 'keyword' appears 4 times. I want to match the first two occurrences within the paragraph, but I do not want to match it when it appears as part of the href or as part of the <a> content.

So far I've come up with this:

var tester = new RegExp("((?!<a.*?>)("+keyword+")(?!</a>))", 'ig');

The problem is that it still matches the keyword when it is part of the href.
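
To make the problem concrete, here is a small sketch that runs the pattern above against the example paragraph. The variable names (`demoKeyword`, `demoHtml`, `naiveTester`, `naiveMatches`) are made up for this illustration, and the string is simply the example content pasted in.

```javascript
// Hypothetical sample data mirroring the example paragraph above.
const demoKeyword = "keyword";
const demoHtml =
  "<p>Here is some example content that has a keyword in it. \n" +
  "I want to match this keyword here but, i don't want to match \n" +
  'the <a href="http://www.keyword.com">keyword</a> here.</p>';

// Same pattern as above. The first lookahead only asserts that the text at the
// current position does not start with "<a...>". That is always true when the
// keyword itself starts there, so it never excludes anything.
const naiveTester = new RegExp("((?!<a.*?>)(" + demoKeyword + ")(?!</a>))", "ig");

// Collect the start index of every match.
const naiveMatches = [...demoHtml.matchAll(naiveTester)].map((m) => m.index);

console.log(naiveMatches.length); // 3: two in the text, plus the one inside the href
```

Only the occurrence directly followed by `</a>` is rejected; the one inside the href still gets through.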

Any ideas? Thanks

You can't reliably do this with JavaScript regexes. It's hard enough with the .NET regex engine, one of the few that supports unbounded-length lookbehind assertions. JavaScript engines that predate ES2018 don't support lookbehind assertions at all, so there you can't look back to see what came before the text you want to match.

So you should either use a DOM parser (I'm sure someone fluent in JavaScript can suggest a practical approach here), or read the text, remove all the <a> tags (which you sort of could do with a regex, if you're the brave type), and then search for your keyword in the rest of the text.
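
As a rough sketch of both options, and not a definitive implementation: the first function walks a DOM tree and only looks at text nodes outside of `<a>` elements, the second strips the links from the markup and searches what is left. The function names (`escapeRegExp`, `findKeywordOutsideLinks`, `countKeywordAfterStrippingLinks`) are hypothetical, and the link-stripping regex assumes the anchors are well formed and not nested.

```javascript
// Escape regex metacharacters so the keyword is matched literally.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Option 1: walk the DOM and collect keyword hits from text nodes,
// skipping every <a> subtree. Only standard Node properties are used
// (nodeType, nodeName, nodeValue, childNodes).
function findKeywordOutsideLinks(root, keyword) {
  const ELEMENT_NODE = 1; // same value as Node.ELEMENT_NODE
  const TEXT_NODE = 3; // same value as Node.TEXT_NODE
  const hits = [];

  (function walk(node) {
    if (node.nodeType === TEXT_NODE) {
      const pattern = new RegExp(escapeRegExp(keyword), "gi");
      for (const match of node.nodeValue.matchAll(pattern)) {
        hits.push({ node: node, index: match.index });
      }
      return;
    }
    // Attributes such as href are not child nodes, so they are never visited.
    if (node.nodeType === ELEMENT_NODE && node.nodeName.toUpperCase() === "A") {
      return; // skip the whole link, including its text
    }
    for (const child of Array.from(node.childNodes || [])) {
      walk(child);
    }
  })(root);

  return hits;
}

// Option 2: remove <a>...</a> from the markup, then search the rest.
// Assumes well-formed, non-nested anchors.
function countKeywordAfterStrippingLinks(html, keyword) {
  const withoutLinks = html.replace(/<a\b[^>]*>[\s\S]*?<\/a\s*>/gi, "");
  const found = withoutLinks.match(new RegExp(escapeRegExp(keyword), "gi"));
  return found ? found.length : 0;
}
```

In a browser, the first function could be called as `findKeywordOutsideLinks(document.body, "keyword")`. Note that it would also report hits inside `<script>` or `<style>` text, so those elements may need to be skipped the same way as `<a>`.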

EDIT:

Well, there is a dirty hack that you could use. It's not pretty, and if you look at Alan Moore's comment to your question, you'll be able to imagine a multitude of ways in which this regex will fail, but it does work on your example:

/keyword(?!(?:(?!<a).)*<\/a)/

How does it "work"?

keyword    # Match "keyword"
(?!        # but only if it is not possible to match the following regex in the text ahead:
 (?:       # - Match...
  (?!<a)   # -- unless it's the start of an <a> tag...
  .        # -- any character
 )*        # - any number of times
 <\/a      # - then match the start of a closing </a> tag.
)          # End of lookahead assertion.

This is quite cryptic, even with the explanation. What it essentially does is:

Match "keyword"
Look ahead that there is no closing </a> in the following text
unless an opening <a> tag comes first.

So if all your <a> tags are correctly balanced, not nested, not found inside comments or script blocks, you might just get away with it.
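
For illustration, here is a small sketch that applies the hack to the example content from the question, followed by one of the ways it fails. The variable names (`sampleHtml`, `outsideLinks`, `hackMatches`, `brokenCase`) are invented for this example, and the `g` and `i` flags are added so that all occurrences are found.

```javascript
// The example paragraph from the question, line breaks included.
const sampleHtml =
  "<p>Here is some example content that has a keyword in it. \n" +
  "I want to match this keyword here but, i don't want to match \n" +
  'the <a href="http://www.keyword.com">keyword</a> here.</p>';

// The slash in "</a" is escaped; an unescaped "/" would end the regex literal.
const outsideLinks = /keyword(?!(?:(?!<a).)*<\/a)/gi;

const hackMatches = [...sampleHtml.matchAll(outsideLinks)].map((m) => m.index);
console.log(hackMatches.length); // 2: only the two occurrences in plain text

// One failure mode: "." does not match a line break, so a link whose text
// spans several lines hides its closing tag from the lookahead.
const brokenCase = '<a href="#">keyword\nmore</a>';
console.log([...brokenCase.matchAll(outsideLinks)].length); // 1: a false match
```

In engines that support the ES2018 `s` (dotAll) flag, adding it lets `.` cross line breaks, which avoids this particular failure but none of the others mentioned above.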
