Regex challenge: Match phrase only if outside of an <a href> tag

2022-12-08 03:08

I am working on improving our glossary functionality in a custom CMS that is running with classic ASP (ASP 3.0) on IIS with VBScript code. I am stumped on a regex challenge I cannot solve.

Here is the current code:

    ' Only add glossary links to articles that contain no links at all
    If InStr(ART_ArticleBody, "href") = 0 Then
        sql = "SELECT URL, Term, RegX FROM GLOSSARYDB;"
        Set rsGlossary = Server.CreateObject("ADODB.Recordset")
        rsGlossary.Open sql, strSQLConn
        Set RegExObject = New RegExp
        While Not rsGlossary.EOF
            URL = rsGlossary("URL")
            Phrase = rsGlossary("RegX")
            With RegExObject
                .Pattern = Phrase
                .IgnoreCase = True
                .Global = False   ' link only the first occurrence of each term
            End With
            Set expressionmatch = RegExObject.Execute(ART_ArticleBody)
            If expressionmatch.Count > 0 Then
                For Each expressionmatched In expressionmatch
                    ' Wrap the matched text in a link to the glossary page
                    LinkHtml = "<a href=" & URL & ">" & expressionmatched.Value & "</a>"
                    ART_ArticleBody = RegExObject.Replace(ART_ArticleBody, LinkHtml)
                Next
            End If
            rsGlossary.MoveNext
        Wend
        rsGlossary.MoveFirst
        Set RegExObject = Nothing
    End If

Instead of skipping glossary links for any article that contains an href, as the above code does, I would like to process every article but have the RegEx pattern avoid matching a glossary entry if the match is inside an <a> tag.

For example, below is a test passage for this regex entry in my DB: ROI|return on investment|investment return

Here is a link that uses the glossary term: <a href="ROI.htm">Info on return on investment</a>. Now, here is the glossary term in plain text, not inside of a link: return on investment. We want to find the third instance of a match but not find the first two because they are both inside of a HTML link.

In the above text, if I were processing the article for the glossary entry "ROI|return on investment|investment return", I do not want to match the first or second occurrence because they are inside an <a> tag. I need the regex pattern to skip those and match only occurrences that are not inside an <a> tag.

Any help on this would be greatly appreciated.

Try this regex:

<a\b[^<>]*>[\s\S]*?</a>|(ROI|return on investment|investment return)

This matches an HTML anchor, or any of the terms you're looking for. The terms are captured into group number 1. So in your VBScript code, check if the first capturing group matched anything, and you've got one of your keywords outside an <a> tag.
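To illustrate the "check the first capturing group" step, here is a minimal sketch in Python rather than VBScript (the pattern itself is unchanged). The function name link_terms and the URL glossary/roi.htm are made up for the example. The replacement callback returns anchors untouched and only wraps a match when group 1 actually captured a term.

```python
import re

# Anchor-or-term alternation from the answer above; only the term is captured.
PATTERN = re.compile(
    r"<a\b[^<>]*>[\s\S]*?</a>|(ROI|return on investment|investment return)",
    re.IGNORECASE,
)


def link_terms(html: str, url: str) -> str:
    """Wrap glossary terms in a link, leaving existing anchors untouched.

    link_terms and its url parameter are hypothetical names for this sketch.
    """
    def replace(match: re.Match) -> str:
        if match.group(1) is None:
            # The anchor branch matched: emit the anchor unchanged.
            return match.group(0)
        # Group 1 matched: this is a term outside any anchor.
        return f'<a href="{url}">{match.group(1)}</a>'

    return PATTERN.sub(replace, html)


sample = (
    'See <a href="ROI.htm">Info on return on investment</a>. '
    "Plain text: return on investment."
)
result = link_terms(sample, "glossary/roi.htm")
print(result)
```

In VBScript the same idea would mean looping over the Matches collection and inspecting SubMatches(0) for each match, since the replacement there is a plain string rather than a callback.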

This regex won't work correctly if you have nested <a> tags. That shouldn't be a problem, as anchors are normally not nested inside each other. If it is a problem, you can't solve it with VBScript/JavaScript regular expressions. The regex also won't work correctly if you have <a> tags that are missing their closing tags. If you want to take that into account, try this regex:

<a\b[^<>]*>(?:(?:(?!<a\b)[\s\S])*?</a>)?|(ROI|return on investment|investment return)
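The difference between the two patterns shows up on an anchor that is never closed. The sketch below (Python, with a made-up input string and a hypothetical helper terms_outside_anchors) runs both on the same text: the simple pattern swallows everything up to the next </a>, while the second one matches only the opening tag when it sees another <a before any </a>.

```python
import re

TERMS = r"(ROI|return on investment|investment return)"

# First pattern from the answer: assumes every <a> has a closing </a>.
SIMPLE = re.compile(r"<a\b[^<>]*>[\s\S]*?</a>|" + TERMS, re.IGNORECASE)

# Second pattern: the anchor body is optional and may not run past another <a.
TOLERANT = re.compile(
    r"<a\b[^<>]*>(?:(?:(?!<a\b)[\s\S])*?</a>)?|" + TERMS, re.IGNORECASE
)


def terms_outside_anchors(pattern: re.Pattern, html: str) -> list:
    """Return the terms the pattern treats as being outside anchors."""
    return [m.group(1) for m in pattern.finditer(html) if m.group(1) is not None]


# Hypothetical input: a named anchor that was never closed.
broken = '<a name="top">Our ROI page. <a href="x.htm">More</a>'
well_formed = '<a href="ROI.htm">ROI</a> and ROI'

simple_broken = terms_outside_anchors(SIMPLE, broken)
tolerant_broken = terms_outside_anchors(TOLERANT, broken)
print(simple_broken, tolerant_broken)
```

On well-formed input the two patterns should agree; they differ only in how much text an unclosed tag is allowed to hide.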

This problem is, as they say, "non-trivial" in its current state. However, if you could modify your system to output more semantic markup, it would make things much easier:

<a href="ROI.htm">undesired tag match</a>
This is <span class="tag">a tag</span>

In this case, you can simply search:

(?<=<span class=\"tag\">)(phrase1|phrase2|phrase3)(?=</span>)

Or something a little more robust

(?<=<span class=\"tag\">).+?(?=</span>)

This way you can easily focus your searches to data within a specific <span>, and leave everything else aside.
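One caveat: as far as I know, the VBScript RegExp object supports lookahead but not lookbehind, so the (?<=...) form would need a capturing group instead when run from classic ASP. The sketch below uses Python only to show that both forms pick out the same text; the variable names are made up for the example.

```python
import re

html = (
    '<a href="ROI.htm">undesired tag match</a> '
    'This is <span class="tag">a tag</span>'
)

# Lookaround form from the answer (needs an engine with lookbehind support).
LOOKAROUND = re.compile(r'(?<=<span class="tag">).+?(?=</span>)')

# Equivalent capturing-group form that avoids lookbehind entirely.
CAPTURE = re.compile(r'<span class="tag">(.+?)</span>')

with_lookaround = LOOKAROUND.findall(html)
with_capture = CAPTURE.findall(html)
print(with_lookaround, with_capture)
```

With the capturing-group form the match itself includes the span tags, so the term has to be read from group 1 rather than from the whole match.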

You can't solve it because it can't be done, at least not with 100% reliability. HTML is not a "regular" language in the regular expression sense. Like the saying goes, when you have a hammer, everything starts to look like a nail. There are some things regular expressions aren't good at. This is one of them.

Most languages have some form of HTML parsing library as standard or easily obtained. Use those. That's what they were designed for.
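As a rough sketch of the parser-based approach, here is one way it could look with Python's standard html.parser module (classic ASP would need a different library). The class GlossaryLinker, the helper link_glossary, and the URL are all made up for the example. The parser re-emits the markup as it goes, tracks whether it is inside an <a> element, and only applies the term regex to text nodes outside anchors. Comments and doctype declarations are dropped in this simplified version.

```python
import re
from html.parser import HTMLParser


class GlossaryLinker(HTMLParser):
    """Re-emit HTML, linking glossary terms only in text outside <a> elements."""

    def __init__(self, pattern: str, url: str):
        # convert_charrefs=False keeps entities such as &amp; as written.
        super().__init__(convert_charrefs=False)
        self.term = re.compile(pattern, re.IGNORECASE)
        self.url = url
        self.anchor_depth = 0
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.anchor_depth += 1
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are copied through verbatim.
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "a" and self.anchor_depth > 0:
            self.anchor_depth -= 1
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if self.anchor_depth == 0:
            # Text outside any anchor: wrap each term in a glossary link.
            data = self.term.sub(
                lambda m: f'<a href="{self.url}">{m.group(0)}</a>', data
            )
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")


def link_glossary(html: str, pattern: str, url: str) -> str:
    parser = GlossaryLinker(pattern, url)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


sample = (
    'Link: <a href="ROI.htm">Info on return on investment</a>. '
    "Plain: return on investment."
)
linked = link_glossary(
    sample, "ROI|return on investment|investment return", "glossary/roi.htm"
)
print(linked)
```

Because the regex only ever sees plain text nodes, it no longer has to know anything about tags, and the "ROI" inside the href attribute is never a candidate.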

In general, you can't use a regular expression to recognize arbitrarily nested constructs (such as bracket-delimited HTML tags). If you had solved this problem, there'd be a lot of mathematicians lining up to hear about it. :)

Having said that, .NET does indeed offer an extension to regular expressions that permits what I just said was impossible, and--even better!--the sample chapter for the great "Mastering Regular Expressions" available here happens to cover that feature.

(accounts receivable|A/R)(?!((?!</?a\b).)*</a)

(phrase1|phrase2|phrase3)(?!((?!</?a\b).)*</a)

The above approach seems to work, at least in my RegexBuddy software. I didn't figure it out on my own. Had some help from a guru. Time to test it in my ASP code. Thanks to all who provided input. I'm sure I didn't describe what I needed well enough for you to come up with the above solution. Mea culpa.
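For anyone who wants to check the lookahead pattern outside RegexBuddy, here is a small sketch in Python using the test passage from the question (the glossary URL is made up). The negative lookahead rejects a term if a </a can be reached from it without first crossing another <a or </a tag. Note that . does not match newlines, so a term and its closing </a> on different lines would probably need [\s\S] in place of the dot.

```python
import re

# Pattern from the post above, with the ROI glossary entry filled in.
OUTSIDE_ANCHOR = re.compile(
    r"(ROI|return on investment|investment return)(?!((?!</?a\b).)*</a)",
    re.IGNORECASE,
)

text = (
    "Here is a link that uses the glossary term: "
    '<a href="ROI.htm">Info on return on investment</a>. '
    "Now, here is the glossary term in plain text, not inside of a link: "
    "return on investment."
)

# Collect (position, term) for every match the pattern accepts.
matches = [(m.start(), m.group(1)) for m in OUTSIDE_ANCHOR.finditer(text)]

# Link the first accepted occurrence, as the original code does with Global = false.
linked = OUTSIDE_ANCHOR.sub(r'<a href="glossary/roi.htm">\1</a>', text, count=1)

# A term that appears before any anchor should still be accepted.
before = 'ROI matters. <a href="x.htm">a link</a>'
before_matches = [m.group(1) for m in OUTSIDE_ANCHOR.finditer(before)]
print(matches, before_matches)
```

Only the third occurrence in the test passage is accepted: the "ROI" in the href and the phrase inside the link text are both followed by a reachable </a.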
