Regex challenge: Match phrase only if outside of an <a href> tag

I am working on improving our glossary functionality in a custom CMS that is running with classic ASP (ASP 3.0) on IIS with VBScript code. I am stumped on a regex challenge I cannot solve.

Here is the current code:

    ' Only add glossary links when the article contains no "href" at all
    If InStr(ART_ArticleBody, "href") = 0 Then
        sql = "SELECT URL, Term, RegX FROM GLOSSARYDB;"
        Set rsGlossary = Server.CreateObject("ADODB.Recordset")
        rsGlossary.Open sql, strSQLConn
        Set RegExObject = New RegExp
        While Not rsGlossary.EOF
            URL = rsGlossary("URL")
            Phrase = rsGlossary("RegX")
            With RegExObject
                .Pattern = Phrase
                .IgnoreCase = True
                .Global = False
            End With
            Set expressionmatch = RegExObject.Execute(ART_ArticleBody)
            If expressionmatch.Count > 0 Then
                For Each expressionmatched In expressionmatch
                    ' Wrap the first occurrence of the term in a link to its glossary page
                    URL = "<a href=" & URL & ">" & expressionmatched.Value & "</a>"
                    ART_ArticleBody = RegExObject.Replace(ART_ArticleBody, URL)
                Next
            End If
            rsGlossary.MoveNext
        Wend
        rsGlossary.MoveFirst
        Set RegExObject = Nothing
    End If

Instead of skipping glossary links for any article that contains an href, as the code above does, I would like to process every article and have the RegEx pattern skip any glossary match that falls inside an <a> tag.

For example, below is a test passage for this regex entry in my DB: ROI|return on investment|investment return

Here is a link that uses the glossary term: <a href="ROI.htm">Info on return on investment</a>. Now, here is the glossary term in plain text, not inside of a link: return on investment. We want to find the third instance of a match but not find the first two because they are both inside of a HTML link.

In the above text, if I were processing the article for the glossary entry "ROI|return on investment|investment return", I do not want to match the first or second occurrence because they are inside an <a> tag. I need the regex pattern to skip over those and match only occurrences that are not inside an <a> tag.

Any help on this would be greatly appreciated.


Try this regex:

<a\b[^<>]*>[\s\S]*?</a>|(ROI|return on investment|investment return)

This matches an HTML anchor, or any of the terms you're looking for. The terms are captured into group number 1. So in your VBScript code, check if the first capturing group matched anything, and you've got one of your keywords outside an <a> tag.
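As a rough sketch of that check in VBScript: the function below walks every match, copies existing anchors through unchanged, and wraps a term in a link only when capturing group 1 matched. The name LinkGlossaryTerm and its parameters are made up for illustration, and unlike the original code (which uses Global = False) it links every occurrence outside an anchor rather than only the first.

```vbscript
' Hypothetical helper (not from the original code): links glossary terms
' that appear outside existing <a>...</a> elements.
Function LinkGlossaryTerm(body, termPattern, url)
    Dim re, matches, m, result, pos
    Set re = New RegExp
    ' First alternative swallows whole anchors; group 1 captures a bare term
    re.Pattern = "<a\b[^<>]*>[\s\S]*?</a>|(" & termPattern & ")"
    re.IgnoreCase = True
    re.Global = True

    result = ""
    pos = 1   ' next 1-based position in body that has not been copied yet
    Set matches = re.Execute(body)
    For Each m In matches
        ' Copy the text between the previous match and this one
        ' (FirstIndex is 0-based, Mid is 1-based)
        result = result & Mid(body, pos, m.FirstIndex + 1 - pos)
        If Len(m.SubMatches(0)) > 0 Then
            ' Group 1 matched: a term outside any anchor, so link it
            result = result & "<a href=""" & url & """>" & m.Value & "</a>"
        Else
            ' Group 1 did not match: an existing anchor, keep it as-is
            result = result & m.Value
        End If
        pos = m.FirstIndex + 1 + m.Length
    Next
    LinkGlossaryTerm = result & Mid(body, pos)
End Function
```

In the glossary loop this could be called as ART_ArticleBody = LinkGlossaryTerm(ART_ArticleBody, Phrase, URL), assuming the RegX column holds a plain alternation such as ROI|return on investment|investment return.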

This regex won't work correctly if you have nested <a> tags. That shouldn't be a problem, as anchors are normally not nested inside each other. If it is a problem, you can't solve it with VBScript/JavaScript regular expressions. The regex also won't work correctly if you have <a> tags that are missing their closing tags. If you want to take that into account, try this regex:

<a\b[^<>]*>(?:(?:(?!<a\b)[\s\S])*?</a>)?|(ROI|return on investment|investment return)


This problem is, as they say, "non-trivial" in its current state. However, if you could modify your system to output more semantic markup, it would make things much easier:

<a href="ROI.htm">undesired tag match</a>
This is <span class="tag">a tag</span>

In this case, you can simply search:

(?<=<span class=\"tag\">)(phrase1|phrase2|phrase3)(?=</span>)

Or something a little more robust

(?<=<span class=\"tag\">).+?(?=</span>)

This way you can easily focus your searches to data within a specific <span>, and leave everything else aside.
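One caveat for the classic ASP setup in the question: the VBScript RegExp object does not support lookbehind, so the (?<=...) patterns above would not run there as written. A minimal sketch of the same idea using a capturing group instead is shown below; the function name FindTaggedPhrases and the pipe-joined return value are assumptions made for illustration, and it presumes the markup has been changed to use the span class="tag" wrapper suggested above.

```vbscript
' Hypothetical helper: returns the contents of every <span class="tag">...</span>
' in body, joined with "|". Uses a capturing group because VBScript's RegExp
' has no lookbehind.
Function FindTaggedPhrases(body)
    Dim re, m, list
    Set re = New RegExp
    re.Pattern = "<span class=""tag"">([\s\S]+?)</span>"
    re.IgnoreCase = True
    re.Global = True

    list = ""
    For Each m In re.Execute(body)
        If list <> "" Then list = list & "|"
        ' SubMatches(0) holds the text between the opening and closing span tags
        list = list & m.SubMatches(0)
    Next
    FindTaggedPhrases = list
End Function
```

The surrounding tags are still part of each match here, so any replacement would need to write them back out around the captured text.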


You can't solve it because it can't be done, at least not with 100% reliability. HTML is not a "regular" language in the regular expression sense. Like the saying goes, when you have a hammer, everything starts to look like a nail. There are some things regular expressions aren't good at. This is one of them.

Most languages have some form of HTML parsing library as standard or easily obtained. Use those. That's what they were designed for.


In general, you can't use a regular expression to recognize arbitrarily nested constructs (such as bracket-delimited HTML tags). If you had solved this problem, there'd be a lot of mathematicians lining up to hear about it. :)

Having said that, .NET does indeed offer an extension to regular expressions that permits what I just said was impossible, and--even better!--the sample chapter for the great "Mastering Regular Expressions" available here happens to cover that feature.


(accounts receivable|A/R)(?!((?!</?a\b).)*</a)

(phrase1|phrase2|phrase3)(?!((?!</?a\b).)*</a)

The above approach seems to work, at least in my RegexBuddy software. I didn't figure it out on my own. Had some help from a guru. Time to test it in my ASP code. Thanks to all who provided input. I'm sure I didn't describe what I needed well enough for you to come up with the above solution. Mea culpa.
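For anyone trying the same thing, here is a minimal sketch of how that pattern might be dropped into VBScript. The function name LinkOutsideAnchors and its parameters are invented for illustration, and it has the same status as the pattern itself: it seems to behave on the sample text, not a guarantee for arbitrary HTML.

```vbscript
' Hypothetical helper: wraps each term in a link unless the text that follows it
' reaches a closing </a> before any other <a> or </a> tag (i.e. the term sits
' inside an anchor).
Function LinkOutsideAnchors(body, termPattern, url)
    Dim re
    Set re = New RegExp
    re.Pattern = "(" & termPattern & ")(?!((?!</?a\b).)*</a)"
    re.IgnoreCase = True
    re.Global = True   ' set to False to link only the first occurrence
    ' $1 is the matched term; a "$" inside url would be read as a
    ' replacement token, so this assumes the URL contains none
    LinkOutsideAnchors = re.Replace(body, "<a href=""" & url & """>$1</a>")
End Function
```

Note that "." does not match line breaks in VBScript, so a term inside an anchor whose closing </a> is on a later line would still be linked; swapping "." for "[\s\S]" in the pattern is one way to cover that case.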
