Regex - nested patterns - within outer pattern but exclude inner pattern

I have a file with the content below.

<td> ${ dontReplaceMe } ReplaceMe ${dontReplaceMeEither} </td>

I want to match 'ReplaceMe' if it is in the td tag, but NOT if it is in the ${ ... } expression.

Can I do this with regex?

I currently have:

sed '/\${.*?ReplaceMe.*?}/!s/ReplaceMe/REPLACED/g' data.txt


This is not possible.

Regular expressions describe Type-3 languages in the Chomsky hierarchy (regular languages).
Your sample, however, belongs to a Type-2 language (a context-free language).

As soon as any kind of nesting (brackets) is involved, you are dealing with context-free languages, which regular expressions do not cover.

There is basically no way to express "within a matching pair of x and y" in a regular expression, because that would require the regular expression to have some kind of stack, which it doesn't have (it is functionally equivalent to a finite-state automaton).
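To illustrate what the "stack" buys you, here is a minimal Python sketch that scans the text once and keeps a nesting counter (a stack reduced to its depth). The function name and the defaults are made up for illustration, and it assumes that only `${` opens a placeholder and that a bare `}` closes the innermost open one. It does not check for the surrounding td tag.

```python
def replace_outside_placeholders(text, target="ReplaceMe", replacement="REPLACED"):
    """Replace `target` only where it is not inside any ${ ... } block."""
    out = []
    depth = 0  # number of currently open ${ ... } blocks
    i = 0
    while i < len(text):
        if text.startswith("${", i):
            # Entering a (possibly nested) placeholder.
            depth += 1
            out.append("${")
            i += 2
        elif text[i] == "}" and depth > 0:
            # Leaving the innermost open placeholder.
            depth -= 1
            out.append("}")
            i += 1
        elif depth == 0 and text.startswith(target, i):
            # Outside every placeholder: safe to replace.
            out.append(replacement)
            i += len(target)
        else:
            out.append(text[i])
            i += 1
    return "".join(out)
```

Because the counter tracks depth, the nested input `<td>${ ${ dontReplaceMe } ReplaceMe ${dontReplaceMeEither} }</td>` comes back unchanged, which is exactly the part a plain regular expression cannot express.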


Challenged by brandizzi to find a regex that matches at least the trivial cases,
I actually came up with this (painfully hacky) regex pattern:

perl -pe 's/(?<=<td>)((?:(?:\{.*?\})*[^{]*?)*)(ReplaceMe)(.*)(?=<\/td>)/$1REPLACED$3/g'

It does proper (sic!) matching for these cases:

<td> ${ dontReplaceMe } ReplaceMe ${dontReplaceMeEither} </td>
<td> ReplaceMe ${dontReplaceMeEither} </td>
<td> ${ dontReplaceMe } ReplaceMe </td>
<td> ReplaceMe </td>

And fails with this one (nesting is Chomsky Type-2, remember? ;) ):

<td>${ ${ dontReplaceMe } ReplaceMe ${dontReplaceMeEither} }</td>

And it can't replace multiple matches either:

<td> ReplaceMe ReplaceMe </td>
<td> ReplaceMe ${dontReplaceMeEither} ReplaceMe </td>

Getting the leading $ covered was the tricky part.
That, and keeping Reginald/Reggy from crashing constantly while writing this beast.

AGAIN: EXPERIMENTAL, DO NOT EVER USE THIS IN PRODUCTION CODE!

(…or I'll hunt you down, should I ever have to work with your code/app ;)
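For the non-nested cases, including lines with several matches, a common alternative is to match the things you want to skip first and only replace what is left. The following is a hedged Python sketch of that idea, not the pattern above; `replace_flat` is a hypothetical name, and it assumes a placeholder never contains another `}` before its closing one.

```python
import re

# Alternative 1 captures a whole ${ ... } block so it can be put back untouched.
# Alternative 2 is the bare target, reached only outside such blocks.
SKIP_OR_TARGET = re.compile(r"(\$\{[^}]*\})|ReplaceMe")


def replace_flat(text):
    """Replace ReplaceMe outside non-nested ${ ... } blocks."""
    return SKIP_OR_TARGET.sub(
        lambda m: m.group(1) if m.group(1) else "REPLACED",
        text,
    )
```

This handles `<td> ReplaceMe ${dontReplaceMeEither} ReplaceMe </td>`, but it still gets the nested case wrong, because `[^}]*` stops at the first closing brace.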


Well, for such a simple case, you just need to verify that the line does not match ${.*}:

$ sed '/\${.*}/!s/ReplaceMe/REPLACED/' input
<td> REPLACED </td>
<td> ${ don't ReplaceMe } </td>

The ! after the /\${.*}/ sed address negates it, so the substitution runs only on lines that do not contain a ${ ... } expression.
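The same line-level logic can be sketched in Python, which makes the behaviour of the negated address easier to see. The function name is hypothetical; like the sed command without the g flag, it replaces only the first occurrence on each line.

```python
import re

# Same test as the sed address /\${.*}/
PLACEHOLDER = re.compile(r"\$\{.*\}")


def replace_on_clean_lines(lines):
    """Yield each line, substituting only on lines without a ${ ... } block."""
    for line in lines:
        if PLACEHOLDER.search(line):
            # The address matched, so the negated command is skipped.
            yield line
        else:
            yield line.replace("ReplaceMe", "REPLACED", 1)
```

Note that a line mixing both, such as `<td> ${ dontReplaceMe } ReplaceMe ${dontReplaceMeEither} </td>`, is skipped entirely, so its bare ReplaceMe is left alone.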

OTOH, if the case is not that simple, I suspect your problem will grow a lot and regex will not be the best solution.


It is usually a bad idea to use regex when structured markup is involved. In some special cases it might be OK, but there are better tools for parsing HTML, and once the document is parsed you can apply a regex to the text nodes.
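As a rough sketch of "parse first, then regex the text nodes", the following uses Python's standard html.parser. The class and function names are hypothetical, placeholders are assumed not to nest, and comments and doctype declarations are simply dropped, so treat it as an outline rather than a complete rewriter.

```python
import re
from html.parser import HTMLParser

# Capture a whole ${ ... } block to keep it, or match the bare target.
SKIP_OR_TARGET = re.compile(r"(\$\{[^}]*\})|ReplaceMe")


class TdRewriter(HTMLParser):
    """Re-emit the document, rewriting only text nodes inside <td>."""

    def __init__(self):
        # Keep entities as written so they can be passed through unchanged.
        super().__init__(convert_charrefs=False)
        self.out = []
        self.td_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.td_depth += 1
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are copied verbatim.
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "td" and self.td_depth:
            self.td_depth -= 1
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if self.td_depth:
            data = SKIP_OR_TARGET.sub(lambda m: m.group(1) or "REPLACED", data)
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")


def rewrite(html):
    parser = TdRewriter()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

The regex now only ever sees text that the parser has already placed inside a td element, so it no longer has to recognise the tags itself.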


Something like <td>.*(?<!\$\{).*ReplaceMe(?!.*\}).*</td> should work, provided the tool supports negative lookbehind (GNU grep does with the -P option). Note that $ and { need to be escaped to be matched literally.
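One caveat with a lookahead written as (?!.*}) is that it rejects any later } on the line, including one that merely closes a following placeholder. A narrower variant, sketched here in Python under the assumption that placeholders do not nest, only rejects a match when the next brace after it is a closing one. The function name is hypothetical.

```python
import re

# ReplaceMe is rejected when a "}" follows with no "{" or "}" in between,
# i.e. when the match sits inside an open ${ ... } block.
OUTSIDE_BRACES = re.compile(r"ReplaceMe(?![^{}]*\})")


def replace_with_lookahead(text):
    return OUTSIDE_BRACES.sub("REPLACED", text)
```

With nested placeholders this breaks down: in `<td>${ ${ dontReplaceMe } ReplaceMe ${dontReplaceMeEither} }</td>` the next brace after ReplaceMe is an opening one, so the match is wrongly replaced.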


sed -i 's/<td>\sReplaceMe\s<\/td>/<td>Replaced<\/td>/gi' input.file

worked for me.

You may want to use -i.bak instead of -i to keep a backup of the old file, in case of a mistake.

Alternatively,

perl -pi -e 's/<td>\sReplaceMe\s<\/td>/<td>Replaced<\/td>/g' temp

also works. Again, consider -pi.bak instead of -pi to keep a backup.
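For comparison, the same edit-in-place-with-backup pattern can be sketched in Python with the standard fileinput module. The function name is hypothetical; the pattern mirrors the perl one-liner above, so it is case-sensitive and expects exactly one whitespace character on each side.

```python
import fileinput
import re

PATTERN = re.compile(r"<td>\sReplaceMe\s</td>")


def replace_in_file(path):
    """Rewrite `path` in place, keeping the original as `path` + '.bak'."""
    # With inplace=True, print() writes to the file instead of the terminal.
    with fileinput.input(files=[path], inplace=True, backup=".bak") as lines:
        for line in lines:
            print(PATTERN.sub("<td>Replaced</td>", line), end="")
```

Like the one-liners, this only touches cells whose entire content is ReplaceMe, so a cell that also contains a ${ ... } expression is left unchanged.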
