Regex - nested patterns - within outer pattern but exclude inner pattern

2023-03-11 13:47 Q&A author:

I have a file with the content below.

<td> ${ dontReplaceMe } ReplaceMe ${dontReplaceMeEither} </td>

I want to match 'ReplaceMe' if it is in the td tag, but NOT if it is in the ${ ... } expression.

Can I do this with regex?

Currently have:

sed '/\${.*?ReplaceMe.*?}/!s/ReplaceMe/REPLACED/g' data.txt

This is not possible.

Regex can be used for Type-3 Chomsky languages (regular language).
Your sample code however is a Type-2 Chomsky language (context-free language).

As soon as any kind of nesting (brackets) is involved, you're dealing with context-free languages, which regular expressions do not cover.

There is basically no way to express "within a matching pair of x and y" in a regular expression: that would require the regular expression to have some kind of stack, which it doesn't (being functionally equivalent to a finite-state automaton).
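To make the stack point concrete: once the scanner is allowed a counter, nesting stops being a problem. The following is only a minimal sketch, not something from this thread. It assumes awk is acceptable, that every { opens a level and every } closes one (the question only shows ${ ... }), and that expressions never span lines. The function name replace_outside_braces is hypothetical.

```shell
# replace_outside_braces: hypothetical helper, not from the original thread.
# Reads lines on stdin and replaces "ReplaceMe" only where brace depth is 0.
# Assumption: every "{" opens a level, every "}" closes one, per line.
replace_outside_braces() {
  awk -v target="ReplaceMe" -v repl="REPLACED" '
  {
    out = ""; depth = 0; n = length($0); tlen = length(target)
    i = 1
    while (i <= n) {
      c = substr($0, i, 1)
      if (c == "{") { depth++ }                        # entering an expression
      else if (c == "}" && depth > 0) { depth-- }      # leaving an expression
      if (depth == 0 && substr($0, i, tlen) == target) {
        out = out repl; i += tlen                      # outside all braces: replace
      } else {
        out = out c; i++                               # copy the character unchanged
      }
    }
    print out
  }'
}

# The nested input stays untouched, because ReplaceMe sits at depth 1:
printf '%s\n' '<td>${ ${ dontReplaceMe } ReplaceMe ${dontReplaceMeEither} }</td>' | replace_outside_braces
```

The counter is exactly the piece of state a finite-state automaton lacks, which is why this needs a small program rather than a single regular expression.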

Challenged by brandizzi to find a regex that might match at least the trivial cases,
I actually came up with this (painfully hacky) regex pattern:

perl -pe 's/(?<=<td>)((?:(?:\{.*?\})*[^{]*?)*)(ReplaceMe)(.*)(?=<\/td>)/$1REPLACED$3/g'

It does proper (sic!) matching for these cases:

<td> ${ dontReplaceMe } ReplaceMe ${dontReplaceMeEither} </td>
<td> ReplaceMe ${dontReplaceMeEither} </td>
<td> ${ dontReplaceMe } ReplaceMe </td>
<td> ReplaceMe </td>

And fails with this one (nesting is Chomsky Type-2, remember? ;) ):

<td>${ ${ dontReplaceMe } ReplaceMe ${dontReplaceMeEither} }</td>

And it can't replace multiple matches either:

<td> ReplaceMe ReplaceMe </td>
<td> ReplaceMe ${dontReplaceMeEither} ReplaceMe </td>

Getting the leading $ covered was the tricky part.
This and keeping Reginald/Reggy from crashing constantly while writing this beast.

AGAIN: EXPERIMENTAL, DO NOT EVER USE THIS IN PRODUCTION CODE!

(…or I'll hunt you down, should I ever have to work with your code/app ;)

Well, for such a simple case, you just need to verify that the line does not match ${.*}:

$ sed '/\${.*}/!s/ReplaceMe/REPLACED/' input
<td> REPLACED </td>
<td> ${ don't ReplaceMe } </td>

The ! after the /\${.*}/ sed address negates it, so the substitution runs only on lines that do not match the address.
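Note that the address works per line, so a line that contains both a ${ ... } expression and a ReplaceMe outside of it is skipped entirely. A small sketch to show this, using the sample line from the question; the function name replace_on_expr_free_lines is hypothetical.

```shell
# replace_on_expr_free_lines: hypothetical wrapper around the sed command above.
# Lines that contain ${...} anywhere are left completely untouched.
replace_on_expr_free_lines() {
  sed '/\${.*}/!s/ReplaceMe/REPLACED/'
}

printf '%s\n' \
  '<td> ReplaceMe </td>' \
  '<td> ${ dontReplaceMe } ReplaceMe ${dontReplaceMeEither} </td>' |
  replace_on_expr_free_lines
# prints:
# <td> REPLACED </td>
# <td> ${ dontReplaceMe } ReplaceMe ${dontReplaceMeEither} </td>
```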

OTOH, if the case is not that simple, I'd suspect that your problem will grow a lot and regex will not be the best solution.

Usually it is a bad idea to use regex when structured markup is involved. In some special cases it might be OK, but there are better tools for parsing HTML, and you can then use regex on the text nodes.

Something like <td>.*(?<!${).*ReplaceMe(?!.*}).*</td> should work, if grep supports negative lookbehinds (I don't remember if it does).

sed -i 's/<td>\sReplaceMe\s<\/td>/<td>Replaced<\/td>/gi' input.file

worked for me.

You may want to use -i.bak instead of -i to back up the old file, in case of a mistake.

alternatively,

perl -pi -e 's/<td>\sReplaceMe\s<\/td>/<td>Replaced<\/td>/g' temp

also works; again, note that -pi.bak keeps a backup of the original file.
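A quick way to see what the backup suffix does is to run it on a throwaway file. This is only a sketch: it uses a simplified substitution rather than the exact command above, works in a temporary directory created with mktemp, and the function name backup_demo is hypothetical.

```shell
# backup_demo: hypothetical helper, not from the original thread.
# Edits a scratch file in place with -i.bak, then prints the edited file
# followed by the backup that sed left next to it.
backup_demo() {
  workdir=$(mktemp -d) || return 1
  printf '%s\n' '<td> ReplaceMe </td>' > "$workdir/input.file"
  sed -i.bak 's/ReplaceMe/REPLACED/' "$workdir/input.file"
  cat "$workdir/input.file"        # the edited file
  cat "$workdir/input.file.bak"    # the untouched original
  rm -r "$workdir"
}

backup_demo
# prints:
# <td> REPLACED </td>
# <td> ReplaceMe </td>
```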

