A regular expression to clean up XML
I have to deal with XML data that sometimes contains unescaped ampersands, and I can't get the producer to either escape them as &amp; or put them into a CDATA section.
Now I'm looking for a regular expression that replaces & with &amp; when it is not part of an entity. Something like this: &(?!(amp|apos|quot|lt|gt);)
Unfortunately, my programming environment only supports "extended POSIX 1003.2 regular expressions" (see http://www.kernel.org/doc/man-pages/online/pages/man7/regex.7.html), which seem to lack the negative lookahead "(?!...)" needed here.
Any ideas how to craft the necessary regular expression?
Lateral thinking: replace all & with &amp;, then replace all &amp;apos; (etc.) with &apos; (for example)? You can use a group to capture the part to be put back: &amp;(apos);
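The two-pass idea above can be sketched as follows. This is only an illustration in Python, whose re module is not a POSIX engine; the pattern deliberately uses nothing beyond grouping and alternation, which extended POSIX regular expressions also provide, but it assumes your environment supports references to captured groups in the replacement text. The names escape_bare_ampersands and ENTITY_NAMES are made up for this example.

```python
import re

# Entity names taken from the question; extend the alternation if needed.
ENTITY_NAMES = "amp|apos|quot|lt|gt"


def escape_bare_ampersands(text):
    """Escape every '&' that does not start one of the known entities."""
    # Pass 1: escape every ampersand, including those that start an entity.
    escaped = text.replace("&", "&amp;")
    # Pass 2: undo the double escaping of entities that were already there,
    # e.g. "&amp;lt;" becomes "&lt;" again. The group captures the name
    # so it can be put back in the replacement.
    return re.sub(r"&amp;(" + ENTITY_NAMES + r");", r"&\1;", escaped)


print(escape_bare_ampersands("a & b &lt; c"))
```

With the input a & b &lt; c this prints a &amp; b &lt; c: the bare ampersand is escaped and the existing entity is left alone. An entity name without its trailing semicolon (such as &quot) is treated as a bare ampersand, which matches the lookahead pattern from the question.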
Instead of searching for something matching a negative regex you could search for something NOT matching a positive regex, something like:
! ... &(?(amp|apos|quot|lt|gt);)
I did not read the whole page you linked, but I am pretty sure it should be possible.