How do I use back-references and capture groups correctly in a Java regex?
I want to strip a SOAP envelope from a message to get at the XML in the body.
I attempted the following;
String strippedOfEnvelopedHeader = msg.replaceAll("(?s)(?i)<(.*):Envelope.*<\1:Body>", "");
I thought that this would strip out the SOAP envelope, specifically the header, from a message like;
<soapenv:Envelope xmlns:soapenv='http://schemas.xmlsoap.org/soap/envelope/'>
<env:Header xmlns:env='http://schemas.xmlsoap.org/soap/envelope/' xmlns:xsi='http://www.w3.org/2001/XMLSchema-instance'/>
<soapenv:Body>
<myXML> stuff is here</myXML>
</soapenv:Body>
</soapenv:Envelope>
which should result in;
<myXML> stuff is here</myXML>
</soapenv:Body>
</soapenv:Envelope>
However, the group back-reference does not seem to work.
If I replace both the capture group and the back-reference with the literal prefix soapenv, the substitution works fine;
String strippedOfEnvelopeHeader = msg.replaceAll("(?i)(?s)<soapenv:Envelope.*<soapenv:Body>", "");
I think I can guess the problem: the capture group is being greedy and grabbing the entire message, and thus failing the match.
But the solution evades me.
Any ideas?
Try two backslashes:
"(?si)<(.*):Envelope.*<\\1:Body>"
You need two because \1 is itself already a special escape sequence (an octal escape) in a Java string literal, so it is decoded into the character U+0001 before the string reaches the regex engine. You need to protect it by adding one more backslash.
(And the usual "don't parse XML with Regex" warning follows...)
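To make the escaping point concrete, here is a minimal sketch comparing the two string literals. The class name BackslashDemo, the helper found, and the shortened sample message are my own illustration, not part of the question.

```java
import java.util.regex.Pattern;

class BackslashDemo {
    // One backslash: javac treats \1 as an octal escape, so this string
    // contains the control character U+0001 where the back-reference should be.
    static final String SINGLE = "<(.*):Envelope.*<\1:Body>";

    // Two backslashes: the string contains a literal backslash followed by 1,
    // which the regex engine reads as a back-reference to group 1.
    static final String DOUBLE = "<(.*):Envelope.*<\\1:Body>";

    // True if the pattern is found anywhere in msg
    // (dot matches newlines, matching is case-insensitive).
    static boolean found(String regex, String msg) {
        return Pattern.compile(regex, Pattern.DOTALL | Pattern.CASE_INSENSITIVE)
                      .matcher(msg).find();
    }

    public static void main(String[] args) {
        // Hypothetical shortened message, modelled on the one in the question.
        String msg = String.join("\n",
            "<soapenv:Envelope xmlns:soapenv='http://schemas.xmlsoap.org/soap/envelope/'>",
            "<soapenv:Body>",
            "<myXML> stuff is here</myXML>",
            "</soapenv:Body>",
            "</soapenv:Envelope>");

        System.out.println(SINGLE.indexOf((char) 1) >= 0); // true: U+0001 is in the string
        System.out.println(DOUBLE.contains("\\1"));        // true: backslash + '1' survived
        System.out.println(found(SINGLE, msg));            // false
        System.out.println(found(DOUBLE, msg));            // true
    }
}
```

Note that the greedy (.*) still finds a match here because the engine backtracks until the captured prefix makes the back-reference succeed; the missing backslash is what broke the original attempt.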
Try this:
String strippedOfEnvelopedHeader = msg.replaceAll("(?s)<(\\w+):Envelope[^<>]*>.*?<\\1:Body>", "");
Key points:
- As already pointed out by others, backslashes in Java strings need to be escaped. So every backslash in your regex becomes a double backslash when formatting the regex as a Java string.
- You're using the dot inappropriately. You cannot have any character as the XML namespace. You cannot have any character inside XML tags. Make your regex more specific by using (negated) character classes, and you'll easily avoid problems with .* eating up more than it should. I left one .*? in my regex because I don't know the structure of all the other text you'll be using this regex with. But if it will always have the one <env:Header> element, then you should replace the .*? in my regex with \s*<env:Header[^<>]*>\s* or whatever is sufficiently specific to avoid runaway matches while still matching everything you want to.
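Here is a small sketch of both variants applied to the message from the question. The class name StripEnvelopeHeader, the method names, and the SAMPLE constant are made up for the example; the regexes are the ones described above, with backslashes doubled for the Java string.

```java
class StripEnvelopeHeader {
    // The message from the question, joined with newlines.
    static final String SAMPLE = String.join("\n",
        "<soapenv:Envelope xmlns:soapenv='http://schemas.xmlsoap.org/soap/envelope/'>",
        "<env:Header xmlns:env='http://schemas.xmlsoap.org/soap/envelope/' xmlns:xsi='http://www.w3.org/2001/XMLSchema-instance'/>",
        "<soapenv:Body>",
        "<myXML> stuff is here</myXML>",
        "</soapenv:Body>",
        "</soapenv:Envelope>");

    // General variant: a lazy .*? skips whatever sits between the
    // Envelope start tag and the matching Body start tag.
    static String stripHeader(String msg) {
        return msg.replaceAll("(?s)<(\\w+):Envelope[^<>]*>.*?<\\1:Body>", "");
    }

    // Stricter variant: assumes exactly one <env:Header> element sits
    // between the two tags, so no dot is needed at all.
    static String stripHeaderStrict(String msg) {
        return msg.replaceAll(
            "(?s)<(\\w+):Envelope[^<>]*>\\s*<env:Header[^<>]*>\\s*<\\1:Body>", "");
    }

    public static void main(String[] args) {
        // Both calls leave the body content and the two closing tags,
        // preceded by the newline that followed <soapenv:Body>.
        System.out.println(stripHeader(SAMPLE));
        System.out.println(stripHeaderStrict(SAMPLE));
    }
}
```

The stricter variant will simply not match (and so change nothing) if the header is missing or uses a different prefix, which is usually the safer failure mode.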
If you want to remove the closing tags too, try this:
String strippedOfEnvelopedHeader = msg.replaceAll("(?s)<(\\w+):Envelope[^<>]*>.*?<\\1:Body>\\s*(.*?)\\s*</\\1:Body>\\s*</\\1:Envelope>", "$2");
In this regex, the second .*? is appropriate if you want to remove the tags regardless of what is inside them.
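As a quick check of the full-unwrap version, here is a sketch run against the message from the question. The class name UnwrapSoapBody, the method unwrap, and the SAMPLE constant are hypothetical; the regex and the "$2" replacement are the ones given above.

```java
class UnwrapSoapBody {
    // The message from the question, joined with newlines (no trailing newline).
    static final String SAMPLE = String.join("\n",
        "<soapenv:Envelope xmlns:soapenv='http://schemas.xmlsoap.org/soap/envelope/'>",
        "<env:Header xmlns:env='http://schemas.xmlsoap.org/soap/envelope/' xmlns:xsi='http://www.w3.org/2001/XMLSchema-instance'/>",
        "<soapenv:Body>",
        "<myXML> stuff is here</myXML>",
        "</soapenv:Body>",
        "</soapenv:Envelope>");

    // Group 1 captures the namespace prefix, group 2 the body content.
    // In a Java replacement string, groups are referenced as $1, $2, ...
    static String unwrap(String msg) {
        return msg.replaceAll(
            "(?s)<(\\w+):Envelope[^<>]*>.*?<\\1:Body>\\s*(.*?)\\s*</\\1:Body>\\s*</\\1:Envelope>",
            "$2");
    }

    public static void main(String[] args) {
        System.out.println(unwrap(SAMPLE)); // <myXML> stuff is here</myXML>
    }
}
```

The \\s* on either side of group 2 is what trims the line breaks around the payload.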
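As an aside, why don't you try to get rid of the entire SOAP message wrapper? Note that this needs replaceAll (replace treats its first argument as literal text), the flags must come first so that the x flag applies to the whole pattern, and group 2 is written $2 in a Java replacement string:
String strippedOfEnveloped = msg.replaceAll( "(?six) ^ < (.*):Envelope .* <\\1:Body> (.*) </\\1:Body> .* $", "$2" );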