Regex to strip all square brackets except those coming after a certain prefix
So, I have a string. Most of the time, if the string has square brackets in it, bad things will happen. In a few cases, however, it's necessary to keep the brackets. These brackets that need to be kept are identified by a certain prefix. E.g., if the string is:
apple][s [pears] prefix:[oranges] lemons ]persimmons[ pea[ches ap]ricots [][[]]][]
what I want to turn it into is:
apples pears prefix:[oranges] lemons persimmons peaches apricots
I've come up with a Rube Goldberg mess of a solution, which looks like this:
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Debracketizer
{
    public static void main( String[] args )
    {
        String orig = "apples [pears] prefix:[oranges] lemons ]persimmons[ pea[ches ap]ricots";
        String result = debracketize(orig);
        System.out.println(result);
    }

    private static String debracketize( String orig )
    {
        // Pass 1: drop every '[' that is not directly preceded by "prefix:".
        String result1 = replaceAll(orig,
                                    Pattern.compile("\\["),
                                    "",
                                    ".*prefix:$");
        // Pass 2: drop every ']' that does not close a kept "prefix:[...".
        String result2 = replaceAll(result1,
                                    Pattern.compile("\\]"),
                                    "",
                                    ".*prefix:\\[[^\\]]+$");
        return result2;
    }

    // Replaces each match of pattern, unless the text before the match matches skipPattern.
    private static String replaceAll( String orig, Pattern pattern,
                                      String replacement, String skipPattern )
    {
        String quotedReplacement = Matcher.quoteReplacement(replacement);
        Matcher matcher = pattern.matcher(orig);
        StringBuffer sb = new StringBuffer();
        while( matcher.find() )
        {
            String resultSoFar = orig.substring(0, matcher.start());
            if (resultSoFar.matches(skipPattern)) {
                // Protected bracket: put the matched text back unchanged.
                matcher.appendReplacement(sb, matcher.group());
            } else {
                matcher.appendReplacement(sb, quotedReplacement);
            }
        }
        matcher.appendTail(sb);
        return sb.toString();
    }
}
I'm sure there must be a better way to do this -- ideally with one simple regex and one simple String.replaceAll(). But I haven't been able to come up with it.
(I asked a partial version of this question earlier, but I can't see how to adapt the answer to the full case. Which will teach me to ask partial questions.)
This one-liner:
String resultString = subjectString.replaceAll("(?<!prefix:(?:\\[\\w{0,2000000})?)[\\[\\]]", "");
when applied to: apple][s [pears] prefix:[oranges] lemons ]persimmons[ pea[ches ap]ricots [][[]]][]
will give you the result you seek:
apples pears prefix:[oranges] lemons persimmons peaches apricots
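The only limitation is the maximum number of characters that the word between the brackets of prefix:[] can have. In this case the limit is 2000000. The limit is needed because Java does not support unbounded repetition inside a lookbehind. Note also that \w matches only word characters, so a protected value containing spaces or punctuation would still lose its closing bracket.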
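As a hedged sketch of the same lookbehind idea, the variant below swaps \w for a non-bracket character class, so a protected value containing spaces keeps its closing bracket too. The class name, the MAX_LEN bound of 100 and the second sample string in the test are assumptions for illustration, not part of the answer above.

```java
final class LookbehindDebracketizer {
    // Hypothetical upper bound on the length of the protected content;
    // Java needs some finite bound inside a lookbehind.
    private static final int MAX_LEN = 100;

    // A bracket is removed unless the text right before it ends with "prefix:"
    // or with "prefix:[" followed only by non-bracket characters.
    private static final String REGEX =
        "(?<!prefix:(?:\\[[^\\[\\]]{0," + MAX_LEN + "})?)[\\[\\]]";

    static String debracketize(String input) {
        return input.replaceAll(REGEX, "");
    }

    public static void main(String[] args) {
        System.out.println(debracketize("[pears] prefix:[blood oranges] ]lemons["));
    }
}
```

One caveat of this sketch: a stray bracket that follows an unclosed prefix:[ within MAX_LEN non-bracket characters is kept as well, which may or may not be what you want.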
Don't go the way of regex, for that path will forever darken your way. Consider the following or a variation thereof. Split the string on a reasonable separator (maybe "prefix:[") and be smart about removing the rest of the brackets.
Here is an off-the-cuff algorithm (StringUtils is org.apache.commons.lang.StringUtils):
- Split the string by "prefix:[". StringUtils.splitByWholeSeparator() appears to be a good candidate for this (below, the return value is stored in blam).
- Remove all "[" chars from the result strings. Maybe do StringUtils.remove(s, '[') on each element of blam.
- For each string in blam do the following:
  - If it is the first string, remove all "]" chars, e.g. StringUtils.remove(blam[0], ']'). Replace blam[0] with this string.
  - If it is not the first string:
    - Split the string using the separator ']' (the return value is stored in kapow).
    - Construct a string (named smacky) by appending each element of kapow. After adding the 0th element, append ']' to smacky.
    - Replace the string at blam[index] with smacky.
- Construct the final result by appending all the strings in the blam array, putting back the "prefix:[" separator (which the split removed) before every string except the first.
- Dance a jig of happiness.
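The steps above can be sketched with plain JDK calls instead of StringUtils. This is a minimal sketch under stated assumptions: the class and method names are made up for illustration, the separator is taken to be "prefix:[", and an unclosed "prefix:[" is assumed to lose its bracket like any other.

```java
import java.util.regex.Pattern;

final class SplitDebracketizer {
    private static final String MARKER = "prefix:[";

    static String debracketize(String input) {
        // A limit of -1 keeps trailing empty pieces, so a marker at the very end is not lost.
        String[] pieces = input.split(Pattern.quote(MARKER), -1);
        // Nothing in the first piece is protected.
        StringBuilder out = new StringBuilder(stripBrackets(pieces[0]));
        for (int i = 1; i < pieces.length; i++) {
            String piece = pieces[i];
            int close = piece.indexOf(']');
            if (close < 0) {
                // Assumption: an unclosed marker is not protected, so its '[' is dropped.
                out.append("prefix:").append(stripBrackets(piece));
            } else {
                // Keep the marker and the first ']' after it; strip everything else.
                out.append(MARKER)
                   .append(stripBrackets(piece.substring(0, close)))
                   .append(']')
                   .append(stripBrackets(piece.substring(close + 1)));
            }
        }
        return out.toString();
    }

    static String stripBrackets(String s) {
        return s.replace("[", "").replace("]", "");
    }

    public static void main(String[] args) {
        System.out.println(debracketize("apple][s [pears] prefix:[oranges] lemons ]persimmons["));
    }
}
```

Unlike the regex answers, this version also keeps an empty prefix:[], since it only looks for the first ']' after the marker.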
Interesting problem. Here is an alternative tested solution which does not use lookbehind.
public class TEST
{
    public static void main( String[] args )
    {
        String orig = "apples [pears] prefix:[oranges] lemons ]persimmons[ pea[ches ap]ricots";
        String result = debracketize(orig);
        System.out.println(result);
    }

    private static String debracketize( String orig )
    {
        String re = // Don't indent to allow wide regex comments.
"(?x) # Set free-spacing mode. \n" +
"# Either capture (and put back via replace) stuff to be kept... \n" +
" ( # $1: Stuff to be kept. \n" +
" prefix:\\[[^\\[\\]]+\\] # Either the special sequence, \n" +
" | (?: # or... \n" +
" (?! # (Begin negative lookahead.) \n" +
" prefix: # If this is NOT the start \n" +
" \\[[^\\[\\]]+\\] # of the special sequence, \n" +
" ) # (End negative lookahead.) \n" +
" [^\\[\\]] # then match one non-bracket char. \n" +
" )+ # Do this one char at a time. \n" +
" ) # End $1: Stuff to be kept. \n" +
"| # Or... Don't capture stuff to be removed (un-special brackets)\n" +
" [\\[\\]]+ # One or more non-special brackets.";
        return orig.replaceAll(re, "$1");
    }
}
This method uses two global alternatives. The first alternative captures (and then replaces) the special sequence and non-bracket chars, and the second alternative matches (and removes) the non-special brackets.
If you have a pair of characters that you aren't worried about appearing in the raw input (such as <>), then you can first translate the square brackets you wish to keep into these, strip the remainder, and change the translated brackets back.
Here it is in Ruby (porting to Java hopefully isn't too hard; you just need a global search-and-replace with capture groups):
>> s = 'apple][s [pears] prefix:[oranges] lemons ]persimmons[ pea[ches ap]ricots [][[]]][]'
=> "apple][s [pears] prefix:[oranges] lemons ]persimmons[ pea[ches ap]ricots [][[]]][]"
>> s.gsub(/([^\[\]]+):\[([^\[\]]+)\]/, '\1:<\2>').gsub(/[\[\]]/,'').gsub(/</,'[').gsub(/>/,']')
=> "apples pears prefix:[oranges] lemons persimmons peaches apricots "
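A possible Java port of the same three passes might look like the sketch below. It is an illustration rather than a tested translation of the Ruby: the class name is made up, it protects only the literal prefix: marker (the Ruby pattern protects any text followed by :[...]), and it assumes '<' and '>' never occur in the input.

```java
final class PlaceholderDebracketizer {
    static String debracketize(String input) {
        return input
            // 1. Protect the brackets to keep by rewriting them as <...>.
            .replaceAll("prefix:\\[([^\\[\\]]+)\\]", "prefix:<$1>")
            // 2. Remove every remaining square bracket.
            .replaceAll("[\\[\\]]", "")
            // 3. Turn the placeholders back into square brackets.
            .replace('<', '[')
            .replace('>', ']');
    }

    public static void main(String[] args) {
        System.out.println(debracketize("apple][s [pears] prefix:[oranges] lemons ]persimmons["));
    }
}
```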
1. Find the match(es) of prefix:\[[^\]]+\] in the string.
2. Split the string using the same regex.
3. In each resulting array element, remove every ] and [ (your example yields two elements).
4. Join the elements back together, putting the match(es) from step 1 between them.
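These four steps can be folded into a single Matcher loop, which avoids keeping the matches and the split pieces in separate arrays. The following is a rough sketch only; the class and helper names are hypothetical, and the pattern is the one from step 1.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MatchAndJoinDebracketizer {
    // Same pattern as in step 1: the protected "prefix:[...]" sequence.
    private static final Pattern PROTECTED = Pattern.compile("prefix:\\[[^\\]]+\\]");

    static String debracketize(String input) {
        Matcher m = PROTECTED.matcher(input);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            // Text between protected sequences: strip all brackets.
            out.append(stripBrackets(input.substring(last, m.start())));
            // The protected sequence itself is copied unchanged.
            out.append(m.group());
            last = m.end();
        }
        // Whatever follows the last protected sequence.
        out.append(stripBrackets(input.substring(last)));
        return out.toString();
    }

    private static String stripBrackets(String s) {
        return s.replace("[", "").replace("]", "");
    }

    public static void main(String[] args) {
        System.out.println(debracketize("[x] prefix:[a] [y] prefix:[b]"));
    }
}
```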
Here's your regex solution:
input.replaceAll("((?<!prefix:)\\[(?!oranges)|(?<!prefix:\\[oranges)\\])", "");
It uses negative lookbehinds (plus a negative lookahead after the opening bracket) to prevent the removal of the square brackets around the protected term. The protected term is hard-coded, so this only works when you know it in advance. To protect several terms, change oranges to (oranges|apples|pears) in the regex.
Here's a test using your data:
public static void main(String... args) throws InterruptedException {
    String input = "apple][s [pears] prefix:[oranges] lemons ]persimmons[ pea[ches ap]ricots [][[]]][]";
    String result = input.replaceAll("((?<!prefix:)\\[(?!oranges)|(?<!prefix:\\[oranges)\\])", "");
    System.out.println(result);
}
Output:
apples pears prefix:[oranges] lemons persimmons peaches apricots