
Remove whitespace characters from a String instance

Is there another way to remove whitespace characters from a String?

1) Besides the ones I already know:

myString.trim()

Pattern.compile("\\s");

2) Is there any reason to look for a different method than the ones I'm using?


Guava has a preconfigured CharMatcher for whitespace, CharMatcher.whitespace(). It handles Unicode whitespace as well.

Sample usage:

System.out.println(CharMatcher.whitespace().removeFrom("H \ne\tl\u200al \to   "));

Output:

Hello

CharMatcher also has many other nice features. One of my favorites is the collapseFrom() method, which replaces each run of consecutive matching characters with a single replacement character:

System.out.println(
    CharMatcher.whitespace().collapseFrom("H \ne\tl\u200al \to   ", '*'));

Output:

H*e*l*l*o*


You can simply use myString.replaceAll("\\s", ""). But:

  • note the comment about Unicode whitespace: by default, \s matches only ASCII whitespace
  • the above will remove newlines. If you don't want newlines removed, exclude them from the regex.
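
As a rough sketch of the second point, assuming you want to keep line breaks: the character class [^\S\r\n] matches any \s character except carriage return and line feed. The class and method names here are made up for illustration.

```java
public class KeepNewlines {
    // Removes every character matched by \s (ASCII whitespace by default).
    public static String removeAllWhitespace(String s) {
        return s.replaceAll("\\s", "");
    }

    // [^\S\r\n] reads as "not (non-whitespace, CR, or LF)",
    // i.e. whitespace other than line breaks.
    public static String removeWhitespaceExceptNewlines(String s) {
        return s.replaceAll("[^\\S\\r\\n]", "");
    }

    public static void main(String[] args) {
        String text = "a b\tc\nd e";
        System.out.println(removeAllWhitespace(text));            // abcde
        System.out.println(removeWhitespaceExceptNewlines(text)); // abc, then de on the next line
    }
}
```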


The reason to keep looking for different techniques is to find one that does what you really want. For example, trim() only removes the whitespace from the beginning and end of the string. To get the same effect with a regex, you have to do something like this:

s = s.replaceAll("^\\s+|\\s+$", "");
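
If that replacement runs often, one option is to compile the pattern once and reuse it, since String.replaceAll() compiles its regex on every call. A minimal sketch, with hypothetical class and method names:

```java
import java.util.regex.Pattern;

public class RegexTrim {
    // Leading or trailing runs of \s; compiled once and reused.
    private static final Pattern EDGE_WHITESPACE = Pattern.compile("^\\s+|\\s+$");

    public static String trimWithRegex(String s) {
        return EDGE_WHITESPACE.matcher(s).replaceAll("");
    }

    public static void main(String[] args) {
        // Whitespace inside the string is left alone.
        System.out.println("[" + trimWithRegex("  \t hello world \n") + "]"); // [hello world]
    }
}
```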

And then there's the matter of exactly which characters are removed. Pre-Java 7, \s matches only ASCII whitespace characters, i.e.:

"[\\u0009\\u000A\\u000B\\u000C\\u000D\\u0020]"

...while (as Peter observed) trim() simple-mindedly removes all characters at or below codepoint 32 (U+0020 in Unicode notation). I suspect the thinking here was that the other characters are extremely unlikely to appear in a string anyway, and if they do, you probably want to get rid of them. (It works for me, anyway. ☺) But it's something you should be aware of. Here's some code that demonstrates the difference between trim() and the regex approach:

String s = "\u0000\u0001\u0002\u0003\u0004\u0005\u0006\u0007"
         + "\u0008\u0009\n\u000B\u000C\r\u000E\u000F"
         + "\u0010\u0011\u0012\u0013\u0014\u0015\u0016\u0017"
         + "\u0018\u0019\u001A\u001B\u001C\u001D\u001E\u001F"
         + "\u0020\u00A0";
System.out.println(s.length());
System.out.println(s.trim().length());
System.out.println(s.replaceAll("\\s", "").length());

output:

34
1
28

The one remaining character in the second line of output is a non-breaking space (U+00A0, or "NBSP" henceforth). There are a lot more whitespace characters once you get outside the ASCII range, but the one you're most likely to encounter is the NBSP. Neither trim() nor the regex removed it, but watch what happens when you change the last line of code to this:

System.out.println(s.replaceAll("(?U)\\s", "").length());

...and run it under Java 7:

34
1
27

By adding the (?U), I turned on UNICODE_CHARACTER_CLASSES mode, as mentioned by @tchrist in his comment. NBSP is a whitespace character, no matter what Character.isWhitespace() says, but that doesn't mean you'll always want to include it in your whitespace matches. That's why Guava (mentioned by @Sean) also includes a BREAKING_WHITESPACE CharMatcher.

In sum, to choose the right tool for removing whitespace, you need to know exactly which whitespace characters you want to remove, and exactly where you want to remove them from. It's not all that complicated, but it's not as simple as legacy tools like trim() and StringTokenizer pretend it is, either.
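
One way to make the "which characters" decision explicit is to pass it in as a predicate. This is only a sketch, assuming Java 8 or later; WhitespaceStripper and removeIf are names invented for the example.

```java
import java.util.function.IntPredicate;

public class WhitespaceStripper {
    // Removes every code point for which the predicate returns true.
    public static String removeIf(String s, IntPredicate shouldRemove) {
        StringBuilder sb = new StringBuilder(s.length());
        s.codePoints()
         .filter(cp -> !shouldRemove.test(cp))
         .forEach(sb::appendCodePoint);
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = "a\u00A0b c\td";
        // Character.isWhitespace() does not treat NBSP as whitespace.
        System.out.println(removeIf(s, Character::isWhitespace).length()); // 5
        // Character.isSpaceChar() covers NBSP and space, but not tab.
        System.out.println(removeIf(s, Character::isSpaceChar).length());  // 5
        // Combining both removes all three.
        System.out.println(removeIf(s,
                cp -> Character.isWhitespace(cp) || Character.isSpaceChar(cp))); // abcd
    }
}
```

The caller decides what counts as whitespace; the method itself only decides where (everywhere in the string).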


trim() removes leading and trailing characters with code points 0 through 32 (U+0000 to U+0020). That covers most ASCII whitespace, but it also covers all the control characters in that range. It doesn't remove anything inside the String.

for(int i=Character.MIN_CODE_POINT;i<=Character.MAX_CODE_POINT;i++)
  if(Character.isWhitespace(i))
    System.out.println(i);

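prints (one value per line, joined here for brevity; the exact list depends on the Unicode version your Java release supports)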

9 10 11 12 13 28 29 30 31 32 5760 6158 8192 8193 8194 8195 8196 8197 8198 8200 8201 8202 8232 8233 8287 12288


I was porting some code from C# to Java and needed to simulate XmlNode.OuterXml and XmlNode.InnerXml. I used Transformer for this, but for some reason it doesn't handle some whitespace correctly even with indentation turned off. So my other option was to post-process the string, which contained carriage returns, linefeeds and tabs, with a regex, using one of these two calls:

string = string.replaceAll("[\t\n\b\r\f]+ *", "");
string = string.replaceAll("\\s+", "");

The first removes runs of tabs, line breaks, backspaces and form feeds together with any spaces that follow them. The second removes all whitespace, including single spaces between words, so the two are not strictly equivalent. Hope this is at least a little bit relevant. The second one is probably the better choice if you want every whitespace character gone.
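
A quick sketch comparing the two kinds of pattern on a small XML string. The class and method names are made up, and the first pattern leaves out \b. A pattern limited to tabs and line breaks keeps the spaces inside text content, while \s+ removes those too.

```java
public class XmlWhitespace {
    // Drops each run of tabs / line breaks together with any spaces that follow it.
    public static String stripLineBreaksAndIndent(String xml) {
        return xml.replaceAll("[\\t\\n\\r\\f]+ *", "");
    }

    // Drops every whitespace character, including spaces between words.
    public static String stripAllWhitespace(String xml) {
        return xml.replaceAll("\\s+", "");
    }

    public static void main(String[] args) {
        String xml = "<a>\n  <b>x y</b>\r\n</a>";
        System.out.println(stripLineBreaksAndIndent(xml)); // <a><b>x y</b></a>
        System.out.println(stripAllWhitespace(xml));       // <a><b>xy</b></a>
    }
}
```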


(1) myString.replace(" ", ""), which removes only plain space characters (U+0020).

(2) Perhaps for performance tuning; other than that, I dunno.
