Remove WhiteSpace Chars from String instance

2023-04-01 11:41

Is there another way to remove whitespace char(s) from a String?

1) other than the ones I know:

myString.trim()

Pattern.compile("\\s");

2) Is there any reason to look for a different method than the ones I am using?

Guava has a preconfigured CharMatcher for whitespace(). It works with Unicode as well.

Sample usage:

System.out.println(CharMatcher.whitespace().removeFrom("H \ne\tl\u200al \to   "));

Output:

Hello

The CharMatcher also has many other nice features. One of my favorites is the collapseFrom() method, which replaces multiple occurrences with a single character:

System.out.println(
    CharMatcher.whitespace().collapseFrom("H \ne\tl\u200al \to   ", '*'));

Output:

H*e*l*l*o*
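
If Guava is not on the classpath, something similar can be approximated with plain regexes. This is only a rough sketch, assuming Java 7 or later so that (?U) is available; the class and method names are made up for illustration, and the set of characters matched may not be identical to what CharMatcher.whitespace() matches.

```java
import java.util.regex.Matcher;

class PlainJdkWhitespace {
    // (?U) makes \s follow the Unicode White_Space property (Java 7+),
    // so characters such as U+200A (hair space) are matched as well.
    static String removeWhitespace(String s) {
        return s.replaceAll("(?U)\\s+", "");
    }

    // Replaces each run of whitespace with a single replacement character.
    // quoteReplacement guards against '$' or '\' being used as the replacement.
    static String collapseWhitespace(String s, char replacement) {
        String quoted = Matcher.quoteReplacement(String.valueOf(replacement));
        return s.replaceAll("(?U)\\s+", quoted);
    }

    public static void main(String[] args) {
        String input = "H \ne\tl\u200al \to   ";
        System.out.println(removeWhitespace(input));
        System.out.println(collapseWhitespace(input, '*'));
    }
}
```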

You can simply use myString.replaceAll("\\s", ""). But:

- Note the comment about Unicode whitespace.
- The above will remove newlines. If you don't want newlines removed, exclude them from the regex.
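
One way to exclude newlines, as a minimal sketch: the character class [^\S\n] is a double negative that reads as "whitespace, except a line feed". The class and method names here are hypothetical, and the sketch assumes you want to keep only \n (a \r is still removed).

```java
class KeepNewlines {
    // Hypothetical helper: removes every character matched by \s except '\n'.
    // [^\S\n] matches anything that is neither non-whitespace nor a line feed.
    static String stripButKeepNewlines(String s) {
        return s.replaceAll("[^\\S\\n]", "");
    }

    public static void main(String[] args) {
        // Spaces, the tab and the carriage return go away; the line feed stays.
        String input = "a \t b\r\nc  d";
        System.out.println(stripButKeepNewlines(input));
    }
}
```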

The reason to keep looking for different techniques is to find one that does what you really want. For example, trim() only removes the whitespace from the beginning and end of the string. To get the same effect with a regex, you have to do something like this:

s = s.replaceAll("^\\s+|\\s+$", "");
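
The same idea can be split up if you only want one end of the string cleaned. A small sketch, with made-up method names, that uses the default (ASCII-only) meaning of \s:

```java
class EdgeWhitespace {
    // Removes whitespace at the start of the string only.
    static String removeLeading(String s) {
        return s.replaceAll("^\\s+", "");
    }

    // Removes whitespace at the end of the string only.
    static String removeTrailing(String s) {
        return s.replaceAll("\\s+$", "");
    }

    // Removes whitespace at both ends; inner whitespace is left alone.
    static String removeBoth(String s) {
        return s.replaceAll("^\\s+|\\s+$", "");
    }

    public static void main(String[] args) {
        String s = "  a b  ";
        System.out.println("[" + removeLeading(s) + "]");
        System.out.println("[" + removeTrailing(s) + "]");
        System.out.println("[" + removeBoth(s) + "]");
    }
}
```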

And then there's the matter of exactly which characters are removed. Pre-Java 7, \s matches only ASCII whitespace characters, i.e.:

"[\\u0009\\u000A\\u000B\\u000C\\u000D\\u0020]"

...while (as Peter observed) trim() simple-mindedly removes all characters at or below codepoint 32 (U+0020 in Unicode notation). I suspect the thinking here was that the other characters are extremely unlikely to appear in a string anyway, and if they do, you probably want to get rid of them. (It works for me, anyway. ☺) But it's something you should be aware of. Here's some code that demonstrates the difference between trim() and the regex approach:

String s = "\u0000\u0001\u0002\u0003\u0004\u0005\u0006\u0007"
         + "\u0008\u0009\n\u000B\u000C\r\u000E\u000F"
         + "\u0010\u0011\u0012\u0013\u0014\u0015\u0016\u0017"
         + "\u0018\u0019\u001A\u001B\u001C\u001D\u001E\u001F"
         + "\u0020\u00A0";
System.out.println(s.length());
System.out.println(s.trim().length());
System.out.println(s.replaceAll("\\s", "").length());

output:

34
1
28

The one remaining character in the second line of output is a non-breaking space (U+00A0, or "NBSP" henceforth). There are a lot more whitespace characters once you get outside the ASCII range, but the one you're most likely to encounter is the NBSP. Neither trim() nor the regex removed it, but watch what happens when you change the last line of code to this:

System.out.println(s.replaceAll("(?U)\\s", "").length());

...and run it under Java 7:

34
1
27

By adding the (?U), I turned on UNICODE_CHARACTER_CLASS mode, as mentioned by @tchrist in his comment. NBSP is a whitespace character, no matter what Character.isWhitespace() says, but that doesn't mean you'll always want to include it in your whitespace matches. That's why Guava (mentioned by @Sean) also includes a BREAKING_WHITESPACE CharMatcher.
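
The same mode can also be requested with a flag when compiling the pattern instead of embedding (?U) in the regex. A minimal sketch, assuming Java 7 or later; the class, field and method names are only for illustration:

```java
import java.util.regex.Pattern;

class UnicodeWhitespace {
    // Default \s: ASCII whitespace only.
    static final Pattern ASCII_WS = Pattern.compile("\\s");

    // Same regex with the flag that the embedded (?U) expression turns on.
    static final Pattern UNICODE_WS =
            Pattern.compile("\\s", Pattern.UNICODE_CHARACTER_CLASS);

    static String removeAll(Pattern p, String s) {
        return p.matcher(s).replaceAll("");
    }

    public static void main(String[] args) {
        String s = "a\u00A0b c"; // contains an NBSP and a plain space
        // The ASCII pattern leaves the NBSP in place; the Unicode one removes it.
        System.out.println(removeAll(ASCII_WS, s).length());
        System.out.println(removeAll(UNICODE_WS, s).length());
    }
}
```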

In sum, to choose the right tool for removing whitespace, you need to know exactly which whitespace characters you want to remove, and exactly where you want to remove them from. It's not all that complicated, but it's not as simple as legacy tools like trim() and StringTokenizer pretend it is, either.
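
If you would rather spell out "which characters" in code than in a regex, one option is to filter code points with a predicate. This is just a sketch, assuming Java 8 or later for streams; removeIf and the class name are hypothetical, and the predicate is whatever definition of whitespace you settle on.

```java
import java.util.function.IntPredicate;

class WhitespaceFilter {
    // Returns a copy of s without the code points that the predicate flags.
    static String removeIf(String s, IntPredicate unwanted) {
        StringBuilder sb = new StringBuilder(s.length());
        s.codePoints()
         .filter(cp -> !unwanted.test(cp))
         .forEach(sb::appendCodePoint);
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = "a\u00A0b\tc"; // NBSP between a and b, tab between b and c

        // Character.isWhitespace() does not count NBSP, so it survives here.
        System.out.println(removeIf(s, Character::isWhitespace).length());

        // Adding Character.isSpaceChar() also catches NBSP (a space separator).
        System.out.println(removeIf(s,
                cp -> Character.isWhitespace(cp) || Character.isSpaceChar(cp)).length());
    }
}
```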

trim() removes leading and trailing characters between ASCII 0 and ASCII 32. This happens to remove most ASCII whitespace, but it also removes all control characters. It does not remove any of them from inside the String.

for(int i=Character.MIN_CODE_POINT;i<=Character.MAX_CODE_POINT;i++)
  if(Character.isWhitespace(i))
    System.out.println(i);

prints

9 10 11 12 13 28 29 30 31 32 5760 6158 8192 8193 8194 8195 8196 8197 8198 8200 8201 8202 8232 8233 8287 12288

I was porting some code from C# to Java and needed to simulate XmlNode.OuterXml and XmlNode.InnerXml. I used Transformer for this, but for some reason it does not handle some whitespace correctly even if you turn indentation off. So my other choice was to post-process the string, which contained carriage returns, linefeeds and tabs, with a regex, using one of these two similar calls:

string = string.replaceAll("[\t\n\b\r\f]+ *", "");
string = string.replaceAll("[\\s]+ *", "");

Both remove tabs and line breaks along with any spaces that follow them; the second also removes standalone spaces, because \s matches a space as well. I hope this is at least a little bit relevant. The second one is probably the better choice.

String.replace(" ","");

(2) perhaps for performance tuning, other than that, I dunno
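
On the performance point: String.replaceAll() compiles its regex on every call, so if the same replacement runs many times it may be worth compiling the pattern once and reusing it. A sketch only, with made-up names; whether it makes a measurable difference depends on your workload, so measure before relying on it.

```java
import java.util.regex.Pattern;

class PrecompiledWhitespace {
    // Compiled once and reused; Pattern instances are safe to share between threads.
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    static String removeWhitespace(String s) {
        return WHITESPACE.matcher(s).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(removeWhitespace("H \ne\tl l \to   "));
    }
}
```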
