Counting words with regular expression "\S+"
Why does wordCount end up being 1, rather than 5, in the code below?
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WordCount {
    public static void main(String[] args) {
        final Pattern wordCountRegularExpression = Pattern.compile("\\S+");
        final Matcher matcher = wordCountRegularExpression
                .matcher("one two three four five");
        int wordCount = 0;
        while (matcher.find()) {
            wordCount++;
        }
        System.out.println("wordCount: " + wordCount);
    }
}
Doesn't the pattern "\S+" match a word, since it means one or more non-space characters?
This does work by the way:
final Pattern wordCountRegularExpression = Pattern.compile("\\b\\w+\\b");
But I still don't understand why the original code doesn't work.
Doesn't the pattern "\S+" match a word, since it means one or more non-space characters?
Yes.
Using
import java.util.regex.*;
in Java 7, the following pattern:
Pattern.compile("\\S+");
matches runs of non-whitespace characters, so it counts words, not spaces.
So it should return 5 for the input "one two three four five", provided the four separators are ordinary spaces.
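It depends on what you're using to separate the words. When I copy the code from your question into my editor, I see plain old spaces (U+0020), but when I view the page source I see non-breaking spaces (U+00A0). By default, Java's \s and \S don't treat NBSP as a whitespace character, so if the string literal really contains NBSPs, the whole string is a single run of non-whitespace characters and "\S+" matches it exactly once. That would explain a wordCount of 1.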
Now the question is: why am I seeing NBSPs in the string literal but nowhere else? And why are they being converted to spaces when I copy and paste? Is anyone else seeing that?
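If you want to check the NBSP theory yourself, here is a minimal sketch. The class name NbspWordCount and the helper countWords are made up for illustration, and the NBSPs are written as \u00A0 escapes so they survive copy and paste. It assumes Java 7 or later, where Pattern.UNICODE_CHARACTER_CLASS makes \s and \S follow the Unicode White_Space property, which includes U+00A0.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NbspWordCount {
    // Hypothetical helper: counts matches of \S+ in text,
    // with the pattern compiled using the given flags (0 for defaults).
    static int countWords(String text, int flags) {
        Matcher matcher = Pattern.compile("\\S+", flags).matcher(text);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // Same words, separated by ordinary spaces (U+0020).
        String spaces = "one two three four five";
        // Same words, separated by non-breaking spaces (U+00A0).
        String nbsps = "one\u00A0two\u00A0three\u00A0four\u00A0five";

        // Default \S treats NBSP as a non-whitespace character.
        System.out.println("spaces, default flags: " + countWords(spaces, 0));
        System.out.println("NBSPs, default flags: " + countWords(nbsps, 0));
        // With UNICODE_CHARACTER_CLASS, NBSP counts as whitespace.
        System.out.println("NBSPs, Unicode classes: "
                + countWords(nbsps, Pattern.UNICODE_CHARACTER_CLASS));
    }
}
```

With default flags the NBSP string is one long "word"; with the Unicode flag it splits into five. If you can't change the flags, replacing the NBSPs in the input with ordinary spaces before matching should work as well.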