Splitting strings through regular expressions by punctuation and whitespace etc. in Java

2023-04-04 00:15

I have this text file that I read into a Java application and then count the words in it line by line. Right now I am splitting the lines into words by a

String.split("[\\p{Punct}\\s+]")

But I know I am missing out on some words from the text file. For example, the word "can't" should be divided into two words "can" and "t".

Commas and other punctuation should be completely ignored and considered as whitespace. I have been trying to understand how to form a more precise Regular Expression to do this but I am a novice when it comes to this so I need some help.

What could be a better regex for the purpose I have described?

You have one small mistake in your regex. Try this:

String[] Res = Text.split("[\\p{Punct}\\s]+");

[\\p{Punct}\\s]+ moves the + from inside the character class to the outside. Otherwise the + is just one more literal character to split on, and several split characters in a row are not combined into a single delimiter.

So I get for this code

String Text = "But I know. For example, the word \"can\'t\" should";

String[] Res = Text.split("[\\p{Punct}\\s]+");
System.out.println(Res.length);
for (String s:Res){
    System.out.println(s);
}

this result

10
But
I
know
For
example
the
word
can
t
should

Which should meet your requirement.
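
To see what the position of the + changes, here is a minimal sketch; the class name, method names and sample sentence are made up for illustration. With the + inside the brackets every separator character splits on its own, so a period followed by a space leaves an empty string in the result.

```java
import java.util.Arrays;

class PlusPlacementDemo {
    // + inside the class: every single punctuation/whitespace char is its own delimiter
    static String[] plusInside(String text) {
        return text.split("[\\p{Punct}\\s+]");
    }

    // + outside the class: a run of such chars is treated as one delimiter
    static String[] plusOutside(String text) {
        return text.split("[\\p{Punct}\\s]+");
    }

    public static void main(String[] args) {
        String text = "I know. For example";
        System.out.println(Arrays.toString(plusInside(text)));  // [I, know, , For, example]
        System.out.println(Arrays.toString(plusOutside(text))); // [I, know, For, example]
    }
}
```

The empty string between "know" and "For" is what inflates a word count when the + sits inside the brackets.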

As an alternative you can use

String[] Res = Text.split("\\P{L}+");

\\P{L} matches any character that is not a Unicode code point with the property "Letter".
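
A small sketch of where the two patterns differ, assuming default flags (so \p{Punct} covers only ASCII punctuation); the class name, method names and sample strings are hypothetical. A typographic apostrophe (U+2019) is not ASCII punctuation, and digits are not letters, so the results diverge on such input.

```java
import java.util.Arrays;

class NonLetterSplitDemo {
    // Splits on every run of characters that are not Unicode letters
    static String[] splitOnNonLetters(String text) {
        return text.split("\\P{L}+");
    }

    // Splits on runs of ASCII punctuation and whitespace only
    static String[] splitOnPunctAndSpace(String text) {
        return text.split("[\\p{Punct}\\s]+");
    }

    public static void main(String[] args) {
        String curly = "can\u2019t wait"; // "can't" written with a typographic apostrophe
        System.out.println(splitOnPunctAndSpace(curly).length); // 2
        System.out.println(splitOnNonLetters(curly).length);    // 3
        // Digits are not letters, so "101" disappears together with the spaces
        System.out.println(Arrays.toString(splitOnNonLetters("room 101 is free"))); // [room, is, free]
    }
}
```

Which behaviour is preferable depends on whether numbers should count as words in your text.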

There's a predefined non-word character class, \W; see Pattern.

String line = "Hello! this is a line. It can't be hard to split into \"words\", can it?";
String[] words = line.split("\\W+");
for (String word : words) System.out.println(word);

gives

Hello
this
is
a
line
It
can
t
be
hard
to
split
into
words
can
it
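
One thing to watch when counting rather than printing: if a line starts with a delimiter, split returns an empty string as the first element, and an empty line yields a one-element array containing "". A possible counting helper that skips such tokens is sketched below; the class and method names are invented for this example.

```java
class WordCountDemo {
    // Counts the non-empty tokens produced by splitting on runs of non-word characters
    static int countWords(String line) {
        int count = 0;
        for (String token : line.split("\\W+")) {
            if (!token.isEmpty()) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        String line = "\"words\" come first";
        System.out.println(line.split("\\W+").length); // 4, because of the leading empty string
        System.out.println(countWords(line));          // 3
        System.out.println(countWords(""));            // 0
    }
}
```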

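Well, seeing you want to count can't as two words, try matching the words themselves with a Matcher rather than splitting on them (split would return the text between the matches), using the pattern

"\\b\\w+?\\b"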

http://www.regular-expressions.info/wordboundaries.html
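
Note that a word-boundary pattern like this describes the words themselves, so handing it to split gives back the text between the words. A sketch that collects the matches with Pattern and Matcher instead is shown here; the class name, the constant and the words method are hypothetical, and the greedy \w+ is used because it matches the same whole words.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordMatchDemo {
    // A run of word characters delimited by word boundaries
    private static final Pattern WORD = Pattern.compile("\\b\\w+\\b");

    // Collects every match of WORD in the line, in order of appearance
    static List<String> words(String line) {
        List<String> result = new ArrayList<>();
        Matcher matcher = WORD.matcher(line);
        while (matcher.find()) {
            result.add(matcher.group());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(words("It can't be hard")); // [It, can, t, be, hard]
    }
}
```

Because this approach matches instead of splitting, it never produces empty tokens, so the size of the list is the word count.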

Try:

line.split("[\\.,\\s!;?:\"]+");
or         "[\\.,\\s!;?:\"']+"

This character class matches any one of these characters: . , ! ; ? : " ' or whitespace (the \\s stands for whitespace; the backslashes are escapes and are not matched themselves). The + causes several such characters in a row to be counted as one delimiter.

That should give you mostly sufficient accuracy. More precise regexes would need more information about the type of text you need to parse, because ' can be a word delimiter as well. Most punctuation that delimits words sits next to whitespace, so matching on [\\s]+ would be a close approximation as well (but it gives the wrong count on short quotations like: She said:"no".).
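
The short-quotation case can be checked with a small sketch like the following; the class and method names are made up for illustration, and the second pattern is the one with the apostrophe from above.

```java
class DelimiterChoiceDemo {
    // Splits on whitespace only: punctuation stays glued to the neighbouring words
    static int countByWhitespace(String line) {
        return line.split("\\s+").length;
    }

    // Splits on the explicit list of punctuation characters plus whitespace
    static int countByPunctuationList(String line) {
        return line.split("[\\.,\\s!;?:\"']+").length;
    }

    public static void main(String[] args) {
        String line = "She said:\"no\".";
        System.out.println(countByWhitespace(line));      // 2
        System.out.println(countByPunctuationList(line)); // 3
    }
}
```

Both helpers assume the line starts with a word; a line that begins with a delimiter would add an empty first element to the array.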

If you come here from Kotlin: sentence.split(Regex("[\\p{Punct}\\s]+"))
