regular expression that extracts words from a string

2023-01-05 06:53 Asked by:

I want to extract all words from a Java String.

A word can be written in any European language and contains no spaces, only alphabetic characters.

It can contain hyphens, though.

If you aren't tied to regular expressions, also have a look at BreakIterator, in particular the getWordInstance() method:

Word boundary analysis is used by search and replace functions, as well as within text editing applications that allow the user to select words with a double click. Word selection provides correct interpretation of punctuation marks within and following words. Characters that are not part of a word, such as symbols or punctuation marks, have word-breaks on both sides.
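As a rough sketch of the BreakIterator approach, the snippet below walks the word boundaries and keeps only the segments that contain at least one letter, so whitespace and punctuation segments are dropped. The class name `BreakIteratorWords` and the helper `words` are made up for this illustration; only `BreakIterator.getWordInstance`, `setText`, `first` and `next` come from the JDK. How hyphenated words are segmented depends on the iterator's rules, so check that on your own input.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class BreakIteratorWords {
    // Hypothetical helper: returns the boundary-delimited segments
    // that contain at least one letter.
    static List<String> words(String text, Locale locale) {
        BreakIterator it = BreakIterator.getWordInstance(locale);
        it.setText(text);
        List<String> result = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String segment = text.substring(start, end);
            // Whitespace and punctuation form their own segments; skip them.
            if (segment.codePoints().anyMatch(Character::isLetter)) {
                result.add(segment);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(words("Hello, world!", Locale.ENGLISH));
    }
}
```

The letter check is one possible filter; a different rule (for example, also keeping digits) would work the same way.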

You can use a variation of (?<!\S)\S+(?!\S), i.e. any maximal sequence of non-whitespace characters.

- Negative lookarounds are used so that it can also match "words" at the beginning and end of the string.
- Substitute your own character class for \S to look for something more specific (e.g. [A-Za-z-]).

Here's a simple example to illustrate the idea, using [a-z-] as the alphabet character class:

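    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    String text = "--xx128736f-afasdf2137asdf-12387-kjs-23xx--";
    // Builds (?<![a-z-])[a-z-]+(?![a-z-]) by substituting the class for "alpha"
    Pattern p = Pattern.compile(
        "(?<!alpha)alpha+(?!alpha)".replace("alpha", "[a-z-]")
    );
    Matcher m = p.matcher(text);
    // Print every maximal run of characters from the class
    while (m.find()) {
        System.out.println(m.group());
    }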

This prints:

--xx
f-afasdf
asdf-
-kjs-
xx--

References

regular-expressions.info/Lookarounds, Character classes

But what should the alphabet be?

You may have to use Unicode character classes for this (I'm still researching the topic).
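One possible answer, offered as a sketch rather than a definitive alphabet: Java's `Pattern` supports the Unicode category `\p{L}` (any letter), which covers accented European letters. The pattern below matches runs of letters joined by single hyphens. The class name `UnicodeWords` and the method `extract` are hypothetical, and the sketch assumes the text uses precomposed characters; decomposed text with combining marks would need `\p{M}` added to the class.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnicodeWords {
    // One or more letter runs joined by single hyphens.
    // \p{L} matches any Unicode letter, not just A-Z.
    private static final Pattern WORD =
        Pattern.compile("\\p{L}+(?:-\\p{L}+)*");

    // Hypothetical helper: collects every match in order of appearance.
    static List<String> extract(String text) {
        List<String> words = new ArrayList<>();
        Matcher m = WORD.matcher(text);
        while (m.find()) {
            words.add(m.group());
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(extract("Größe, naïve-ish garçon -- x2y"));
    }
}
```

Unlike the [a-z-] example above, this pattern does not allow leading, trailing, or doubled hyphens inside a word, and digits split a word in two. Whether that is the right behavior depends on what counts as a word in your data.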

This will match a single word:

`([^\s]+)`
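To see what that pattern actually captures, here is a small sketch that applies it repeatedly with `find()`. The class name `NonWhitespaceTokens` and the method `tokens` are invented for the example. Note that because the pattern only excludes whitespace, punctuation attached to a word stays in the match.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NonWhitespaceTokens {
    // Hypothetical helper: returns every maximal run of non-whitespace characters.
    static List<String> tokens(String text) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile("([^\\s]+)").matcher(text);
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        // Prints [Hello,, well-known, world!] (the comma and "!" are kept)
        System.out.println(tokens("Hello, well-known world!"));
    }
}
```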
