regular expression that extracts words from a string

I want to extract all words from a Java String.

A word can be written in any European language. It contains no spaces, only alphabetic characters.

It can contain hyphens, though.


If you aren't tied to regular expressions, also have a look at BreakIterator, in particular the getWordInstance() method:

Word boundary analysis is used by search and replace functions, as well as within text editing applications that allow the user to select words with a double click. Word selection provides correct interpretation of punctuation marks within and following words. Characters that are not part of a word, such as symbols or punctuation marks, have word-breaks on both sides.
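
Here is a minimal sketch of that approach. The class name `BreakIteratorWords` and the helper `words` are made up for illustration. Because the iterator also reports boundaries around spaces and punctuation, the sketch keeps only the segments that contain at least one letter. How hyphenated words are split depends on the iterator's rules, so check that against your own input.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class BreakIteratorWords {
    // Hypothetical helper: returns the segments between word boundaries
    // that contain at least one letter.
    static List<String> words(String text, Locale locale) {
        BreakIterator it = BreakIterator.getWordInstance(locale);
        it.setText(text);
        List<String> result = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String segment = text.substring(start, end);
            // Spaces and punctuation form their own segments; skip them.
            if (segment.codePoints().anyMatch(Character::isLetter)) {
                result.add(segment);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Unicode escapes keep this source file ASCII-only.
        System.out.println(words("Gr\u00fc\u00dfe, caf\u00e9 au lait!", Locale.GERMAN));
    }
}
```

For plain input such as "Hello, world!" this returns the two segments Hello and world.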


You can use a variation of (?<!\S)\S+(?!\S), i.e. any maximal sequence of non-whitespace characters.

  • Negative lookarounds are used so that it can match "words" at the beginning and end of the string
  • Substitute your own character class for \S to look for something more specific
    • (e.g. [A-Za-z-])

Here's a simple example to illustrate the idea, using [a-z-] as the alphabet character class:

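    // Needs java.util.regex.Pattern and java.util.regex.Matcher.
    String text = "--xx128736f-afasdf2137asdf-12387-kjs-23xx--";
    // Build the pattern by substituting the character class for "alpha".
    Pattern p = Pattern.compile(
        "(?<!alpha)alpha+(?!alpha)".replace("alpha", "[a-z-]")
    );
    Matcher m = p.matcher(text);
    // Print every maximal run of characters from the class.
    while (m.find()) {
        System.out.println(m.group());
    }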

This prints:

    --xx
    f-afasdf
    asdf-
    -kjs-
    xx--

References

  • regular-expressions.info/Lookarounds, Character classes

But what should the alphabet be?

You may have to use Unicode character classes, such as \p{L} for any letter (I'm still researching this).
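
One possible alphabet, as a sketch rather than a definitive answer: `\p{L}` matches any Unicode letter, which covers accented European characters. The class name `UnicodeWords` and the method `extract` are made up for illustration. The sketch also assumes that a hyphen counts only when it sits between two letters, which is stricter than the [a-z-] class above. Text that uses decomposed accents (a base letter followed by a combining mark) would additionally need `\p{M}` in the class.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UnicodeWords {
    // One or more letters, optionally followed by hyphen-separated letter runs.
    // Assumption: leading, trailing and doubled hyphens are not part of a word.
    private static final Pattern WORD = Pattern.compile("\\p{L}+(?:-\\p{L}+)*");

    // Hypothetical helper: returns every match of WORD in the text, in order.
    static List<String> extract(String text) {
        List<String> words = new ArrayList<>();
        Matcher m = WORD.matcher(text);
        while (m.find()) {
            words.add(m.group());
        }
        return words;
    }

    public static void main(String[] args) {
        // Unicode escapes keep this source file ASCII-only.
        System.out.println(extract("Jean-Pierre a mang\u00e9 12 cr\u00eapes -- d\u00e9licieux!"));
    }
}
```

For that input the method returns five words: Jean-Pierre, a, mangé, crêpes and délicieux. The digits and the stray double hyphen are skipped.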


This will match a single run of non-whitespace characters, which you can treat as a word:

`([^\s]+)`
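
A quick sketch of what that pattern does in practice, with a made-up class name `NonWhitespaceRuns` and helper `runs`. Note that punctuation attached to a word stays in the match, so this suits whitespace-delimited tokens better than strictly alphabetic words.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NonWhitespaceRuns {
    // [^\s]+ is equivalent to \S+: one or more non-whitespace characters.
    private static final Pattern RUN = Pattern.compile("[^\\s]+");

    // Hypothetical helper: returns every non-whitespace run in the text, in order.
    static List<String> runs(String text) {
        List<String> result = new ArrayList<>();
        Matcher m = RUN.matcher(text);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        // The comma, exclamation mark and parentheses remain part of the matches.
        System.out.println(runs("Hello, world! (really)"));
    }
}
```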