regular expression that extracts words from a string
I want to extract all words from a Java String.
A word can be written in any European language and contains only alphabetic characters, no spaces.
It can contain hyphens, though.
If you aren't tied to regular expressions, also have a look at BreakIterator, in particular the getWordInstance() method:
Word boundary analysis is used by search and replace functions, as well as within text editing applications that allow the user to select words with a double click. Word selection provides correct interpretation of punctuation marks within and following words. Characters that are not part of a word, such as symbols or punctuation marks, have word-breaks on both sides.
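As a rough sketch of the BreakIterator approach: the class and method names below (BreakIteratorWords, words) are made up for illustration, and keeping only segments that contain at least one letter is my own assumption about what counts as a "word". How hyphens and apostrophes are split depends on the JDK's boundary rules, so check that against your own input.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class BreakIteratorWords {
    // Walks the word boundaries and keeps the segments that contain a letter.
    static List<String> words(String text, Locale locale) {
        List<String> result = new ArrayList<>();
        BreakIterator it = BreakIterator.getWordInstance(locale);
        it.setText(text);
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String segment = text.substring(start, end);
            // Assumption: segments made only of spaces, digits or punctuation are not words.
            if (segment.codePoints().anyMatch(Character::isLetter)) {
                result.add(segment);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Unicode escapes stand for o-umlaut, sharp s and a-umlaut.
        System.out.println(words("Gr\u00f6\u00dfe z\u00e4hlt, oder?", Locale.GERMAN));
    }
}
```

The Locale argument selects locale-specific boundary rules; for most European languages the default rules should behave the same way.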
You can use a variation of (?<!\S)\S+(?!\S), i.e. any maximal sequence of non-whitespace characters.
- Negative lookarounds are used so that it can match "words" at the beginning and end of string
- Substitute your own character class for \S to look for something more specific (e.g. [A-Za-z-], etc.)
Here's a simple example to illustrate the idea, using [a-z-] as the alphabet character class:
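// needs: import java.util.regex.Matcher; import java.util.regex.Pattern;
String text = "--xx128736f-afasdf2137asdf-12387-kjs-23xx--";
// "alpha" is a placeholder that gets replaced by the actual character class
Pattern p = Pattern.compile(
    "(?<!alpha)alpha+(?!alpha)".replace("alpha", "[a-z-]")
);
Matcher m = p.matcher(text);
while (m.find()) {
    System.out.println(m.group());
}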
This prints:
--xx
f-afasdf
asdf-
-kjs-
xx--
References
- regular-expressions.info/Lookarounds, Character classes
But what should the alphabet be?
You may have to use Unicode character classes, e.g. \p{L} for any letter (stay put, I'm researching the topic right now).
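One possible answer to the alphabet question, as a sketch only: use \p{L} (any Unicode letter) and allow hyphens only between letter runs. The pattern and the names UnicodeWords and extract are my own illustration, not something from the question. It assumes accented letters arrive as precomposed characters; text that uses combining marks (\p{M}) would need the class extended.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnicodeWords {
    // One or more letters, optionally followed by "-letters" groups,
    // so leading, trailing and doubled hyphens are left out of the match.
    private static final Pattern WORD = Pattern.compile("\\p{L}+(?:-\\p{L}+)*");

    static List<String> extract(String text) {
        List<String> words = new ArrayList<>();
        Matcher m = WORD.matcher(text);
        while (m.find()) {
            words.add(m.group());
        }
        return words;
    }

    public static void main(String[] args) {
        // Unicode escapes stand for U-umlaut and c-cedilla.
        System.out.println(extract("\u00dcberraschung, fa\u00e7ade well-known 123 end--"));
        // prints [Überraschung, façade, well-known, end]
    }
}
```

Unlike the [a-z-] class above, this does not treat a bare run of hyphens as a word, which may or may not be what you want.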
This will match a single word:
`([^\s]+)`
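A quick sketch of what that pattern does when applied repeatedly with find(); the class and method names are hypothetical. Note that [^\s] is the same as \S, so attached punctuation stays part of each match.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NonWhitespaceTokens {
    // Collects every maximal run of non-whitespace characters.
    static List<String> tokens(String text) {
        List<String> result = new ArrayList<>();
        Matcher m = Pattern.compile("([^\\s]+)").matcher(text);
        while (m.find()) {
            result.add(m.group(1)); // group 1 is the parenthesised run
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(tokens("well-known words, twice!"));
        // prints [well-known, words,, twice!]
    }
}
```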