
Regex Word splitting in C#

I know similar questions have been asked before, but I can't find one close enough to mine to help me out :). Essentially, I want to split up a string that contains a bunch of words, and I don't want to return any characters that are not part of a word (ignoring those characters is the key problem I am struggling with). This is how I define the problem:

  1. What constitutes a word is a string of any character a-zA-Z only (no numbers or anything else)

  2. In between any word, there can be any number of random other characters

  3. I want to get back a string[] containing only the words

eg: text: "apple^&**^orange1247pear"

I want to return: apple, orange, pear in an array.

The closest I have found I suppose is this:

Regex.Split("apple^orange7pear",@"([a-zA-Z]*)")

Which splits out the apple/orange/pear, but also returns a bunch of other junk and blank strings.

Anyone know how to stop the split function from returning certain parts of the string, or is that not possible?

Thanks in advance for any help you give me :)


Split should match the tokens between your words. In your regex you've added a group around the word, so it is included in the result, but that isn't desired in this case. Note that this regex matches anything besides valid words - anything that isn't an ASCII letter:

string[] words = Regex.Split(str, "[^a-zA-Z]+");

Another option is to match the words directly:

MatchCollection matches = Regex.Matches(str, "[a-zA-Z]+");
string[] words2 = matches.Cast<Match>().Select(m => m.Value).ToArray();

The second option is probably clearer, and will not include blank elements on the start or end of the array.
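To make the difference concrete, here is a minimal sketch comparing the two approaches on input that starts and ends with non-letters. The class and method names (WordSplitDemo, SplitWords, MatchWords) and the sample string are only illustrative, not part of the original answer.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class WordSplitDemo
{
    // Splits on runs of non-letters. If the input starts or ends with
    // a separator, the result contains an empty string at that edge.
    public static string[] SplitWords(string input) =>
        Regex.Split(input, "[^a-zA-Z]+");

    // Matches runs of ASCII letters directly, so no empty strings appear.
    public static string[] MatchWords(string input) =>
        Regex.Matches(input, "[a-zA-Z]+")
             .Cast<Match>()
             .Select(m => m.Value)
             .ToArray();

    public static void Main()
    {
        const string text = "^&apple^&**^orange1247pear!!";
        Console.WriteLine(string.Join("|", SplitWords(text))); // |apple|orange|pear|
        Console.WriteLine(string.Join("|", MatchWords(text))); // apple|orange|pear
    }
}
```

If you prefer Split anyway, filtering the result with something like .Where(s => s.Length > 0) should drop the empty edge elements.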


var splits = Regex.Split("aaa $$$bbb ccc", @"[^A-Za-z]+");

But to include non-latin letters, I would use this:

var splits = Regex.Split("aaa $$$bbb ccc", @"\P{L}+");
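A quick sketch of how the two patterns differ on accented input. The sample string and the names UnicodeSplitDemo, AsciiWords and UnicodeWords are made up for illustration; the accented letters are written as precomposed \u escapes so the result does not depend on the source file encoding.

```csharp
using System;
using System.Text.RegularExpressions;

public static class UnicodeSplitDemo
{
    // ASCII-only: any non a-z/A-Z character acts as a separator.
    public static string[] AsciiWords(string input) =>
        Regex.Split(input, "[^A-Za-z]+");

    // \P{L} matches anything that is NOT a Unicode letter,
    // so accented and non-Latin letters stay inside the words.
    public static string[] UnicodeWords(string input) =>
        Regex.Split(input, @"\P{L}+");

    public static void Main()
    {
        // "café $$$naïve 42 über" with precomposed accented letters
        const string text = "caf\u00E9 $$$na\u00EFve 42 \u00FCber";
        Console.WriteLine(string.Join("|", AsciiWords(text)));   // caf|na|ve|ber
        Console.WriteLine(string.Join("|", UnicodeWords(text)));
    }
}
```

One caveat: \p{L} does not cover combining marks, so text stored in decomposed form (a base letter followed by a separate combining accent) would still be split at the accent unless \p{M} is allowed as well.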


Try this:

Regex.Matches("kalle  kula(/()&//()nisse8978971", @"[A-Za-z]+")

Matches() collects only the words, whereas Split() divides the string at each match, which is not what you want here.


The second option Kobi listed is better and easier to control. I use the following regular expression to locate common entities such as words, numbers, and email addresses in a string.

var regex = new Regex(@"[\p{L}\p{N}\p{M}]+(?:[-.'´_@][\p{L}\p{N}\p{M}]+)*", RegexOptions.Compiled);
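For reference, a small sketch of how a pattern like this could be used with Matches(). The class name EntityDemo, the method FindEntities and the sample sentence are hypothetical. Inside a character class `|` is a literal pipe rather than alternation, so the sketch writes both classes without it.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class EntityDemo
{
    // Letters, digits and combining marks, optionally joined by one of
    // - . ' ´ _ @ so that "it's", "555-1234" or an email stay in one piece.
    private static readonly Regex Entity = new Regex(
        @"[\p{L}\p{N}\p{M}]+(?:[-.'´_@][\p{L}\p{N}\p{M}]+)*",
        RegexOptions.Compiled);

    public static string[] FindEntities(string input) =>
        Entity.Matches(input)
              .Cast<Match>()
              .Select(m => m.Value)
              .ToArray();

    public static void Main()
    {
        const string text = "Mail john.doe@example.com or call 555-1234, it's easy.";
        foreach (var entity in FindEntities(text))
            Console.WriteLine(entity);
    }
}
```

Note that this keeps digits, which goes beyond the letters-only requirement in the question, so it fits better when numbers and addresses should survive as tokens.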