Splitting a string at all whitespace

I need to split a string at all whitespace; the result should contain ONLY the words themselves.

How can I do this in vb.net?

Tabs, newlines, etc. must all count as separators!

This has been bugging me for quite a while now, as the syntax highlighter I made completely ignores the first word in each line except for the very first line.


String.Split() (no parameters) does split on all whitespace (including LF/CR)


Try this:

Regex.Split("your string here", "\s+")


If you want to avoid regex, you can do it like this:

"Lorem ipsum dolor sit amet, consectetur adipiscing elit"
    .Split()
    .Where(x => x != string.Empty)

Visual Basic equivalent:

"Lorem ipsum dolor sit amet, consectetur adipiscing elit" _
    .Split() _
    .Where(Function(X$) X <> String.Empty)

The Where() is important since, if your string has multiple white space characters next to each other, it removes the empty strings that will result from the Split().
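
To make the difference concrete, here is a minimal C# sketch (the class and method names are mine, purely for illustration). It shows the raw Split() result, the filtered result, and an alternative that I believe is equivalent for this purpose: passing a null separator array together with StringSplitOptions.RemoveEmptyEntries, which also splits at whitespace but drops the empty strings during the split.

```csharp
using System;
using System.Linq;

// Hypothetical helper names, for illustration only.
static class WhitespaceSplitDemo
{
    // Split() with no arguments splits at every single whitespace character,
    // so two adjacent whitespace characters leave one empty string between them.
    public static string[] SplitRaw(string text) => text.Split();

    // Split, then filter out the empty strings (the approach from this answer).
    public static string[] SplitAndFilter(string text) =>
        text.Split().Where(x => x != string.Empty).ToArray();

    // Alternative: a null separator array also means "split at whitespace",
    // and RemoveEmptyEntries discards the empty strings while splitting.
    public static string[] SplitRemoveEmpty(string text) =>
        text.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);
}
```

For the input "a \tb" (a space followed by a tab between the letters), SplitRaw returns three elements ("a", "", "b"), while the other two methods return just "a" and "b".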

At the time of writing, the currently accepted answer (https://stackoverflow.com/a/1563000/49241) does not take this into account.


String.Split() splits at every single whitespace character, so whenever words are separated by more than one whitespace character the result will contain empty strings. The Regex solution Ruben Farias has given is the correct way to do it. I have upvoted his answer, but I want to give a small addition, dissecting the regex:

\s is a character class that matches all whitespace characters.

In order to split the string correctly when it contains multiple whitespace characters between words, we need to add a quantifier (or repetition operator) to the specification to match all whitespace between words. The correct quantifier to use in this case is +, meaning "one or more" occurrences of a given specification. While the syntax "\s+" is sufficient here, I prefer the more explicit "[\s]+".
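
A small C# sketch of the difference the quantifier makes (the class and method names are hypothetical, chosen for this example). It also illustrates one caveat that is easy to miss: if the input starts or ends with whitespace, Regex.Split returns an empty string at that end, so trimming first is a reasonable precaution.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper names, for illustration only.
static class RegexSplitDemo
{
    // "\s" without a quantifier splits at each whitespace character separately,
    // so adjacent whitespace characters leave empty strings in the result.
    public static string[] SplitSingle(string text) => Regex.Split(text, @"\s");

    // "\s+" treats a whole run of whitespace as one delimiter.
    public static string[] SplitRuns(string text) => Regex.Split(text, @"\s+");

    // Trimming first avoids an empty first/last element when the input
    // begins or ends with whitespace.
    public static string[] SplitWords(string text) => Regex.Split(text.Trim(), @"\s+");
}
```

With the input "  a b\r\n", SplitRuns returns "", "a", "b", "" whereas SplitWords returns only "a" and "b".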


So, after seeing Adam Ralph's post, I suspected his solution would be faster than the Regex solution. Just thought I'd share the results of my testing, since I did find it was faster.


There are really two factors at play (ignoring system variables): the number of sub-strings extracted (determined by the number of delimiters) and the total string length. The very simple scenario I plotted uses "A" as the sub-string, delimited by two whitespace characters (a space followed by a tab). This accentuates the effect of the number of sub-strings extracted. I went ahead and did some multi-variable testing to arrive at the following general equations for my machine.

Regex()
t = (28.33*SSL + 572)(SSN/10^6)

Split().Where()
t = (6.23*SSL + 250)(SSN/10^6)

where t is the execution time in milliseconds, SSL is the average sub-string length, and SSN is the number of sub-strings delimited in the string.

These equations can also be written as

t = (28.33*SL + 572*SSN)/10^6

and

t = (6.23*SL + 250*SSN)/10^6

where SL is total string length (SL = SSL * SSN)
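
As a worked example, the equations can be evaluated directly. The sketch below is only a calculator for the fitted model, with the coefficients taken from the equations above; the class and method names are made up for illustration, and the numbers only describe the machine the measurements were taken on.

```csharp
using System;

// Hypothetical names; the coefficients are the fitted values quoted above.
static class TimingModel
{
    // t = (perChar * SSL + perSubstring) * (SSN / 10^6), with t in milliseconds.
    public static double PredictMs(double perChar, double perSubstring, double ssl, double ssn) =>
        (perChar * ssl + perSubstring) * (ssn / 1e6);

    // Regex(): t = (28.33*SSL + 572)(SSN/10^6)
    public static double RegexMs(double ssl, double ssn) => PredictMs(28.33, 572, ssl, ssn);

    // Split().Where(): t = (6.23*SSL + 250)(SSN/10^6)
    public static double SplitWhereMs(double ssl, double ssn) => PredictMs(6.23, 250, ssl, ssn);
}
```

For one million one-character sub-strings (SSL = 1, SSN = 10^6) the model predicts about 600 ms for Regex() and about 256 ms for Split().Where(). With SSL = 10 and the same SSN it predicts about 855 ms and 312 ms.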

Conclusion: The Split().Where() solution is faster than Regex(). The number of sub-strings is the major factor, while string length plays a minor role. Comparing the coefficients, Split().Where() is about 4.5x faster per character of string length (28.33 vs. 6.23) and about 2.3x faster per sub-string (572 vs. 250).



Here's my testing code (probably way more material than necessary, but it's set up for getting the multi-variable data I talked about):

using System;
using System.Linq;
using System.Diagnostics;
using System.Text.RegularExpressions;
using System.Windows.Forms;
namespace ConsoleApplication1
{
    class Program
    {
        public enum TestMethods {regex, split};
        [STAThread]
        static void Main(string[] args)
        {
            //Compare TestMethod execution times and output result information
            //to the console at runtime and to the clipboard at program finish (so that data is ready to paste into analysis environment)
            #region Config_Variables
            //Choose test method from TestMethods enumerator (regex or split)
            TestMethods TestMethod = TestMethods.split;
            //Configure RepetitionString
            String RepetitionString =  string.Join(" \t", Enumerable.Repeat("A",100));
            //Configure initial and maximum count of string repetitions (final count may not equal max)
            int RepCountInitial = 100;int RepCountMax = 1000 * 100;

            //Step increment to next RepCount (calculated as 20% increase from current value)
            Func<int, int> Step = x => (int)Math.Round(x / 5.0, 0);
            //Execution count used to determine average speed (calculated to adjust down to 1 execution at long execution times)
            Func<double, int> ExecutionCount = x => (int)(1 + Math.Round(500.0 / (x + 1), 0));
            #endregion

            #region NonConfig_Variables
            string s; 
            string Results = "";
            string ResultInfo; 
            double ResultTime = 1;
            #endregion

            for (int RepCount = RepCountInitial; RepCount < RepCountMax; RepCount += Step(RepCount))
            {
                s = string.Join("", Enumerable.Repeat(RepetitionString, RepCount));
                ResultTime = Test(s, ExecutionCount(ResultTime), TestMethod);
                ResultInfo = ResultTime.ToString() + "\t" + RepCount.ToString() + "\t" + ExecutionCount(ResultTime).ToString() + "\t" + TestMethod.ToString();
                Console.WriteLine(ResultInfo); 
                Results += ResultInfo + "\r\n";
            }
            Clipboard.SetText(Results);
        }
        public static double Test(string s, int iMax, TestMethods Method)
        {
            switch (Method)
            {
                case TestMethods.regex:
                    return Math.Round(RegexRunTime(s, iMax),2);
                case TestMethods.split:
                    return Math.Round(SplitRunTime(s, iMax),2);
                default:
                    return -1;
            }
        }
        private static double RegexRunTime(string s, int iMax)
        {
            Stopwatch sw = new Stopwatch();
            sw.Restart();
            for (int i = 0; i < iMax; i++)
            {
                System.Collections.Generic.IEnumerable<string> ens = Regex.Split(s, @"\s+");
            }
            sw.Stop();
            return Math.Round(sw.ElapsedMilliseconds / (double)iMax, 2);
        }
        private static double SplitRunTime(string s,int iMax)
        {
            Stopwatch sw = new Stopwatch();
            sw.Restart();
            for (int i = 0; i < iMax; i++)
            {
                System.Collections.Generic.IEnumerable<string> ens = s.Split().Where(x => x != string.Empty);
            }
            sw.Stop();
            return Math.Round(sw.ElapsedMilliseconds / (double)iMax, 2);
        }
    }
}
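
One caveat when reading these timings: LINQ's Where() is lazily evaluated, and SplitRunTime never enumerates its result, so that loop appears to time Split() plus the creation of the iterator rather than the filtering itself. The sketch below (hypothetical names, not part of the original test code) counts predicate calls to show that nothing is filtered until the sequence is enumerated. Materializing the result inside the loop, for example with ToArray(), would probably increase the Split().Where() times; that variant was not measured here.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical demo class, for illustration only.
static class DeferredWhereDemo
{
    // Returns how often the predicate had run before enumeration
    // and after the query was materialized with ToArray().
    public static (int before, int after) CountPredicateCalls(string text)
    {
        int calls = 0;
        IEnumerable<string> query = text.Split().Where(x =>
        {
            calls++;                    // count every predicate invocation
            return x != string.Empty;
        });

        int before = calls;             // the query exists, but nothing has been filtered yet
        _ = query.ToArray();            // enumeration runs the predicate once per element
        return (before, calls);
    }
}
```

For the input "a \tb", Split() produces three elements, so the predicate has run zero times before enumeration and three times afterwards.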


I ended up using the solution given by Adam Ralph, together with the VB.NET comment below by P57, but with one odd exception: I found I had to add .ToList.ToArray on the end.

Like so:

.Split().Where(Function(x) x <> String.Empty).ToList.ToArray

Without that, I kept getting "Unable to cast object of type 'WhereArrayIterator`1[System.String]' to type 'System.String[]'."
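
The error makes sense once you look at what Where() returns: a lazy iterator, not an array, so it cannot be cast to String(). As far as I can tell, ToArray() on its own is enough to fix this and the intermediate ToList step is redundant. Here is a minimal sketch in C#, to match the test code earlier in the thread (class and method names are hypothetical); in VB.NET the same idea is simply .ToArray() at the end of the chain.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical demo class, for illustration only.
static class MaterializeDemo
{
    // The result of Where() is an iterator object, not a string[].
    public static bool QueryIsArray(string text)
    {
        IEnumerable<string> query = text.Split().Where(x => x != string.Empty);
        return query is string[];
    }

    // ToArray() enumerates the query and copies the words into a real array.
    public static string[] Words(string text) =>
        text.Split().Where(x => x != string.Empty).ToArray();
}
```

QueryIsArray returns false for any input, which is why the cast fails, while Words returns an actual string array that can be assigned to a String() variable.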


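Dim words As String = "This is a list of words, with: a bit of punctuation" + _
                          vbTab + "and a tab character." + vbNewLine
' vbNewLine is two characters (CR + LF), so it cannot be represented by one Char;
' list both explicitly and drop the empty entries left by adjacent separators.
Dim separators As Char() = New Char() {" "c, ControlChars.Tab, ControlChars.Cr, ControlChars.Lf}
Dim split As String() = words.Split(separators, StringSplitOptions.RemoveEmptyEntries)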