开发者

LINQ - Splitting up a string with maximum length, but not chopping words apart

I have a simple LINQ Extension Method...

    public sta开发者_如何学JAVAtic IEnumerable<string> SplitOnLength(this string input, int length)
    {
        int index = 0;
        while (index < input.Length)
        {
            if (index + length < input.Length)
                yield return input.Substring(index, length);
            else
                yield return input.Substring(index);

            index += length;
        }
    }

This takes a string, and it chops it up into a collection of strings that do not exceed the given length.

This works well - however I'd like to go further. It chops words in half. I don't need it to understand anything complicated, I just want it to be able to chop a string off 'early' if cutting it at the length would be cutting in the middle of text (basically anything that isn't whitespace).

However I suck at LINQ, so I was wondering if anyone had an idea on how to go about this. I know what I am trying to do, but I'm not sure how to approach it.

So let's say I have the following text.

This is a sample block of text that I would pass through the string splitter.

I call this method SplitOnLength(6) I would get the following.

  • This i
  • s a sa
  • mple b
  • lock o
  • f text
  • that I
  • would
  • pass t
  • hrough
  • the s
  • tring
  • splitt
  • er.

I would rather it be smart enough to stop and look more like ..

  • This
  • is a
  • sample

// bad example, since the single word exceeds maximum length, but the length would be larger numbers in real scenarios, closer to 200.

Can anyone help me?


Update 2

I noticed in Saeed's other answer that he skipped my suggestion since it threw an exception. This is an interesting topic. Whenever we as developers work on a solution to a problem, we need to consider the exceptional cases—the unexpected inputs, the invalid states, etc. I felt that since the requirement of the question was:

I just want it to be able to chop a string off 'early' if cutting it at the length would be cutting in the middle of text (basically anything that isn't whitespace).

...that, in the event that this is impossible (i.e., in order to return a substring of the specified length it is necessary to cut a word in half because the word is too long), it would be appropriate to throw an exception. But obviously this is a subjective matter. However, realizing that there is more than one way to skin this cat, I have updated my solution (both on pastebin and below) to take a WordPolicy enum rather than a simple bool. This enum has three values: None (equivalent to false from before), ThrowIfTooLong (equivalent to true from before), and CutIfTooLong (if the word must be cut, just cut it).

I also added benchmarks of some of the other answers**, included below. Note that this time around I tested multiple runs with different length parameters (5, 10, 25, 200, 500, 1000). For the shortest length (5), the results seem somewhat even, with Saeed's suggestion on top. As length grows larger, the performance of Saeed's suggestion grows worse and worse. Simon's and Jay's suggestions appear to be significantly more scalable in the case of large inputs.

Also keep in mind that the OP has explicitly said that the value for length would be "closer to 200" in a realistic scenario; so using 200 as an input is not contrived. It is in fact a realistic case.

New Benchmarks

Enter a split size: 5

Results from longer input:
Saeed (original): 2.8886 ms
Saeed: 3.2685 ms
Dan: 3.3163 ms
Simon: 7.4182 ms
Jay: 36.7348 ms

Results from shorter input:
Saeed (original): 0.031 ms
Saeed: 0.0357 ms
Dan: 0.0514 ms
Simon: 0.1047 ms
Jay: 0.2885 ms

Go again? y

Enter a split size: 10

Results from longer input:
Dan: 1.8798 ms
Saeed: 3.8205 ms
Saeed (original): 4.0899 ms
Simon: 4.9869 ms
Jay: 14.9627 ms

Results from shorter input:
Dan: 0.022 ms
Saeed (original): 0.0396 ms
Saeed: 0.0466 ms
Simon: 0.0483 ms
Jay: 0.2022 ms

Go again? y

Enter a split size: 25

Results from longer input:
Dan: 0.6713 ms
Simon: 2.7506 ms
Saeed: 4.5075 ms
Saeed (original): 4.7827 ms
Jay: 6.3477 ms

Results from shorter input:
Dan: 0.0131 ms
Simon: 0.0301 ms
Saeed (original): 0.0441 ms
Saeed: 0.0488 ms
Jay: 0.1176 ms

Go again? y

Enter a split size: 200

Results from longer input:
Dan: 0.1952 ms
Jay: 1.5764 ms
Simon: 1.8175 ms
Saeed (original): 6.8025 ms
Saeed: 7.5221 ms

Results from shorter input:
Dan: 0.0092 ms
Simon: 0.0206 ms
Saeed: 0.0581 ms
Saeed (original): 0.0586 ms
Jay: 0.0812 ms

Go again? y

Enter a split size: 500

Results from longer input:
Dan: 0.1463 ms
Jay: 1.2923 ms
Simon: 1.7326 ms
Saeed (original): 8.686 ms
Saeed: 9.1726 ms

Results from shorter input:
Dan: 0.0061 ms
Simon: 0.0192 ms
Saeed (original): 0.0664 ms
Saeed: 0.0748 ms
Jay: 0.0832 ms

Go again? y

Enter a split size: 1000

Results from longer input:
Dan: 0.1293 ms
Jay: 1.1683 ms
Simon: 1.7292 ms
Saeed: 11.7121 ms
Saeed (original): 11.8752 ms

Results from shorter input:
Dan: 0.0058 ms
Simon: 0.0187 ms
Saeed (original): 0.0765 ms
Jay: 0.0801 ms
Saeed: 0.084 ms

Go again? n

**Unfortunately, with the "large" input (the short story), I couldn't test Jay's original approach which was incredibly costly—an unbelievably deep call stack due to the recursion, plus an insane number of very large string allocations due to the string.Substring call on a huge string.


Update

I hope this doesn't come across as defensive, but I think there is some very misleading information being presented amidst the comments and some other answers here. In particular Saeed's answer, the accepted answer, has some definite efficiency shortcomings. This wouldn't be such a big deal except that Saeed has claimed in some comments that other answers (including this one) are less efficient.

Comparison of performance

First of all, numbers don't lie. Here is a sample program I wrote to test the method below as well as Saeed's method against example inputs, long and short.

And here are some sample results:

Results from longer input:
SplitOnLength: 0.8073 ms
SaeedsOriginalApproach: 4.724 ms
SaeedsApproach: 4.9095 ms

Results from shorter input:
SplitOnLength: 0.0156 ms
SaeedsOriginalApproach: 0.0522 ms
SaeedsApproach: 0.046 ms

SplitOnLength represents my method below, SaeedsOriginalApproach represents the first suggestion in Saeed's answer, and SaeedsApproach represents Saeed's updated answer using deferred execution. The test inputs were:

  • The entire text of Franz Kafka's 'In the Penal Colony' (a short story—long input)
  • An excerpt from the text of the OP's question (short input)
  • A length parameter of 25 (to accommodate some longer words)

Notice that SplitOnLength executed in a fraction of the time of the methods suggested in Saeed's answer. I will concede, however, that the length parameter is likely to have an effect on this. With a smaller length, I wouldn't be surprised to see the performance of SaeedsOriginalApproach and SaeedsApproach improve significantly*.

Explanation

Now, just a few remarks on what I think is going on here. Right off the bat, Saeed's answer starts with a call to string.Split. Now here's something interesting Microsoft has to say about this method:

The Split methods allocate memory for the returned array object and a String object for each array element. If your application requires optimal performance or if managing memory allocation is critical in your application, consider using the IndexOf or IndexOfAny method, and optionally the Compare method, to locate a substring within a string. [emphasis mine]

In other words, the string.Split method is not endorsed by Microsoft as an appropriate way to achieve optimal performance. The reason I highlight the last line is that Saeed seems to be specifically questioning the efficiency of answers that use string.Substring. But this is the most efficient way to solve this problem, period.

Then inside the method suggested by Saeed, we have code that looks like this:

ret2[index] += ' ' + item;

*This is the reason I believe that the performance of Saeed's approach degrades with higher and higher length parameters. As those of us who've been bitten by the performance of string concatenation before know, this is allocating a new string object on every append, which is wasteful. The problem only becomes worse as length gets longer. Notice the SplitOnLength method below does not suffer from this problem.

Another way of looking at this is simply by breaking down the different steps that need to take place. Below, let N be the length of the input string and K be the specified maximum length of a substring.

Saeed's answer

  1. First, string.Split(' '). This requires enumerating over the entire string once, and the allocation of a string object for every word in the string—which is likely to be more than we need (we're almost certainly going to concatenate some of them)—as well as an array to contain these objects. Let the number of strings in the array now be X.
  2. Then there is a second enumeration over the X results of #1. During this enumeration, there are 0–X string concatenations using +=, allocating between 0–X new string objects.

SplitOnLength (below)

  1. The string is likely never enumerated fully. There are at most N comparisons performed, but in a more realistic case the number is less than N. This is because on each iteration the algorithm optimistically goes straight to the next "best" index and only steps backwards if/as necessary to find the most recent whitespace.
  2. The only new string objects allocated are exactly those that are needed, using string.Substring.

string.Substring, by the way, is very fast. It can be because there is no "extra" work to do; the exact length of the substring is known in advance, so there is no waste created during allocation (as would happen with, e.g., the += operator).

Again, the reason I am adding this behemoth update to my answer is to point out the performance issues with Saeed's suggestion and to dispute his claims about the alleged efficiency problems in some of the other answers.


Original Answer

My instinctive approach was to start with what you had and augment it slightly, with the addition of a little bit of code that would:

  1. Check if index + length is followed by a whitespace; and, if not:
  2. Step backwards into the string until finding a whitespace.

Here's the modified SplitOnLength method:

enum WordPolicy
{
    None,
    ThrowIfTooLong,
    CutIfTooLong
}

public static IEnumerable<string> SplitOnLength(this string input, int length, WordPolicy wordPolicy)
{
    int index = 0;
    while (index < input.Length)
    {
        int stepsBackward = 0;

        if (index + length < input.Length)
        {
            if (wordPolicy != WordPolicy.None)
            {
                yield return GetBiggestAllowableSubstring(input, index, length, wordPolicy, out stepsBackward);
            }
            else
            {
                yield return input.Substring(index, length);
            }
        }
        else
        {
            yield return input.Substring(index);
        }

        index += (length - stepsBackward);
    }
}

static string GetBiggestAllowableSubstring(string input, int index, int length, WordPolicy wordPolicy, out int stepsBackward)
{
    stepsBackward = 0;

    int lastIndex = index + length - 1;

    if (!char.IsWhiteSpace(input[lastIndex + 1]))
    {
        int adjustedLastIndex = input.LastIndexOf(' ', lastIndex, length);
        stepsBackward = lastIndex - adjustedLastIndex;
        lastIndex = adjustedLastIndex;
    }

    if (lastIndex == -1)
    {
        if (wordPolicy == WordPolicy.ThrowIfTooLong)
        {
            throw new ArgumentOutOfRangeException("The input string contains at least one word greater in length than the specified length.");
        }
        else
        {
            stepsBackward = 0;
            lastIndex = index + length - 1;
        }
    }

    return input.Substring(index, lastIndex - index + 1);
}

This approach has the advantage of not doing more work than necessary; note the absence of any string.Split call and the fact that string.Substring is only called in the places where the result is actually being returned. In other words, no superfluous string objects are created using this method—only the ones you actually want.


I'll solve this by for loop:

var ret1 = str.Split(' ');
var ret2 = new List<string>();
ret2.Add("");
int index = 0;
foreach (var item in ret1)
{
    if (item.Length + 1 + ret2[index].Length <= allowedLength)
    {
        ret2[index] += ' ' + item;
        if (ret2[index].Length >= allowedLength)
        {
            ret2.Add("");
            index++;
        }
    }
    else
    {
        ret2.Add(item);
        index++;
    }
}
return ret2;

first I thought about a Zip, but it's not good here.

and differed execution version with yield:

public static IEnumerable<string> SaeedsApproach(this string str, int allowedLength)
{
    var ret1 = str.Split(' ');
    string current = "";
    foreach (var item in ret1)
    {
        if (item.Length + 1 + current.Length <= allowedLength)
        {
            current += ' ' + item;
            if (current.Length >= allowedLength)
            {
                yield return current;
                current = "";
            }
        }
        else
        {
            yield return current;
            current = "";
        }
    }
}


Try using String.Split(' ') to turn the single string into an array of individual words. Then, iterate through them, building the longest string that (with the spaces re-added) is less than the limit, append a newline and yield.


I suck with LINQ too :-), but here is a code that works without allocating anything (except output words of course), removes whitespaces, removes empty string, trims strings, and never breaks words (it's a design choice) - I'd be interested to see a full LINQ equivalent:

public static IEnumerable<string> SplitOnLength(this string input, int length)
{
    if (input == null)
        yield break;

    string chunk;
    int current = 0;
    int lastSep = -1;
    for (int i = 0; i < input.Length; i++)
    {
        if (char.IsSeparator(input[i]))
        {
            lastSep = i;
            continue;
        }

        if ((i - current) >= length)
        {
            if (lastSep < 0) // big first word case
                continue;

            chunk = input.Substring(current, lastSep - current).Trim();
            if (chunk.Length > 0)
                yield return chunk;

            current = lastSep;
        }
    }
    chunk = input.Substring(current).Trim();
    if (chunk.Length > 0)
        yield return chunk;
}


I had to put an answer forward as I felt that the other answers relied too much on indexing and complicated logic. I think my answer is quite a bit simpler.

public static IEnumerable<string> SplitOnLength(this string input, int length)
{
    var words = input.Split(new [] { " ", }, StringSplitOptions.None);
    var result = words.First();
    foreach (var word in words.Skip(1))
    {
        if (result.Length + word.Length > length)
        {
            yield return result;
            result = word;
        }
        else
        {
            result += " " + word;
        }
    }
    yield return result;
}

The result for the sample string provided in the OP is:

This is
a
sample
block
of text
that I
would
pass
through
the
string
splitter.


public static IEnumerable<string> SplitOnLength(this string source,int maxLength)
{
   //check parameters' validity and then
   int currentIndex = 0;
   while (currentIndex + maxLength < source.Length)
   {
      int prevIndex = currentIndex;
      currentIndex += maxLength;
      while (currentIndex >= 0 && source[currentIndex] != ' ') currentIndex--;
      if (currentIndex <= prevIndex)
             throw new ArgumentException("invalid maxLength");
      yield return source.Substring(prevIndex, currentIndex - prevIndex);
      currentIndex++;
   }
   yield return source.Substring(currentIndex);
}

Test case:

"this is a test".SplitOnLength(5).ToList()
                .ForEach(x => Console.WriteLine("|" + x + "|"));

Output:

|this|
|is a|
|test|


Ok, I tested all ways (jay RegEx way, not LINQ one) also I got exception by Dan taos way when I set a maintainWords to true for all position so I skip it.

This is what I'd done:

List<DifferentTypes> smallStrings = new List<DifferentTypes>();
List<DifferentTypes> mediomStrings = new List<DifferentTypes>();
List<DifferentTypes> largeStrings = new List<DifferentTypes>();

for (int i = 0; i < 10; i++)
{
    string strSmallTest = "This is a small string test for different approachs provided here.";

    smallStrings.Add(Approachs(strSmallTest, "small"));

    string mediomSize = "Any public static (Shared in Visual Basic) members of this type are thread safe. Any instance members are not guaranteed to be thread safe."
                        + "Windows 7, Windows Vista SP1 or later, Windows XP SP3, Windows Server 2008 (Server Core Role not supported), Windows Server 2008 R2 "
                        + "(Server Core Role not supported), Windows Server 2003 SP2"
                        + " .NET Framework does not support all versions of every platform. For a list of the supported versions, see .NET Framework System Requirements. ";
    mediomStrings.Add(Approachs(mediomSize, "Mediom"));

    string largeSize =
        "This is a question that I get very frequently, and I always tried to dodged the bullet, but I get it so much that I feel that I have to provide an answer. Obviously, I am (not so) slightly biased toward NHibernate, so while you read it, please keep it in mind." +
        "EF 4.0 has done a lot to handle the issues that were raised with the previous version of EF. Thinks like transparent lazy loading, POCO classes, code only, etc. EF 4.0 is a much nicer than EF 1.0." +
        "The problem is that it is still a very young product, and the changes that were added only touched the surface. I already talked about some of my problems with the POCO model in EF, so I won’t repeat that, or my reservations with the Code Only model. But basically, the major problem that I have with those two is that there seems to be a wall between what experience of the community and what Microsoft is doing. Both of those features shows much of the same issues that we have run into with NHibernate and Fluent NHibernate. Issues that were addressed and resolved, but show up in the EF implementations." +
        "Nevertheless, even ignoring my reservations about those, there are other indications that NHibernate’s maturity makes itself known. I run into that several times while I was writing the guidance for EF Prof, there are things that you simple can’t do with EF, that are a natural part of NHibernate." +
        "I am not going to try to do a point by point list of the differences, but it is interesting to look where we do find major differences between the capabilities of NHibernate and EF 4.0. Most of the time, it is in the ability to fine tune what the framework is actually doing. Usually, this is there to allow you to gain better performance from the system without sacrificing the benefits of using an OR/M in the first place.";

    largeStrings.Add(Approachs(largeSize, "Large"));

    Console.WriteLine();
}

Console.WriteLine("/////////////////////////");
Console.WriteLine("average small for saeed: {0}", smallStrings.Average(x => x.saeed));
Console.WriteLine("average small for Jay: {0}", smallStrings.Average(x => x.Jay));
Console.WriteLine("average small for Simmon: {0}", smallStrings.Average(x => x.Simmon));

Console.WriteLine("/////////////////////////");
Console.WriteLine("average mediom for saeed: {0}", mediomStrings.Average(x => x.saeed));
Console.WriteLine("average mediom for Jay: {0}", mediomStrings.Average(x => x.Jay));
Console.WriteLine("average mediom for Simmon: {0}", mediomStrings.Average(x => x.Simmon));

Console.WriteLine("/////////////////////////");
Console.WriteLine("average large for saeed: {0}", largeStrings.Average(x => x.saeed));
Console.WriteLine("average large for Jay: {0}", largeStrings.Average(x => x.Jay));
Console.WriteLine("average large for Simmon: {0}", largeStrings.Average(x => x.Simmon));

And:

private static DifferentTypes Approachs(string stringToDecompose, string text2Write)
{
    DifferentTypes differentTypes;
    Stopwatch sw = new Stopwatch();
    sw.Start();
    for (int i = 0; i < 1000; i++)
    {
        var strs = stringToDecompose.SaeedsApproach(10);
        foreach (var item in strs)
        { 
        }
    }
    sw.Stop();
    Console.WriteLine("Saeed's Approach takes {0} millisecond for {1} strings", sw.ElapsedMilliseconds, text2Write);
    differentTypes.saeed = sw.ElapsedMilliseconds;

    sw.Restart();
    for (int i = 0; i < 1000; i++)
    {
        var strs = stringToDecompose.JaysApproach(10);
        foreach (var item in strs)
        {
        }
    }
    sw.Stop();
    Console.WriteLine("Jay's Approach takes {0} millisecond for {1} strings", sw.ElapsedMilliseconds, text2Write);
    differentTypes.Jay = sw.ElapsedMilliseconds;

    sw.Restart();
    for (int i = 0; i < 1000; i++)
    {
        var strs = stringToDecompose.SimmonsApproach(10);
        foreach (var item in strs)
        {
        }
    }
    sw.Stop();
    Console.WriteLine("Simmon's Approach takes {0} millisecond for {1} strings", sw.ElapsedMilliseconds, text2Write);
    differentTypes.Simmon = sw.ElapsedMilliseconds;

    return differentTypes;
}

The results:

average small for saeed: 4.6
average small for Jay: 33.9
average small for Simmon: 5.6

average mediom for saeed: 28.7
average mediom for Jay: 173.9
average mediom for Simmon: 38.7

average large for saeed: 115.3
average large for Jay: 594.2
average large for Simmon: 138.7

Test it simply in your PCs and feel free to edit this to leave your test result or improve current function. I'm sure if we test it with larger strings we can see very difference between my approach and yours.

Edit: I edited my approach to use foreach and yield, see my code above. the result is:

average small for saeed: 6.5
average small for Jay: 34.5
average small for Simmon: 5.9

average mediom for saeed: 30.6
average mediom for Jay: 157.9
average mediom for Simmon: 35

average large for saeed: 122.4
average large for Jay: 584
average large for Simmon: 157

Here is my (Jay's) test:

class Program
{
    static void Main()
    {
        var s =
            "Lorem ipsum dolor sit amet, consectetuer adipiscing elit, " 
            + "sed diam nonummy nibh euismod tincidunt ut laoreet dolore " 
            + "magna aliquam erat volutpat. Ut wisi enim ad minim veniam, " 
            + "quis nostrud exerci tation ullamcorper suscipit lobortis nisl " 
            + "ut aliquip ex ea commodo consequat. Duis autem " 
            + "vel eum iriure dolor in hendrerit in vulputate velit " 
            + "esse molestie consequat, vel illum dolore eu feugiat " 
            + "nulla facilisis at vero eros et accumsan et iusto " 
            + "odio dignissim qui blandit praesent luptatum zzril delenit augue " 
            + "duis dolore te feugait nulla facilisi. Nam liber tempor " 
            + "cum soluta nobis eleifend option congue nihil imperdiet doming id " 
            + "quod mazim placerat facer possim assum. Typi non habent " 
            + "claritatem insitam; est usus legentis in iis qui facit " 
            + "eorum claritatem. Investigationes demonstraverunt lectores legere me lius quod " 
            + "ii legunt saepius. Claritas est etiam processus dynamicus, " 
            + "qui sequitur mutationem consuetudium lectorum. Mirum est notare quam " 
            + "littera gothica, quam nunc putamus parum claram, anteposuerit " 
            + "litterarum formas humanitatis per seacula quarta decima et quinta decima" 
            + ". Eodem modo typi, qui nunc nobis videntur parum clari" 
            + ", fiant sollemnes in futurum.";
        s += s;
        s += s;
        s += s;

        var watch = new Stopwatch();

        watch.Start();
        for (int i = 1; i <= 10000; i++) s.JaysApproach(200).ToList();
        watch.Stop();

        Console.WriteLine("Jay:   {0}", watch.ElapsedTicks / 10000);

        watch.Reset();

        watch.Start();
        for (int i = 1; i <= 10000; i++) s.SaeedsApproach(200);
        watch.Stop();

        Console.WriteLine("Saeed: {0}", watch.ElapsedTicks / 10000);

        watch.Reset();

        watch.Start();
        for (int i = 1; i <= 10000; i++) s.SimonsApproach(200).ToList();
        watch.Stop();

        Console.WriteLine("Simon: {0}", watch.ElapsedTicks / 10000);

        Console.ReadLine();
    }
}

Results:

4 lorem ipsums (as shown):
    Jay:   317
    Saeed: 1069
    Simon: 599

3 lorems ipsums:
    Jay:   283
    Saeed: 862
    Simon: 465

2 lorem ipsums:
    Jay:   189
    Saeed: 417
    Simon: 236

1 lorem ipsum:
    Jay:   113
    Saeed: 204
    Simon: 118


update:

The one-liner:

    public static IEnumerable<string> SplitOnLength(this string s, int length)
    {
        return Regex.Split(s, @"(.{0," + length + @"}) ")
            .Where(x => x != string.Empty);
    }

I've profiled this versus the accepted answer with ~9,300 character source (lorem ipsum x4) split at-or-before 200 characters.

10,000 passes:
- loop takes ~4,200 ms
- mine takes ~1,200 ms

original answer:

This method will shorten the result to avoid breaking words, except when the word exceeds the specified length, in which case it will break it.

public static IEnumerable<string> SplitOnLength(this string s, int length)
{
    var pattern = @"^.{0," + length + @"}\W";
    var result = Regex.Match(s, pattern).Groups[0].Value;

    if (result == string.Empty)
    {
        if (s == string.Empty) yield break;
        result = s.Substring(0, length);
    }

    yield return result;

    foreach (var subsequent_result in SplitOnLength(s.Substring(result.Length), length))
    {
        yield return subsequent_result;
    }
}
0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜