[Optimize This]: Slow LINQ to Objects Query
I have this query that is bothering me; it is encapsulated as a new query operator, I made two versions of it, trying to see which one performs better. Both perform horribly.
First attempt; declarative style
public static IEnumerable<IEnumerable<α>> Section<α>(this IEnumerable<α> source, int length)
{
return source.Any()
开发者_StackOverflow中文版 ? source.Take(length).Cons(source.Skip(length).Section(length))
: Enumerable.Empty<IEnumerable<α>>();
}
Second attempt: imperative "yield return" style
public static IEnumerable<IEnumerable<α>> Section<α>(this IEnumerable<α> source, int length)
{
var fst = source.Take(length);
var rst = source.Skip(length);
yield return fst;
if (rst.Any())
foreach (var section in rst.Section(length))
yield return section;
}
In fact the second attempt is worse, both in terms of readability, compositionality and in terms of speed.
Any clues as to how to optimize this?
If I understand your question correctly, you're trying to build a lazy implementation of an enumerator that splits a larger collection of items into smaller enumerable collections of items.
For instance, a sequence of a million numbers could be split up into "sections", each producing only 100 of them, and you want it all done lazily, ie. no collecting 100 items into a list before producing them.
First, your attempts will re-iterate over the collection many times, which is bad, hence the performance problem.
If you're trying to build a pure lazy implementation, you should consider the following issues:
- You want to iterate over the underlying collection only once
- You should return enumerables that re-use the underlying enumerator
- You need to handle that a section you return isn't completely enumerated over (for instance, the calling code only needed the first 50 of those 100 items).
Edit: Before I go into my simplistic solution, here's a few caveats about it:
- You cannot save each section for later, ie. you cannot do:
collection.Sequence(10).ToArray()
to get an array of sections. - You cannot enumerate over each section more than once, because when you do, it changes a hidden underlying data structure.
Basically: My solution is not general purpose. You should go with @LBushkin's comment about MoreLinq Batch if you need this, and I would hesitate to put my code into a class library, it would have to be local where it is needed, or renamed to something that clearly warns you about the issues with it.
Here's a simplistic implementation, I'm pretty sure there are bugs here, so you might want to look at implementing a ton of unit-testing for edgecases:
using System;
using System.Collections.Generic;
using System.Linq;
namespace ConsoleApplication20
{
class SectionEnumerable<T> : IEnumerable<T>
{
private readonly IEnumerator<T> _Enumerator;
public SectionEnumerable(IEnumerator<T> enumerator, int sectionSize)
{
_Enumerator = enumerator;
Left = sectionSize;
}
public IEnumerator<T> GetEnumerator()
{
while (Left > 0)
{
Left--;
yield return _Enumerator.Current;
if (Left > 0)
if (!_Enumerator.MoveNext())
break;
}
}
System.Collections.IEnumerator System.Collections.IEnumerable.GetEnumerator()
{
return GetEnumerator();
}
public int Left { get; private set; }
}
static class SequenceExtensions
{
public static IEnumerable<IEnumerable<T>> Section<T>(this IEnumerable<T> collection, int sectionSize)
{
if (collection == null)
throw new ArgumentNullException("collection");
if (sectionSize < 1)
throw new ArgumentOutOfRangeException("sectionSize");
using (IEnumerator<T> enumerator = collection.GetEnumerator())
{
while (enumerator.MoveNext())
{
SectionEnumerable<T> enumerable = new SectionEnumerable<T>(enumerator, sectionSize);
yield return enumerable;
for (int index = 0; index < enumerable.Left; index++)
if (!enumerator.MoveNext())
yield break;
}
}
}
}
class Program
{
static void Main(string[] args)
{
var sequence = Enumerable.Range(0, 100);
var sections = sequence.Section(10);
foreach (var section in sections)
{
Console.WriteLine(
String.Join(", ",
section.Take(5).ToArray().Select(i => i.ToString()).ToArray()));
}
Console.ReadLine();
}
}
}
Output:
0, 1, 2, 3, 4
10, 11, 12, 13, 14
20, 21, 22, 23, 24
30, 31, 32, 33, 34
40, 41, 42, 43, 44
50, 51, 52, 53, 54
60, 61, 62, 63, 64
70, 71, 72, 73, 74
80, 81, 82, 83, 84
90, 91, 92, 93, 94
Things you should unit-test:
- Empty input collection produces no sections
- A collection that has just the right number of elements, produces only one section
- A collection that contains a multiplum of the section-size elements, (ie. 10, 20, 30, etc. number of elements with a section size of 5 or 10), doesn't produce an empty section after all the expected ones
- That it is actually lazy, if you enumerate over the first 10-element section, but only the first 5 of the second section, only the first 15 elements of the underlying collection is enumerated over
I suspect that the problem you're having is related to the fact that enumerating the final result is at least an O(n^2) operation, possibly worse; I haven't worked it all out in my head yet.
Why is that? Well, suppose you have [1, 2, 3, 4, 5, 6] and you split that up into what you think is { { 1, 2 }, {3, 4}, {5, 6} }
That's not what you've done. You've in fact split this up into { take the first two, take the first two and discard them and then take the next two, take the first two and discard then and then take the next two and discard them and then take the third two }
Notice how each step along the way re-calculate the result? That's because the array could be changing between calls to the enumeration. LINQ was designed to always get you up-to-date results; you write a query that means "skip the first four and iterate the next two", that's exactly what you get -- a query that executes that code when you enumerate it.
Is the original sequence small enough and fast enough that you can read the whole thing into memory and split it all up at once, rather than trying to do so lazily? Alternatively, is the sequence indexible? If all you get is forward access to the sequence and it is too big or slow to read into memory all at once then there is not a whole lot you can do here. But if you have one or both of those properties then you can make this at least linear.
Wherever possible, I try to only iterate through a source once within an operator. If the source is something like the result of a Reverse()
operator, calling Any
, Take
and Skip
could cause a lot of nasty performance.
It's not entirely clear what your operator is trying to do, but if you can do it without reading the source multiple times, that may well help - although it will very much depend on what the input is.
Here's another approach not using linq which was much faster then your second method:
public static IEnumerable<IEnumerable<a>> Section<a>(this IEnumerable<a> source, int length)
{
var enumerator = source.GetEnumerator();
var continueLoop = true;
do
{
var list = new List<a>();
var index = 0;
for (int i = 0; i < length; i++)
{
if (enumerator.MoveNext())
{
list.Add(enumerator.Current);
index++;
}
else
{
continueLoop = false;
break;
}
}
if (list.Count > 0)
{
yield return list;
}
} while (continueLoop);
}
Is this any faster? It should be, because it only needs a single iteration through the source sequence.
public static IEnumerable<IEnumerable<T>> Section<T>(
this IEnumerable<T> source, int length)
{
return source
.Select((x, i) => new { Value = x, Group = i / length })
.GroupBy(x => x.Group, y => y.Value);
}
I had an idea today; check this out
public static IEnumerable<α> Take<α>(this IEnumerator<α> iterator, int count)
{
for (var i = 0; i < count && iterator.MoveNext(); i++)
yield return iterator.Current;
}
public static IEnumerable<IEnumerable<α>> Section<α>(this IEnumerator<α> iterator, int length)
{
var sct = Enumerable.Empty<α>();
do
{
sct = iterator.Take(length).ToArray();
if (sct.Any())
yield return sct;
}
while (sct.Any());
}
This is still not super-elegant but at least the implementation is very short and readable.
It may be quite interesting investigating query operators over IEnumerator.
And for convenience
public static IEnumerable<IEnumerable<α>> Section<α>(this IEnumerable<α> source, int length)
{
using (var iterator = source.GetEnumerator())
foreach (var e in iterator.Section(length))
yield return e;
}
How about an extension method
public static class IEnumerableExtensions
{
public static IEnumerable<List<T>> InSetsOf<T>(this IEnumerable<T> source, int max)
{
List<T> toReturn = new List<T>();
foreach(var item in source)
{
toReturn.Add(item);
if (toReturn.Count == max)
{
yield return toReturn;
toReturn = new List<T>();
}
}
if (toReturn.Any())
{
yield return toReturn;
}
}
}
some tests:
[TestFixture]
public class When_asked_to_return_items_in_sets
{
[Test]
public void Should_return_the_correct_number_of_sets_if_the_input_contains_a_multiple_of_the_setSize()
{
List<string> input = "abcdefghij".Select(x => x.ToString()).ToList();
var result = input.InSetsOf(5);
result.Count().ShouldBeEqualTo(2);
result.First().Count.ShouldBeEqualTo(5);
result.Last().Count.ShouldBeEqualTo(5);
}
[Test]
public void Should_separate_the_input_into_sets_of_size_requested()
{
List<string> input = "abcdefghijklm".Select(x => x.ToString()).ToList();
var result = input.InSetsOf(5);
result.Count().ShouldBeEqualTo(3);
result.First().Count.ShouldBeEqualTo(5);
result.Last().Count.ShouldBeEqualTo(3);
}
}
Do you need to keep your original source around for some reason as you progress? If not, why don't you use recursion and use hd :: tl style to pull the head, pass the tl into the recursive call, and on any even number recursion merge the two you have sitting around into a section?
With the updated release of the experimental Ix extensions, you could use the Window
or Buffer
operators to create a sliding window, which should achieve what you are after.
精彩评论