LINQ memory question

As I am rather new to LINQ, I'd like to ask if my understanding is correct in the following example.

Let's assume that I have a very large collection of animal names (100k records). I'd like to filter them and process the filtered items in a very time-consuming method (2 weeks). The methods RunWithLinq() and RunWithoutLinq() do exactly the same thing.

Is it true that, using the first method, the original (large) collection will stay in memory after leaving the method and will not be touched by the GC, whereas using the LINQ-less method the collection will be removed by the GC?

I'd be grateful for an explanation.

class AnimalProcessor
{
    private IEnumerable<string> animalsToProcess;
    internal AnimalProcessor(IEnumerable<string> animalsToProcess)
    {
        this.animalsToProcess = animalsToProcess;
    }
    internal void Start()
    {
        //do sth for 2 weeks with the collection
    }
}
class Program
{
    static void RunWithLinq()
    {
        var animals = new string[] { "cow", "rabbit", "newt", "ram" };
        var filtered = from animal in animals
                       where animal.StartsWith("ra")
                       select animal;
        AnimalProcessor ap = new AnimalProcessor(filtered);
        ap.Start();
    }
    static void RunWithoutLinq()
    {
        var animals = new string[] { "cow", "rabbit", "newt", "ram" };
        var filtered = new List<string>();
        foreach (string animal in animals)
            if(animal.StartsWith("ra")) filtered.Add(animal);
        AnimalProcessor ap = new AnimalProcessor(filtered);
        ap.Start();
    }
}


Well, animals will be eligible for collection by the end of each method, so strictly your statement is false. animals becomes eligible for collection sooner in the non-LINQ case, so the gist of your statement is true.

It is true that the memory use of each differs. However, there is an implication here that LINQ is generally worse in terms of memory usage, while in reality it very often allows for much better memory usage than the other sort of approach. (There are non-LINQ ways of doing the same thing as the LINQ way; I was quite fond of the same basic approach to this particular issue when I used .NET 2.0.)

Let's consider the two methods, non-LINQ first:

var animals = new string[] { "cow", "rabbit", "newt", "ram" };
var filtered = new List<string>();
foreach (string animal in animals)
//at this point we have both animals and filtered in memory, filtered is growing.
    if(animal.StartsWith("ra")) filtered.Add(animal);
//at this point animals is no longer used. While still "in scope" in the source
//code, it is eligible for collection in the compiled code.
AnimalProcessor ap = new AnimalProcessor(filtered);
//at this point we have filtered and ap in memory.
ap.Start();
//at this point ap and filtered become eligible for collection.

It's worth noting two things. One, "eligible" for collection does not mean collection will happen at that point, just that it can happen at any point in the future. Two, collection can happen while an object is still in scope if it doesn't get used again (and even in some cases where it is used, but that's another level of detail). Scope rules relate to the program source and are a matter of what could happen as the program is written (the programmer can add code that uses the object). GC eligibility rules relate to the compiled program and are a matter of what did happen when the program was written (the programmer could have added such code, but they didn't).

Now let's look at the LINQ case:

var animals = new string[] { "cow", "rabbit", "newt", "ram" };
var filtered = from animal in animals
               where animal.StartsWith("ra")
               select animal;
// at this point we have both animals and filtered in memory.
// filtered is an object that acts upon animals.
AnimalProcessor ap = new AnimalProcessor(filtered);
// at this point we have ap, filtered and animals in memory.
ap.Start();
// at this point ap, filtered and animals become eligible for collection.

So here in this case none of the relevant objects can be collected until the very end.

However, note that in the LINQ case filtered is never a large object. In the non-LINQ case, filtered is a list that contains somewhere in the range of 0 to n objects, where n is the size of animals. In the LINQ case, filtered is an object that works upon animals as needed and in itself uses essentially constant memory.
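The point that filtered only acts upon animals, rather than holding its own copy, can be checked with a small sketch. The names DeferredDemo and RaNames are hypothetical and exist only for this illustration; the query itself is the one from the question.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class DeferredDemo
{
    // Builds the same query as in the question; nothing is filtered yet.
    public static IEnumerable<string> RaNames(string[] animals)
    {
        return from animal in animals
               where animal.StartsWith("ra")
               select animal;
    }

    static void Main()
    {
        var animals = new string[] { "cow", "rabbit", "newt", "ram" };
        var filtered = RaNames(animals);

        // Change the source after the query has been defined.
        animals[0] = "rat";

        // The query reads the array only now, so it sees the change.
        Console.WriteLine(string.Join(", ", filtered)); // rat, rabbit, ram
    }
}
```

Because the query has to read animals whenever it is enumerated, it must keep a reference to the array, which is why animals cannot be collected while filtered is still in use.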

Hence the peak memory use of the non-LINQ version is higher, as there will be a point where animals still exists and filtered contains all relevant objects. As the size of animals increases with changes to the program, it is actually the non-LINQ version that is most likely to hit a serious memory shortage first, because of the peak-memory-use state being worse in the non-LINQ case.

Another thing to consider is that in a real-world case where we had enough items to worry about memory consumption, it is likely that our source is not going to be a list. Consider:

IEnumerable<string> getAnimals(TextReader rdr)
{
  using(rdr)
    for(string line = rdr.ReadLine(); line != null; line = rdr.ReadLine())
      yield return line;
}

This code reads a text file and returns each line at a time. If each line held the name of an animal, we could use this instead of var animals as our source to filtered.

In this case, though, the LINQ version has very little memory use (only ever needing one animal name to be in memory at a time), while the non-LINQ version has much greater memory use (loading each animal name that begins with "ra" into memory before further action). The LINQ version will also start processing after a few milliseconds at most, while the non-LINQ version has to load everything first, before it can do a single piece of work.

Hence the LINQ version could happily deal with gigabytes of data without using more memory than it would take to deal with a handful, while the non-LINQ version would struggle with memory issues.
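As a rough sketch of that streaming behaviour, the getAnimals iterator can be combined with a LINQ filter so that no list of names is ever built. StreamingDemo and CountRa are hypothetical names, and a StringReader stands in for a large file purely as an assumption for the demo.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

static class StreamingDemo
{
    // Same iterator as above: yields one line at a time and
    // disposes the reader once enumeration finishes.
    public static IEnumerable<string> GetAnimals(TextReader rdr)
    {
        using (rdr)
            for (string line = rdr.ReadLine(); line != null; line = rdr.ReadLine())
                yield return line;
    }

    // Counts the matching names; each line is read, tested and then dropped.
    public static int CountRa(TextReader rdr)
    {
        return GetAnimals(rdr).Where(animal => animal.StartsWith("ra")).Count();
    }

    static void Main()
    {
        // Stand-in for a file with one animal name per line.
        var reader = new StringReader("cow\nrabbit\nnewt\nram");
        Console.WriteLine(CountRa(reader)); // 2
    }
}
```

With a real file you would pass something like a StreamReader instead; the filtering code would not need to change.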

Finally, it's important to note that this doesn't really have anything to do with LINQ itself, so much as with the difference between the approach you take with LINQ and the approach you take without it. To make the LINQ version equivalent to the non-LINQ version, use:

var filtered = (from animal in animals
                   where animal.StartsWith("ra")
                   select animal).ToList();

To make the non-LINQ version equivalent to the LINQ version, use:

var filtered = FilterAnimals(animals);

where you also define:

private static IEnumerable<string> FilterAnimals(IEnumerable<string> animals)
{
  foreach(string animal in animals)
    if(animal.StartsWith("ra"))
      yield return animal;
}

This uses .NET 2.0 techniques, but you can do the same even with .NET 1.1 (though with more code) by creating an object that implements IEnumerable.
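For illustration, here is one way the hand-written version might look, using only the non-generic IEnumerable and IEnumerator interfaces. RaFilter and RaEnumerator are hypothetical names, and the sketch leaves out disposal of the inner enumerator for brevity.

```csharp
using System;
using System.Collections;

// Hand-written counterpart of the FilterAnimals iterator (hypothetical names).
class RaFilter : IEnumerable
{
    private readonly IEnumerable source;
    public RaFilter(IEnumerable source) { this.source = source; }

    public IEnumerator GetEnumerator()
    {
        return new RaEnumerator(source.GetEnumerator());
    }

    private class RaEnumerator : IEnumerator
    {
        private readonly IEnumerator inner;
        public RaEnumerator(IEnumerator inner) { this.inner = inner; }

        public object Current { get { return inner.Current; } }

        // Advance the underlying enumerator until a match is found.
        public bool MoveNext()
        {
            while (inner.MoveNext())
            {
                string animal = inner.Current as string;
                if (animal != null && animal.StartsWith("ra"))
                    return true;
            }
            return false;
        }

        public void Reset() { inner.Reset(); }
    }
}

static class Program
{
    static void Main()
    {
        string[] animals = { "cow", "rabbit", "newt", "ram" };
        foreach (string animal in new RaFilter(animals))
            Console.WriteLine(animal); // rabbit, then ram
    }
}
```

Like the iterator version, this holds only a reference to the source plus the state of one enumerator, so its own memory use stays essentially constant.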


The LINQ-based method will keep the original collection in memory, but will not store a separate collection with the filtered items.

To change this behavior, call .ToList().


Yes, that's right - because the filtered variable is essentially the query, not the results of the query. Iterating over it will re-evaluate the query each time.

If you want to make them the same, you can just call ToList:

var filtered = animals.Where(animal => animal.StartsWith("ra"))
                      .ToList();

(I've converted it from query expression syntax to "dot notation" because in this case it's simpler that way.)
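To see the re-evaluation in practice, here is a minimal sketch that counts how often the predicate runs. ReevaluationDemo, PredicateCalls, Query and Consume are hypothetical names, and the counter is a side effect added purely for the demonstration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class ReevaluationDemo
{
    // How many times the predicate has run (demo-only side effect).
    public static int PredicateCalls = 0;

    static bool StartsWithRa(string animal)
    {
        PredicateCalls++;
        return animal.StartsWith("ra");
    }

    public static IEnumerable<string> Query(IEnumerable<string> animals)
    {
        return animals.Where(animal => StartsWithRa(animal));
    }

    // Enumerates a sequence once and discards the items.
    public static void Consume(IEnumerable<string> items)
    {
        foreach (string item in items) { }
    }

    static void Main()
    {
        var animals = new string[] { "cow", "rabbit", "newt", "ram" };

        // Deferred query: every enumeration runs the filter again.
        var lazy = Query(animals);
        Consume(lazy);
        Consume(lazy);
        Console.WriteLine(PredicateCalls); // 8

        // ToList runs the filter once and stores the results.
        PredicateCalls = 0;
        var list = Query(animals).ToList();
        Consume(list);
        Consume(list);
        Console.WriteLine(PredicateCalls); // 4
    }
}
```

The trade-off is the one discussed above: ToList avoids repeated work and releases the dependency on the source, at the cost of holding all filtered items in memory at once.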
