Looking for a technique to load a large number of objects into an IDictionary in .NET
I need to load about 6 million objects into a Dictionary. The problem I have is that simply adding them to a Dictionary while constructing them fragments memory, as the dictionary allocates new internal arrays and deallocates the existing ones. In the end, this way I could only load 2 million of them into memory due to fragmentation of the free memory. The issue is that I do not know the actual number of items. It all depends on user input.
My not-so-perfect solution is this:
1. Use a linked list to store all objects once they are created. I do this because linked lists do not need contiguous free space.
2. Create a dictionary with the exact size needed, so there is no re-allocation of the internal dictionary arrays.
3. Copy the objects over into the dictionary.

This way I can load up to 3 million (a rough sketch of these steps follows below). Any suggestions on how I can improve this? Or, are you aware of a free IDictionary implementation that does not use arrays internally?
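To make those steps concrete, here is a rough sketch of what I am doing; Item, its Key and the source enumerable are placeholders for my real types:

using System.Collections.Generic;

class Item { public string Key; }   // placeholder for the real object type

static class Loader
{
    public static Dictionary<string, Item> LoadAll(IEnumerable<Item> source)
    {
        // 1. Stage everything in a linked list: its nodes are small and do not
        //    need one large contiguous block of free memory.
        var staging = new LinkedList<Item>();
        foreach (var item in source)
            staging.AddLast(item);

        // 2. Create the dictionary with the exact size, so its internal arrays
        //    are allocated once and never re-allocated.
        var result = new Dictionary<string, Item>(staging.Count);

        // 3. Copy the objects over; the linked list can then be collected.
        foreach (var item in staging)
            result.Add(item.Key, item);

        return result;
    }
}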
Thank you
UPDATE: My keys are strings of fixed length, which depends on the value type. They are typically about 8 chars long but can be up to 20 chars. The total possible number of items explodes as the key length increases; fortunately, the current maximum number of items is 12M. The value is a class type of roughly 90-120 bytes total per instance.
This is a WinForms application running on 32-bit Windows, and my typical host machine has 2 GB of memory. There are other parts of the application that waste a lot of memory, but unfortunately I cannot address those now.
The whole fragmentation issue can be solved by using a capacity:
var d = new Dictionary<int, string>(expectedCapacity);
expectedCapacity should be calculated pessimistically, with a little room to spare.
But when the dictionary is used with reference types and/or small value types, this should not make much of a difference anyway, so I think you should re-check your diagnosis.
Fragmentation is only an issue on the Large Object Heap, and 6 million K,V pairs (~6M * 20 bytes = 120 MB) shouldn't cause that.
But do realize how a Dictionary grows: when it is full, its capacity roughly doubles. So when loading (a little over) 8M items you could end up with capacity for 16M, with the earlier 8M, 4M, 2M, etc. blocks also having been placed on the LOH.
That could cause an OOM.
So it is well worth trying to estimate the number of items in advance.
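For example, if the items come from something you can enumerate twice (a file, a list of user selections), pessimistic pre-sizing could look roughly like the sketch below; the Record type and its parser are made up for illustration, and the real count has to come from whatever the user's input is:

using System.Collections.Generic;
using System.IO;
using System.Linq;

class Record
{
    public string Key;
    // Hypothetical parser standing in for the real one.
    public static Record Parse(string line) { return new Record { Key = line }; }
}

static class PreSizedLoad
{
    public static Dictionary<string, Record> Load(string path)
    {
        // Count first, then add ~10% headroom so the dictionary never grows
        // (and therefore never re-allocates its large internal arrays) while loading.
        int expectedCapacity = (int)(File.ReadLines(path).Count() * 1.1);

        var d = new Dictionary<string, Record>(expectedCapacity);
        foreach (var line in File.ReadLines(path))
        {
            var record = Record.Parse(line);
            d.Add(record.Key, record);
        }
        return d;
    }
}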
Would some partitioning help?
I've used an approach where I calculate a one-byte hash, by XOR-ing the bytes of the dictionary key's GetHashCode(), to partition the dictionary into 256 smaller ones. Basically you have an internal Dictionary<byte, Dictionary<K, V>> that holds the values for the outer IDictionary<K, V>.
If you started with a large dictionary class like this:
using System;
using System.Collections.Generic;
using System.Linq;

public class LargeDictionary<K, V> : IDictionary<K, V>
{
    // Up to 256 inner dictionaries, keyed by a one-byte hash of the item's key.
    private readonly Dictionary<byte, Dictionary<K, V>> _inner =
        new Dictionary<byte, Dictionary<K, V>>();

    // Picks (and lazily creates) the inner dictionary that owns the given key,
    // by XOR-ing the four bytes of its hash code down to a single byte.
    private Dictionary<K, V> GetInner(K key)
    {
        var bs = BitConverter.GetBytes(key.GetHashCode());
        var prekey = (byte)(bs[0] ^ bs[1] ^ bs[2] ^ bs[3]);
        if (!_inner.ContainsKey(prekey))
        {
            _inner.Add(prekey, new Dictionary<K, V>());
        }
        return _inner[prekey];
    }

    /* See below */
}
Would you be able to start with this and possibly rebuild parts of the inner dictionary to reclaim memory as you go?
Here's the rest of the class:
public void Add(K key, V value)
{
    this.GetInner(key).Add(key, value);
}

public bool ContainsKey(K key)
{
    return this.GetInner(key).ContainsKey(key);
}

public ICollection<K> Keys
{
    get
    {
        // Flatten the keys of all inner dictionaries into one list.
        var keys = from pk in _inner.Keys
                   from k in _inner[pk].Keys
                   select k;
        return keys.ToList();
    }
}

public bool Remove(K key)
{
    return this.GetInner(key).Remove(key);
}

public bool TryGetValue(K key, out V value)
{
    return this.GetInner(key).TryGetValue(key, out value);
}

public ICollection<V> Values
{
    get
    {
        // Flatten the values of all inner dictionaries into one list.
        var values = from pk in _inner.Keys
                     from v in _inner[pk].Values
                     select v;
        return values.ToList();
    }
}

public V this[K key]
{
    get
    {
        return this.GetInner(key)[key];
    }
    set
    {
        this.GetInner(key)[key] = value;
    }
}

public void Add(KeyValuePair<K, V> item)
{
    this.GetInner(item.Key).Add(item.Key, item.Value);
}

public void Clear()
{
    _inner.Clear();
}

public bool Contains(KeyValuePair<K, V> item)
{
    var inner = this.GetInner(item.Key);
    return inner.ContainsKey(item.Key)
        && inner[item.Key].Equals(item.Value);
}

public void CopyTo(KeyValuePair<K, V>[] array, int arrayIndex)
{
    var source = this.ToArray();
    Array.Copy(source, 0, array, arrayIndex, source.Length);
}

public int Count
{
    get
    {
        // Sum the counts of all inner dictionaries.
        var counts = from pk in _inner.Keys
                     select _inner[pk].Count;
        return counts.Sum();
    }
}

public bool IsReadOnly
{
    get { return false; }
}

public bool Remove(KeyValuePair<K, V> item)
{
    // Note: removes by key only, ignoring the value part of the pair.
    return this.GetInner(item.Key).Remove(item.Key);
}

public IEnumerator<KeyValuePair<K, V>> GetEnumerator()
{
    return _inner.Keys.SelectMany(pk => _inner[pk]).GetEnumerator();
}

System.Collections.IEnumerator System.Collections.IEnumerable.GetEnumerator()
{
    return this.GetEnumerator();
}
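Usage would be the same as with a plain dictionary; the only difference is that the entries are spread over up to 256 smaller inner dictionaries, each of which grows and allocates independently. A quick sketch, with MyValue standing in for your real value type:

class MyValue { /* roughly 90-120 bytes of fields in the real application */ }

// somewhere in the loading code:
var big = new LargeDictionary<string, MyValue>();
big.Add("ABCD1234", new MyValue());

MyValue found;
if (big.TryGetValue("ABCD1234", out found))
{
    // use 'found'
}

// Because each partition is reachable on its own, you could also rebuild a single
// inner dictionary (copy it into a right-sized replacement) to reclaim memory as
// you go, as suggested above; that method is not part of the class shown here.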
6 million objects sounds like a lot to keep in a program's memory, and you probably don't need them all loaded at the same time.
Would it make sense to keep the data outside of the application, maybe in a database (possibly an embedded one such as SQLite or SQL Server Compact)?
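If that route is an option, a minimal sketch of a disk-backed key/value store could look like the following. It assumes the Microsoft.Data.Sqlite NuGet package (System.Data.SQLite or SQL Server Compact would be similar in spirit), and serializing your 90-120 byte objects to a byte[] is left out:

using System;
using Microsoft.Data.Sqlite;

class DiskBackedStore : IDisposable
{
    private readonly SqliteConnection _conn;

    public DiskBackedStore(string path)
    {
        _conn = new SqliteConnection("Data Source=" + path);
        _conn.Open();
        using (var cmd = _conn.CreateCommand())
        {
            // One row per object: the string key plus the serialized payload.
            cmd.CommandText =
                "CREATE TABLE IF NOT EXISTS items (key TEXT PRIMARY KEY, payload BLOB)";
            cmd.ExecuteNonQuery();
        }
    }

    public void Put(string key, byte[] payload)
    {
        using (var cmd = _conn.CreateCommand())
        {
            cmd.CommandText =
                "INSERT OR REPLACE INTO items (key, payload) VALUES ($k, $p)";
            cmd.Parameters.AddWithValue("$k", key);
            cmd.Parameters.AddWithValue("$p", payload);
            cmd.ExecuteNonQuery();
        }
    }

    public byte[] Get(string key)
    {
        using (var cmd = _conn.CreateCommand())
        {
            cmd.CommandText = "SELECT payload FROM items WHERE key = $k";
            cmd.Parameters.AddWithValue("$k", key);
            return cmd.ExecuteScalar() as byte[];
        }
    }

    public void Dispose()
    {
        _conn.Dispose();
    }
}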