
Algorithm to store Item-to-Item-Associations

I need some help storing some data efficiently. I have a large list of objects (about 100,000) and want to store associations between these items with a coefficient. Not all items are associated; in fact, I have about 1 million associations. I need fast access to these associations when referencing by the two items. What I did is a structure like this:

Map<Item, Map<Item, Float>>

I tried this with HashMap and Hashtable. Both work fine and are fast enough. My problem is that all those Maps create a lot of memory overhead; concretely, more than 300 MB for the given scenario. Is there a Map implementation with a smaller footprint? Is there maybe a better algorithm to store this kind of data?
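
For concreteness, this is roughly how that structure is built and queried (AssociationStore is a placeholder name; Item is the question's own class, assumed to implement equals/hashCode):

    import java.util.HashMap;
    import java.util.Map;

    class AssociationStore {
        // One inner map per item; roughly 1 million entries in total.
        private final Map<Item, Map<Item, Float>> associations = new HashMap<>();

        void put(Item a, Item b, float coefficient) {
            associations.computeIfAbsent(a, k -> new HashMap<>()).put(b, coefficient);
        }

        // Returns null when the two items are not associated.
        Float get(Item a, Item b) {
            Map<Item, Float> inner = associations.get(a);
            return inner == null ? null : inner.get(b);
        }
    }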


Here are some ideas:

  1. Store in a Map<Pair<Item,Item>, Float>. If you are worried about allocating a new Pair for each lookup, and your code is synchronized, you can keep a single reusable lookup Pair instance (see the first sketch after this list).

  2. Loosen the outer map to be Map<Item, ?>. The value can be a simple {Item, Float} tuple for the first association, then a small tuple array while the number of associations stays low, and finally a full-fledged Map once it grows.

  3. Use Commons Collections' Flat3Map for the inner maps.

  4. If you are in tight control of the Items, and Item equivalence is referential (i.e. each Item instance is not equal to any other Item instance), you can number each instance. Since you are talking about fewer than 2 billion instances, two int IDs fit into a single long with some bit manipulation. The map then gets much smaller if you use Trove's TLongObjectHashMap (see the second sketch after this list).
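
A minimal sketch of idea 1, assuming Item implements equals/hashCode; PairKeyStore and ItemPair are made-up names:

    import java.util.HashMap;
    import java.util.Map;

    class PairKeyStore {
        // Immutable composite key; equals/hashCode make it usable as a map key.
        static final class ItemPair {
            final Item a, b;
            ItemPair(Item a, Item b) { this.a = a; this.b = b; }

            @Override public boolean equals(Object o) {
                if (!(o instanceof ItemPair)) return false;
                ItemPair p = (ItemPair) o;
                return a.equals(p.a) && b.equals(p.b);
            }

            @Override public int hashCode() {
                return 31 * a.hashCode() + b.hashCode();
            }
        }

        private final Map<ItemPair, Float> associations = new HashMap<>();

        void put(Item a, Item b, float coefficient) {
            associations.put(new ItemPair(a, b), coefficient);
        }

        // Returns null when the two items are not associated.
        Float get(Item a, Item b) {
            return associations.get(new ItemPair(a, b));
        }
    }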
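
And a minimal sketch of idea 4, assuming each Item has already been numbered with a non-negative int ID. Since the values here are plain floats, this sketch substitutes Trove's primitive TLongFloatHashMap for the TLongObjectHashMap mentioned above, which avoids boxing the values as well; PackedKeyStore is a made-up name:

    import gnu.trove.map.hash.TLongFloatHashMap;

    class PackedKeyStore {
        private final TLongFloatHashMap associations = new TLongFloatHashMap();

        // Pack two int IDs into one long:
        // high 32 bits = first item's ID, low 32 bits = second item's ID.
        private static long key(int idA, int idB) {
            return ((long) idA << 32) | (idB & 0xFFFFFFFFL);
        }

        void put(int idA, int idB, float coefficient) {
            associations.put(key(idA, idB), coefficient);
        }

        // Returns Trove's no-entry value (0.0f by default) for absent pairs;
        // check containsKey(key(idA, idB)) first if 0 is a legal coefficient.
        float get(int idA, int idB) {
            return associations.get(key(idA, idB));
        }
    }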


You have two options.

1) Reduce what you're storing.

If your data is calculable, using a WeakHashMap will allow the garbage collector to remove members. You will probably want to decorate it with a mechanism that calculates lost or absent key/value pairs on the fly. This is basically a cache.
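A minimal sketch of that decoration, assuming the coefficients are cheap enough to recalculate on a miss; WeakCache and compute are hypothetical names:

    import java.util.Map;
    import java.util.WeakHashMap;
    import java.util.function.Function;

    // A compute-on-miss cache: once a key is no longer strongly referenced
    // elsewhere, the garbage collector may drop its entry, and the value is
    // simply recalculated on the next lookup.
    class WeakCache<K, V> {
        private final Map<K, V> cache = new WeakHashMap<>();
        private final Function<K, V> compute;

        WeakCache(Function<K, V> compute) {
            this.compute = compute;
        }

        synchronized V get(K key) {
            return cache.computeIfAbsent(key, compute);
        }
    }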

Another possibility that might trim a relatively tiny amount of RAM is to instruct your JVM to use compressed object pointers. That may save you about 3 MB with your current data size.
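Compressed oops are controlled by a standard HotSpot flag (and are on by default on modern 64-bit JVMs for heaps under roughly 32 GB):

    java -XX:+UseCompressedOops -jar app.jar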

2) Expand your capacity.

I'm not sure what your constraint is (run-time memory on a desktop, serialization, etc.), but you can either expand the heap size and deal with it, or you can push the data out of process. With all those "NoSQL" stores out there, one will probably fit your needs. Or an indexed DB table can be quite fast. If you're looking for a simple key-value store, Voldemort is extremely easy to set up and integrate.
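
For the heap route, that is just the standard -Xmx flag, e.g.:

    java -Xmx2g -jar app.jar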

However, I don't know what you're doing with your working set. Can you give more details? Are you performing aggregations, partitioning, cluster analysis, etc.? Where are you running into trouble?
