HashMap in Java, 100 Million entries
I want to store 100 million terms and their frequencies (from a text database) in a HashMap<String, Double>. It gives me an "Out of Memory" error. I tried increasing the heap space to -Xmx15000M, but it then runs for half an hour and throws the same exception again. The file from which I'm trying to read the words and frequencies is 1.7 GB.
Any help would be much appreciated.
Thanks :-)
For word processing like that, the answer is usually a tree (a trie) rather than a hashmap, if you can live with the longer lookup times. That structure is quite memory-efficient for natural languages, where many words share common prefixes.
Depending on the input, a Patricia tree might be even better.
(Also, if these really are words from a natural language, are you sure you need 100,000,000 entries? The number of commonly used words is surprisingly small; commercial solutions (word prediction, spelling correction) rarely use more than 100,000 words regardless of language.)
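To make the idea concrete, here is a minimal sketch of a character trie holding a frequency per word. It is my own illustration (assuming lowercase ASCII keys); a production version would use a more compact node layout, or a Patricia/radix tree as mentioned above, to actually save memory:

```java
// Minimal frequency trie sketch: assumes lowercase ASCII words ('a'-'z').
class FrequencyTrie {
    private static final class Node {
        Node[] children = new Node[26]; // one slot per letter
        double frequency;               // 0.0 means "no word ends here yet"
    }

    private final Node root = new Node();

    void add(String word, double frequency) {
        Node node = root;
        for (int i = 0; i < word.length(); i++) {
            int c = word.charAt(i) - 'a';
            if (node.children[c] == null) {
                node.children[c] = new Node();
            }
            node = node.children[c];
        }
        node.frequency = frequency;
    }

    double get(String word) {
        Node node = root;
        for (int i = 0; i < word.length(); i++) {
            int c = word.charAt(i) - 'a';
            if (node.children[c] == null) {
                return 0.0;
            }
            node = node.children[c];
        }
        return node.frequency;
    }
}
```

Words sharing a prefix ("swim", "swims", "swimming") walk the same initial nodes, which is where the memory saving comes from.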
Your problem is that 1.7 GB of raw text amounts to more than 1500 MB of character data even before the overhead added by the individual String objects (Java stores each character in two bytes). For huge mappings you should use either a database or a file-backed Map; these keep the data on disk instead of on the heap.
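One concrete option for a file-backed Map is MapDB (my suggestion, not part of the original answer). A minimal sketch, assuming the MapDB 3.x DBMaker/Serializer API:

```java
import org.mapdb.DB;
import org.mapdb.DBMaker;
import org.mapdb.Serializer;

import java.util.Map;

public class FileBackedFrequencies {
    public static void main(String[] args) {
        // Keep the map in a file on disk instead of the Java heap.
        DB db = DBMaker.fileDB("frequencies.db").make();
        Map<String, Double> frequencies = db
                .hashMap("terms", Serializer.STRING, Serializer.DOUBLE)
                .createOrOpen();

        frequencies.put("example", 42.0);   // written through to disk
        System.out.println(frequencies.get("example"));

        db.close();
    }
}
```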
Update
I originally doubted that allocating 15 GB for the heap was possible on most JVMs. It won't work with any 32-bit JVM, but a 15 GB heap should work on a 64-bit JVM when enough RAM is available.
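If you want to confirm how much heap the JVM actually granted after setting -Xmx, a quick check (my addition) is:

```java
public class HeapCheck {
    public static void main(String[] args) {
        // Maximum heap the JVM will attempt to use, in MB (reflects -Xmx).
        long maxMb = Runtime.getRuntime().maxMemory() / (1024 * 1024);
        System.out.println("Max heap: " + maxMb + " MB");
    }
}
```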
A 1.7 GB file is a relatively small file for this and can be held in RAM. I do this with much larger files and store them in memory without a problem. A database could be used, but may be overkill or may be perfect, depending on what you plan to do with the data.
As others have said, with natural language there will likely be a relatively small number of unique words, so the map may not actually get that large. I would not use a java.util.HashMap, as it is very inefficient in terms of memory usage, especially when storing primitive values such as ints or doubles: it boxes primitives into objects, and it wraps each entry in a HashMap.Entry object, which wastes further memory. Because of these two factors, java.util.HashMap uses much more memory than alternatives such as Trove, fastutil and others:
- An Overview of memory saving techniques in Java
- Memory consumption of popular Java data types – part 2
As mentioned, there are several map implementations which do not have these problems. Since you are storing numbers in your map, an extra benefit is a performance boost: there is no need to constantly convert between objects and primitives (i.e. boxing/unboxing) as you put new values into the map or update old ones. A benchmark of various primitive hash maps better suited to large amounts of data can be found in this post at the Java Performance Tuning Guide.
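As a sketch of what such a primitive-valued map looks like in use, assuming fastutil is on the classpath (Trove's TObjectDoubleHashMap offers a very similar API):

```java
import it.unimi.dsi.fastutil.objects.Object2DoubleOpenHashMap;

public class TermFrequencies {
    public static void main(String[] args) {
        // Values are stored as unboxed doubles: no Double objects, no Entry wrappers.
        Object2DoubleOpenHashMap<String> frequencies = new Object2DoubleOpenHashMap<>();
        frequencies.defaultReturnValue(0.0);

        // addTo increments the stored value in place, avoiding get-then-put boxing.
        frequencies.addTo("example", 1.0);
        frequencies.addTo("example", 1.0);

        System.out.println(frequencies.getDouble("example")); // 2.0
    }
}
```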
With 100 million terms you are almost certainly over the limit of what should be stored in-memory. Store your terms in a database of some kind. Either use a commercial database, or write something that allows you to access the file to get the information you want. If the file format you have doesn't let you quickly access the file then convert it to one that does - for example make each record a fixed size, so you can instantly calculate the file offset for any record number. Sorting the records will then allow you to do a binary search very quickly. You can also write code to hugely speed up access to the files without needing to store the whole file in memory.
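A minimal sketch of the fixed-size-record idea: the layout below (a 64-byte zero-padded UTF-8 key followed by an 8-byte double) and the class name are my own assumptions, and the file must already be sorted by key:

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class FixedRecordLookup {
    // Assumed layout: 64 bytes for the term (UTF-8, zero-padded) + 8 bytes for the frequency.
    private static final int KEY_BYTES = 64;
    private static final int RECORD_BYTES = KEY_BYTES + Double.BYTES;

    private final RandomAccessFile file;
    private final long recordCount;

    public FixedRecordLookup(String path) throws IOException {
        this.file = new RandomAccessFile(path, "r");
        this.recordCount = file.length() / RECORD_BYTES;
    }

    /** Binary search over the sorted, fixed-size records; returns NaN if the term is absent. */
    public double lookup(String term) throws IOException {
        byte[] wanted = pad(term);
        byte[] key = new byte[KEY_BYTES];
        long lo = 0, hi = recordCount - 1;
        while (lo <= hi) {
            long mid = (lo + hi) >>> 1;
            file.seek(mid * RECORD_BYTES);          // offset computed directly from the record number
            file.readFully(key);
            int cmp = Arrays.compare(key, wanted);
            if (cmp == 0) return file.readDouble(); // the frequency follows the key
            if (cmp < 0) lo = mid + 1; else hi = mid - 1;
        }
        return Double.NaN;
    }

    private static byte[] pad(String term) {
        byte[] out = new byte[KEY_BYTES];
        byte[] raw = term.getBytes(StandardCharsets.UTF_8);
        System.arraycopy(raw, 0, out, 0, Math.min(raw.length, KEY_BYTES));
        return out;
    }
}
```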
If you just want a lightweight key-value (Map) store, I would look into using Redis. It is very fast and can persist the data if needed. The only downside is that you need to run the Redis store on a Linux machine.
If you are limited to Windows, MongoDB is a good option if you can run it in 64-bit mode.
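For illustration, a minimal sketch of counting frequencies against a local Redis instance, assuming the Jedis client; the "freq:" key prefix is just my own convention:

```java
import redis.clients.jedis.Jedis;

public class RedisFrequencies {
    public static void main(String[] args) {
        // Assumes a Redis server running locally on the default port.
        try (Jedis jedis = new Jedis("localhost", 6379)) {
            // INCRBYFLOAT creates the key if missing and adds to it atomically.
            jedis.incrByFloat("freq:example", 1.0);
            jedis.incrByFloat("freq:example", 1.0);

            String value = jedis.get("freq:example");
            System.out.println("example -> " + value); // "2"
        }
    }
}
```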
You could also try stemming to increase the number of duplicates.
For instance, cat = Cats = cats = Cat
or
swim = swimming = swims
Try googling "Porter Stemmer".
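A deliberately naive normalization sketch to show the effect; this is not the Porter algorithm (a real solution would use a proper Porter stemmer implementation, for example the one shipped with Lucene), and the suffix rules below are only illustrative:

```java
public class NaiveStemmer {
    // Collapses trivial variants such as "Cats"/"cats"/"cat" onto one key.
    static String normalize(String word) {
        String w = word.toLowerCase();
        if (w.endsWith("ing") && w.length() > 5) {
            w = w.substring(0, w.length() - 3);   // swimming -> swimm (a real stemmer also fixes the doubled consonant)
        } else if (w.endsWith("s") && w.length() > 3) {
            w = w.substring(0, w.length() - 1);   // cats -> cat
        }
        return w;
    }

    public static void main(String[] args) {
        System.out.println(normalize("Cats"));     // cat
        System.out.println(normalize("swimming")); // swimm
    }
}
```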
Trove's THashMap uses a lot less memory. Still, I doubt that would be enough of a reduction in size. You need somewhere other than main memory to store this information for retrieval.
Other answers have already pointed out that the problem lies with memory usage. Depending on your problem domain, you could design a key class that reduces the overall memory footprint. For example, if your keys consist of natural-language phrases, you could split each phrase into words and intern them; e.g.
```java
import java.util.Arrays;

public class Phrase {
    private final String[] interned;

    public Phrase(String phrase) {
        // Split on whitespace and intern each word so identical words share one String instance.
        String[] tmp = phrase.split("\\s+");
        this.interned = new String[tmp.length];
        for (int i = 0; i < tmp.length; ++i) {
            this.interned[i] = tmp[i].intern();
        }
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof Phrase && Arrays.equals(interned, ((Phrase) o).interned);
    }

    @Override
    public int hashCode() {
        return Arrays.hashCode(interned);
    }
}
```
In fact this solution might work even if the Strings do not represent natural language, provided there is significant overlap between the Strings that can be exploited.
Drop the HashMap and load all that data into HBase or one of the other NoSQL datastores, and write your queries as MapReduce operations. This is the approach taken by Google Search and many other sites dealing with huge amounts of data, and it has proven to scale to essentially arbitrary size.
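For a flavour of what "queries as MapReduce operations" look like, here is a minimal Hadoop-style sketch that sums frequencies per term. The tab-separated "term, frequency" input format is my assumption; an HBase-backed job would use its table input format instead:

```java
import java.io.IOException;

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class TermFrequencyJob {
    /** Emits (term, frequency) for every input line of the form "term<TAB>frequency". */
    public static class FrequencyMapper
            extends Mapper<LongWritable, Text, Text, DoubleWritable> {
        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            String[] parts = line.toString().split("\t");
            if (parts.length == 2) {
                context.write(new Text(parts[0]),
                              new DoubleWritable(Double.parseDouble(parts[1])));
            }
        }
    }

    /** Sums all frequencies seen for the same term. */
    public static class SumReducer
            extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
        @Override
        protected void reduce(Text term, Iterable<DoubleWritable> values, Context context)
                throws IOException, InterruptedException {
            double total = 0.0;
            for (DoubleWritable value : values) {
                total += value.get();
            }
            context.write(term, new DoubleWritable(total));
        }
    }
}
```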
Consider replacing it with a cdb (constant database). A cdb can hold up to 4 GB, and:
A successful lookup in a large database normally takes just two disk accesses. An unsuccessful lookup takes only one.
It's a bad design to keep 1.7 GB of data in memory in a HashMap. I would do one of two things:
- Persist all the data (to a file or database) and keep only the top 1% or so in memory, with some algorithm deciding which entries are in memory and when.
- Use memcached. This is the easiest way out: an in-memory, distributed hash table. It is exactly what DHTs are used for.
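A minimal sketch against a local memcached instance, assuming the spymemcached client; the key prefix and the zero expiry are my own choices:

```java
import java.net.InetSocketAddress;

import net.spy.memcached.MemcachedClient;

public class MemcachedFrequencies {
    public static void main(String[] args) throws Exception {
        // Assumes memcached is running locally on the default port.
        MemcachedClient client = new MemcachedClient(new InetSocketAddress("localhost", 11211));

        // Store and read back one term's frequency (expiry 0 = never expire).
        client.set("freq:example", 0, 42.0).get();
        Double frequency = (Double) client.get("freq:example");
        System.out.println("example -> " + frequency);

        client.shutdown();
    }
}
```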
There is an interesting offering from Terracotta, BigMemory, which seems to be exactly what you want. I haven't tried it myself, though, and I don't know the licensing terms etc.
Back of the envelope: 1.7 GB / 100 M = about 18 bytes on average per term and frequency.
We can use a hand-coded hashmap backed by two logical arrays.
One holds the int frequencies (the values); the other is a single flat char array used to simulate a C-style two-dimensional array (an array of fixed-size char arrays), so we index into it by calculation. We cannot use a Java two-dimensional array, since it comes with too much object overhead. The flat array holds fixed-size slots representing the keys. We compute the hash of a key to choose its slot in this "two-dimensional array", and resolve collisions with, say, linear probing. A key and its value are tied together by sharing the same index into the two arrays.
The hashmap has to use open addressing, since we do not have enough memory for chaining.
We could have, say, 10 instances of this hashmap, partitioned by key length; I cannot be certain, since I don't know the characteristics of the data.
Space used = 2^29 bytes for the int array + (16 bytes per key * 2^27 keys = 2^31 bytes), i.e. about 0.5 GB + 2 GB = 2.5 GB.
If we want double frequencies instead of ints, we may need to reduce the size of the key strings accordingly. A rough sketch of this structure follows.
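A rough sketch of that structure under my own simplifying assumptions: keys of at most 16 ASCII characters stored in one flat byte array (rather than a char array), int frequencies, a power-of-two capacity, and no resizing or deletion:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Open-addressing map from short strings to int frequencies, backed by two flat arrays. */
public class CompactFrequencyMap {
    private static final int KEY_BYTES = 16;          // fixed key slot size (assumption)

    private final int capacity;                       // must be a power of two
    private final byte[] keys;                        // capacity * KEY_BYTES, simulated 2-D array
    private final int[] frequencies;                  // value for the key at the same index
    private final boolean[] used;                     // slot occupancy

    public CompactFrequencyMap(int capacity) {
        this.capacity = capacity;
        this.keys = new byte[capacity * KEY_BYTES];
        this.frequencies = new int[capacity];
        this.used = new boolean[capacity];
    }

    public void put(String term, int frequency) {
        byte[] key = encode(term);
        int i = indexOf(key);
        if (!used[i]) {
            used[i] = true;
            System.arraycopy(key, 0, keys, i * KEY_BYTES, KEY_BYTES);
        }
        frequencies[i] = frequency;
    }

    public int get(String term) {
        byte[] key = encode(term);
        int i = indexOf(key);
        return used[i] ? frequencies[i] : 0;
    }

    /** Linear probing: walk from the hash slot until we find the key or an empty slot. */
    private int indexOf(byte[] key) {
        int i = (Arrays.hashCode(key) & 0x7fffffff) & (capacity - 1);
        while (used[i] && !slotMatches(i, key)) {
            i = (i + 1) & (capacity - 1);
        }
        return i;
    }

    private boolean slotMatches(int slot, byte[] key) {
        int from = slot * KEY_BYTES;
        return Arrays.equals(keys, from, from + KEY_BYTES, key, 0, KEY_BYTES);
    }

    private static byte[] encode(String term) {
        byte[] out = new byte[KEY_BYTES];             // zero-padded, truncated if longer
        byte[] raw = term.getBytes(StandardCharsets.US_ASCII);
        System.arraycopy(raw, 0, out, 0, Math.min(raw.length, KEY_BYTES));
        return out;
    }
}
```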
In Java, an object has an overhead of at least 16 bytes before you consider what content it actually holds.
1e8 entries in a hash map therefore require, as an underestimate, 1e8 * 2 * 16 bytes (about 3.2 GB), and that assumes your keys and values are simple Number objects; so you need several GB of heap available, and your machine must have that much RAM to give.
A String is an object holding a character array, so your String keys, as others have mentioned, may each be larger than a Double, which means you need even more heap.
Note also that programs begin to perform poorly as you approach your machine's memory limit.
If you do not want to use a database as suggested above, you could consider encoding and compressing your keys into numbers whose frequencies you can still count. You could then choose an entropy-based encoding based on the frequency of words produced by that first encoding, and go from there...
As to why it failed, I agree with the answers above.
A database is a good choice, but even commercial-grade databases would suggest partitioning the data to work with it effectively.
Depending on your environment, I might suggest distributing your data across multiple nodes connected over a LAN, partitioned by key value:
Node 01 has keys starting with 'a', Node 02 has keys starting with 'b', and so on.
Your program then suddenly turns into network programming...