Creating a very, very large Map in Java

Using Java, I would like to create a Map that can grow and grow and potentially become larger than the available memory. Obviously, with a standard in-memory HashMap we're going to run out of memory and the JVM will crash. So I was thinking along the lines of a Map that, when it becomes aware of memory running low, can write its current contents to disk.

Has anyone implemented anything like this or knows of any existing solutions out there?

What I'm trying to do is read a very large ASCII file (say 50 GB) a line at a time. Each line contains a key and a value. Keys can be duplicated in the file. I'll then store each line in a Map, which maps each key to a List of values. This Map is the object that will just grow and grow.

Any advice greatly appreciated.

Phil

Update:

Thanks for all the comments and advice, everyone. For the problem as I described it, a database is the correct, scalable solution. I should have stated that this is a temporary Map that needs to be created and used for a short period of time to aid in the parsing of a file. In this case, Michael's suggestion to "store only the line number instead of the actual value" is the most appropriate. Marking Michael's answer(s) as the recommended solution.
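
As a rough illustration of that suggestion, here is a minimal sketch that keeps only positions in memory and reads values back from the file on demand. It is not Michael's code: `LineOffsetIndex`, `build` and `readValue` are hypothetical names, the tab separator is an assumption about the file format, and the sketch stores byte offsets rather than line numbers so that a line can be reached with a single seek.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.RandomAccessFile;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not from the original thread.
class LineOffsetIndex {
    // Assumption: key and value are separated by a tab.
    static final char SEPARATOR = '\t';

    /** Maps each key to the byte offsets at which its lines start. */
    static Map<String, List<Long>> build(Path file) throws IOException {
        Map<String, List<Long>> index = new HashMap<>();
        try (InputStream in = new BufferedInputStream(Files.newInputStream(file))) {
            StringBuilder key = new StringBuilder();
            boolean inKey = true; // still reading the key part of the line
            long lineStart = 0;   // offset of the first byte of the current line
            long pos = 0;         // number of bytes consumed so far
            int b;
            while ((b = in.read()) != -1) {
                pos++;
                if (b == '\n') {
                    record(index, key, lineStart);
                    inKey = true;
                    lineStart = pos;
                } else if (inKey) {
                    if (b == SEPARATOR) {
                        inKey = false;
                    } else if (b != '\r') {
                        key.append((char) b); // ASCII: one byte is one char
                    }
                }
            }
            record(index, key, lineStart); // last line may lack a trailing newline
        }
        return index;
    }

    // Stores the offset under the key (if any) and resets the key buffer.
    private static void record(Map<String, List<Long>> index, StringBuilder key, long offset) {
        if (key.length() > 0) {
            index.computeIfAbsent(key.toString(), k -> new ArrayList<>()).add(offset);
        }
        key.setLength(0);
    }

    /** Seeks to a stored offset and returns the value part of that line. */
    static String readValue(RandomAccessFile file, long offset) throws IOException {
        file.seek(offset);
        String line = file.readLine();
        int sep = line.indexOf(SEPARATOR);
        return sep < 0 ? "" : line.substring(sep + 1);
    }
}
```

The index still holds every distinct key plus one boxed `Long` per line, so it only helps when the values are much larger than the offsets; a primitive `long[]` per key would shrink it further.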


I think you are looking for a database.


A NoSQL database will probably be easy to set up, and it is more akin to a map. Check BerkeleyDB Java Edition, now from Oracle. It has a map-like interface and can be embedded, so no complex setup is needed.


Sounds like a case for dumping your huge file into a database.

Well, I had a similar situation. In my case everything was in TXT format, and every line in the file had the same format. So what I did was split the file into several pieces (each as large as my JVM could process). Then I processed the files one by one.

Alternatively, you can load your data directly into a database.
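
The splitting approach above could look roughly like the sketch below. One deliberate difference: instead of cutting the file into consecutive chunks, it assigns each line to a piece by the hash of its key, so that all values for a given key end up in the same piece and each piece can be grouped in memory on its own. `KeyPartitioner`, `split` and `load` are hypothetical names, and the tab separator is an assumption about the file format.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not from the original thread.
class KeyPartitioner {
    // Assumption: key and value are separated by a tab.
    static final char SEPARATOR = '\t';
    // ISO-8859-1 is a superset of ASCII and never fails on a stray byte.
    static final Charset CHARSET = StandardCharsets.ISO_8859_1;

    /** Writes each line of input to one of `buckets` files, chosen by the key's hash. */
    static List<Path> split(Path input, Path dir, int buckets) throws IOException {
        List<Path> paths = new ArrayList<>();
        List<BufferedWriter> writers = new ArrayList<>();
        try (BufferedReader reader = Files.newBufferedReader(input, CHARSET)) {
            for (int i = 0; i < buckets; i++) {
                Path p = dir.resolve("bucket-" + i + ".txt");
                paths.add(p);
                writers.add(Files.newBufferedWriter(p, CHARSET));
            }
            String line;
            while ((line = reader.readLine()) != null) {
                int sep = line.indexOf(SEPARATOR);
                if (sep < 0) {
                    continue; // skip lines without a key/value separator
                }
                String key = line.substring(0, sep);
                // floorMod keeps the bucket number non-negative for negative hashes
                BufferedWriter w = writers.get(Math.floorMod(key.hashCode(), buckets));
                w.write(line);
                w.newLine();
            }
        } finally {
            for (BufferedWriter w : writers) {
                w.close();
            }
        }
        return paths;
    }

    /** Loads one bucket into an ordinary in-memory map of key to values. */
    static Map<String, List<String>> load(Path bucket) throws IOException {
        Map<String, List<String>> map = new HashMap<>();
        try (BufferedReader reader = Files.newBufferedReader(bucket, CHARSET)) {
            String line;
            while ((line = reader.readLine()) != null) {
                int sep = line.indexOf(SEPARATOR);
                map.computeIfAbsent(line.substring(0, sep), k -> new ArrayList<>())
                   .add(line.substring(sep + 1));
            }
        }
        return map;
    }
}
```

The number of buckets has to be chosen so that the largest bucket fits in the heap; each bucket is then loaded, processed and discarded before the next one is read.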


Seriously, choose a simple database as advised. It's not overhead: you don't have to use JPA or whatnot, just plain JDBC with native SQL. Derby or HSQL, for example, can run in embedded mode, with no need to define users or access rights, or to start the server separately.

The "overhead" will stab you in the back when you've plodded far into the hash map solution and it turns out that you need yet another optimization to avoid the OutOfMemoryError, or the file is not 50 GB, but 75... Really, don't go there.


If you're just wanting to build up the map for data processing (rather than random access in response to requests), then MapReduce may be what you want, with no need to work with a database.

Edit: Note that although many MapReduce introductions focus on the ability to run many nodes, you should still get benefit from sidestepping the requirement to hold all the data in memory on one machine.


How much memory do you have? Unless you have enough memory to keep most of the data in memory, it's going to be so slow that it may as well have failed. A program which is heavily paging can be 1000x slower or more. Some PCs have 16-24 GB, and you might consider getting more memory.

Let's assume there are enough duplicates that you can keep most of the data in memory. I suggest you use a byte-based String class of your own making, since you have ASCII data, and store your values as another of these "String" types (with a separator). You may find you can keep the working data set in memory.
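
A byte-based key type along those lines might look like the following minimal sketch. `AsciiKey` and `AsciiMultiMap` are hypothetical names, and using a newline as the value separator is an assumption (it is safe here only because values come from single lines). Note that since Java 9 the JVM's compact strings already store Latin-1 text at one byte per character, so the saving mainly applies to older JVMs and to packing many values into one array.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical byte-based replacement for String keys (ASCII data only).
final class AsciiKey {
    private final byte[] bytes;
    private final int hash; // cached, since keys are immutable

    AsciiKey(String s) {
        this.bytes = s.getBytes(StandardCharsets.US_ASCII);
        this.hash = Arrays.hashCode(bytes);
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof AsciiKey && Arrays.equals(bytes, ((AsciiKey) o).bytes);
    }

    @Override
    public int hashCode() {
        return hash;
    }

    @Override
    public String toString() {
        return new String(bytes, StandardCharsets.US_ASCII);
    }
}

// Hypothetical map that packs all values of a key into one byte array.
class AsciiMultiMap {
    // Assumption: values never contain a newline, so it can separate them.
    private static final byte VALUE_SEPARATOR = '\n';
    private final Map<AsciiKey, byte[]> map = new HashMap<>();

    void add(String key, String value) {
        byte[] added = value.getBytes(StandardCharsets.US_ASCII);
        // On a duplicate key, append separator + new value to the packed array.
        map.merge(new AsciiKey(key), added, (old, extra) -> {
            byte[] joined = Arrays.copyOf(old, old.length + 1 + extra.length);
            joined[old.length] = VALUE_SEPARATOR;
            System.arraycopy(extra, 0, joined, old.length + 1, extra.length);
            return joined;
        });
    }

    List<String> get(String key) {
        byte[] packed = map.get(new AsciiKey(key));
        if (packed == null) {
            return new ArrayList<>();
        }
        String all = new String(packed, StandardCharsets.US_ASCII);
        return Arrays.asList(all.split("\n", -1)); // -1 keeps empty values
    }
}
```

Re-copying the packed array on every duplicate is quadratic for very hot keys; a growable buffer per key would avoid that at the cost of some slack space.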


I use BerkeleyDB for this, though it is more complicated than a Map (they do have a Map wrapper, which I don't really recommend for anything but simple applications).

http://www.oracle.com/technetwork/database/berkeleydb/overview/index.html

It is also available in Maven http://www.oracle.com/technetwork/database/berkeleydb/downloads/maven-087630.html

  <dependencies>
    <dependency>
      <groupId>com.sleepycat</groupId>
      <artifactId>je</artifactId>
      <version>3.3.75</version>
    </dependency>
  </dependencies>

  <repositories>
    <repository>
      <id>oracleReleases</id>
      <name>Oracle Released Java Packages</name>
      <url>http://download.oracle.com/maven</url>
      <layout>default</layout>
    </repository>
  </repositories>

It also has the disadvantage of vendor lock-in (i.e. you are forced to use this tool, though there may be Map wrappers for other databases).

So just choose according to your needs.


Most cache APIs work like maps and support overflow to disk. Ehcache, for example, supports that. Or follow this tutorial for Guava.
