How to reduce memory usage for a HashMap<String, Integer> like data structure

Dorukhan Arslan :

Before starting to explain my problem, I should mention that I am not looking for a way to increase Java heap memory. I should strictly store these objects.

I am working on storing huge number (5-10 GB) of DNA sequences and their counts (Integer) in a hash table. The DNA sequences (with length 32 or less) consists of 'A', 'C', 'G', 'T', and 'N' (undefined) chars. As we know, when storing a large number of objects in memory, Java has poor space efficiency compared to lower level languages like C and C++. Thus, if I store this sequence as string (it holds about 100 MB memory for a sequence with length ~30), I see the error.

I tried to represent nucleic acids as 'A'=00, 'C'=01, 'G'=10, 'T'=11 and neglect 'N' (because it ruins the char to 2-bit transform as the 5-th acid). Then, concatenate these 2-bit acids into byte array. It brought some improvement but unfortunately I see the error after a couple of hours again. I need a convenient solution or at least a workaround to handle this error. Thank you in advance.

Oleg Sklyar :

Being fairly complex maybe this here is a weird idea, and would require quite a lot of work, but this is what I would try:

You already pointed out two individual subproblems of your overall task:

the default HashMap implementation may be suboptimal for such large collection sizes
you need to store something else than strings

The map implementation

I would recommend to write a highly tailored hash map implementation for the Map<String, Long> interface. Internally you do not have to store strings. Unfortunately 5^32 > 2^64, so there is no way to pack your whole string into a single long, well, let's stick to two longs for a key. You can make string to/back long[2] conversion fairly efficiently on the fly when providing a string key to your map implementation (use bit shifts etc).

As for packing the values, here are some considerations:

for a key-value pair a standard hashmap will need to have an array of N longs for buckets, where N is the current capacity, when the bucket is found from the hash key it will need to have a linked list of key-value pairs to resolve keys that produce identical hash codes. For your specific case you could try to optimize it in the following way:

use a long[] of size 3N where N is the capacity to store both keys and values in a continuous array
in this array, at locations 3 * (hashcode % N) and 3 * (hashcode % N) + 1 you store the long[2] representation of the key, of the first key that matches this bucket or of the only one (on insertion, zero otherwise), at location 3 * (hashcode % N) + 2 you store the corresponding count
for all those cases where a different key results in the same hash code and thus the same bucket, your store the data in a standard HashMap<Long2KeyWrapper, Long>. The idea is to keep the capacity of the array mentioned above (and resize correspondingly) large enough to have by far the largest part of the data in that contiguous array and not in the fallback hash map. This will dramatically reduce the storage overhead of the hashmap
do not expand the capacity in N=2N iterations, make smaller growth steps, e.g. 10-20%. this will cost performance on populating the map, but will keep your memory footprint under control

The keys

Given the inequality 5^32 > 2^64 your idea to use bits to encode 5 letters seems to be the best I can think of right now. Use 3 bits and correspondingly long[2].