Improving performance of counting word frequencies with a HashMap
For one of my applications, the following function has to be called very often. This function takes up a lot of CPU, and hence I am wondering if you know how to improve its performance.
The code counts the occurrences of a combination of four characters. During testing, I found that the number of entries in the map is around 100. The length of the text is in the range of 100 to 800. The initial size of 200 is a guess, and the code seems to run faster than without specifying an initial size. It is probably not the optimal value, though.
private Map<String, Integer> getTetagramCount(final String text) {
    final Map<String, Integer> cipherTetagrams = new HashMap<String, Integer>(200);
    for (int i = 0; i < text.length() - 4; i++) {
        final String tet = text.substring(i, i + 4);
        final Integer count = cipherTetagrams.get(tet);
        if (count != null) {
            cipherTetagrams.put(tet, count + 1);
        } else {
            cipherTetagrams.put(tet, 1);
        }
    }
    return cipherTetagrams;
}
I do a lot of work in NLP and machine learning, so I have to do this sort of thing all the time, and there are a ton of opportunities for optimization.
A few points to consider:
First of all, you're getting killed by the standard JDK HashMap class. It's a nice container for general-purpose computing, but it's terrible for high-performance computing. For each entry in your collection (a four-character string (8 bytes) and an integer (4 bytes)), the standard Java HashMap will consume:
- A String object
  - 8-byte object overhead
  - 4-byte reference to the character array
  - 4-byte string-length field
- A character array
  - 8-byte object overhead
  - 2 bytes per character (times 4 characters) = 8 bytes
  - 4-byte array-length field
- An Integer object
  - 8-byte object overhead
  - 4-byte int value
- A HashMap.Entry object
  - 8-byte object overhead
  - 4-byte key reference
  - 4-byte value reference
So your tiny little 12 bytes of data becomes 64 bytes. And that's before the HashMap has allocated an array of hash values to use during its lookup operations. Keep in mind that all those tiny little objects mean more work for the GC, but more importantly, it means that your objects span a larger swath of main memory and are less likely to fit within the CPU cache. When you have lots of cache misses, you lose performance.
NOTE: A commenter reminded me that all the substrings will share the same underlying character array, which is a good point that I'd forgotten about. But still, that means each map entry goes from being 64 bytes down to 44 bytes. Which is still a shame, when there should only be 12 bytes.
Boxing and unboxing all those integer values causes your code to run more slowly and to consume more memory. In most cases, we don't really care about that, and the vanilla HashMap implementation is fine, even with its mandatory boxing and greedy memory consumption. But in your case, if this code is run within a tight inner loop, we'd rather have a specialized class that knows its values will always be integers and eliminates the need for boxing.
If you dig down into the JDK source code, you'll see that your code ends up invoking the string's hashCode() and equals() methods twice: once for the map.get() and once for the map.put(). But there's a different kind of collection, called a HashBag, that can perform lookup, insertion, and count incrementation with only one lookup. A "bag" collection is kind of like a "set", except that it can contain duplicates, and it keeps track of how many duplicates there are. For each of your tetragrams, you'd just call bag.add(tetragram) without having to first retrieve and update a count. Unfortunately, there are no bag implementations in the JDK, so you'll need to find one elsewhere, or write one yourself.

Luckily, your tetragrams can be losslessly encoded as long values (since each Java character is 2 bytes wide, and a long gives you eight bytes to work with). You can therefore iterate through the character array and convert each tetragram to a long, avoiding all the overhead of constructing so many tiny strings. Then you can keep your results in a LongIntHashMap (from the Trove library). This will be much faster than your current implementation, because you avoid creating all those tiny string objects.

Although the Trove LongIntHashMap is quite excellent, it isn't quite as good as a LongHashBag would be. There's no equals() invocation (since longs can be compared with the == operator), but you'll still pay the price of two hashCode() invocations. If you want to get really aggressive with optimization, you could look at the source code of the LongIntHashMap and figure out how to modify it into a LongHashBag. It's not that difficult, and ultimately that's exactly what I've done in my own code.
Update 1:
Okay, here's a little bit of code:
private LongHashBag countTetragrams(String text) {
    // Homework assignment: find a good LongHashBag implementation, or
    // grab the LongIntHashMap implementation from Trove, and tweak it
    // to work as a Bag
    LongHashBag bag = new LongHashBag(500);

    // There are no tetragrams in this string.
    if (text.length() < 4) return bag;

    // Shortcut: if we calculate the first tetragram before entering
    // the loop, then we can use bit-shifting logic within the loop
    // to create all subsequent tetragram values.
    char[] c = text.toCharArray();
    long tetragram = (((long) c[0]) << 48) |
                     (((long) c[1]) << 32) |
                     (((long) c[2]) << 16) |
                     ((long) c[3]);
    bag.add(tetragram);

    for (int i = 4, last = text.length(); i < last; i++) {
        // During each loop iteration, the leftmost 2 bytes are shifted
        // out of the tetragram, to make room for the 2 bytes of the
        // current character.
        tetragram = (tetragram << 16) | ((long) c[i]);
        bag.add(tetragram);
    }
    return bag;
}
Update 2:
I just did some testing of various solutions, and I was able to get about a 25% performance improvement using the LongHashBag approach instead of the standard HashMap approach.
However, I was able to get a 300% improvement by recycling the resultant objects. Basically, instead of doing this:
private LongHashBag countTetragrams(String text) {
    // Creates a new HashBag on every invocation. Very wasteful.
    LongHashBag bag = new LongHashBag(500);
    // ...blah blah blah...
    return bag;
}
...I'm now doing this...
private void countTetragrams(String text, LongHashBag bag) {
    // Return the bag to a neutral state before reuse.
    bag.clear();
    // ...blah blah blah...
}
The calling code is responsible for creating the LongHashBag object and ensuring that we're done with it by the time we call the count method again.
But it would also work to do this...
private LongHashBag countTetragrams(String text) {
    // Retrieve a recycled bag from an object pool instead of
    // allocating a fresh one.
    LongHashBag bag = retrieveLongHashBagFromObjectPool();
    // ...blah blah blah...
    return bag;
}
... which will add a little bit of overhead for maintaining the pool. And the calling code would have to remember to put the bag back into the pool when it finishes using it. But the performance benefits could definitely be worth it.
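The pool bookkeeping described here can be sketched with a plain ArrayDeque; ObjectPool, borrow, and release are names invented for this example, and StringBuilder stands in for LongHashBag so the sketch stays self-contained (single-threaded use assumed).

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.Supplier;

// Minimal sketch of an object pool: objects are borrowed, used,
// reset, and returned instead of being reallocated on every call.
class ObjectPool<T> {
    private final Deque<T> free = new ArrayDeque<T>();
    private final Supplier<T> factory;

    ObjectPool(Supplier<T> factory) {
        this.factory = factory;
    }

    // Reuse a pooled instance if one is available; otherwise create one.
    T borrow() {
        T obj = free.pollFirst();
        return obj != null ? obj : factory.get();
    }

    // Caller must reset the object to a neutral state before releasing.
    void release(T obj) {
        free.addFirst(obj);
    }

    public static void main(String[] args) {
        ObjectPool<StringBuilder> pool =
                new ObjectPool<StringBuilder>(StringBuilder::new);
        StringBuilder sb = pool.borrow();
        sb.append("hello");
        sb.setLength(0); // reset before returning to the pool
        pool.release(sb);
        // The next borrow reuses the same instance instead of allocating.
        System.out.println(sb == pool.borrow()); // true
    }
}
```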
Incidentally, these are exactly the kinds of tricks I use every day. Object pooling has become one of my most reliable tricks for improving performance: as noted above, recycling those objects yields a 300% performance improvement.
You might try implementing a prefix tree (trie) as a data structure, particularly if you know the range of the characters. It would be at most 4 levels deep, giving you constant-time lookups (with potentially a smaller constant factor than hashing). How this performs compared to the HashMap really depends on the data you have.
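As a sketch of the trie idea, assuming an alphabet of only 'A'..'Z' (an assumption made just for this example), a fixed-depth 4-level trie with lazily allocated branches might look like:

```java
// Sketch of a fixed-depth trie for counting 4-character strings over
// an assumed alphabet of 'A'..'Z'. Each of the first three levels is
// a 26-way branch; the leaves are int[26] count arrays indexed by the
// fourth character. Branches are allocated lazily, so sparse data
// stays small.
class TetragramTrie {
    private static final int R = 26; // alphabet size (assumed A-Z)
    private final Object[] root = new Object[R];

    // Count the tetragram starting at position pos in text.
    void add(String text, int pos) {
        Object[] node = root;
        for (int level = 0; level < 2; level++) {
            int c = text.charAt(pos + level) - 'A';
            if (node[c] == null) node[c] = new Object[R];
            node = (Object[]) node[c];
        }
        int c2 = text.charAt(pos + 2) - 'A';
        if (node[c2] == null) node[c2] = new int[R];
        ((int[]) node[c2])[text.charAt(pos + 3) - 'A']++;
    }

    int count(String tet) {
        Object[] node = root;
        for (int level = 0; level < 2; level++) {
            Object child = node[tet.charAt(level) - 'A'];
            if (child == null) return 0;
            node = (Object[]) child;
        }
        Object leaf = node[tet.charAt(2) - 'A'];
        if (leaf == null) return 0;
        return ((int[]) leaf)[tet.charAt(3) - 'A'];
    }

    public static void main(String[] args) {
        TetragramTrie trie = new TetragramTrie();
        String text = "ABABAB";
        for (int i = 0; i <= text.length() - 4; i++) {
            trie.add(text, i);
        }
        System.out.println(trie.count("ABAB")); // 2
        System.out.println(trie.count("BABA")); // 1
        System.out.println(trie.count("ZZZZ")); // 0
    }
}
```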
Edit
Alternatively, again if you know the range of the characters, you could just stuff them into a much faster data type.
Since you know all of your characters are between A and Z or 0 and 9, you can compact that into 6 bits each:
public int index(String str, int startPos) {
    return ((str.charAt(startPos + 3) - '0') << 18) +
           ((str.charAt(startPos + 2) - '0') << 12) +
           ((str.charAt(startPos + 1) - '0') << 6) +
           (str.charAt(startPos) - '0');
}

// ...
// With 6 bits per character, packed indices range up to 2^24
// ('Z' - '0' = 42 is the largest per-character value), so the
// array needs 2^24 entries.
int[] counts = new int[1 << 24];
final int max = text.length() - 4;
for (int i = 0; i <= max; i++) {
    counts[index(text, i)]++;
}
Edit: updated the example above to cover A-Z, 0-9. Now note two things: first, you have to create a big array, but you don't need to do that every time (you do have to clear it every time, though). Second, this provides really fast lookup of the number of occurrences of a certain word, but if you want to iterate over all the words (say, to find all words that actually occurred), that takes time proportional to the full array size, not just the number of distinct words.
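To enumerate the words afterwards, each occupied index can be unpacked 6 bits at a time, mirroring the packing in index() above. The decode helper and the sample text are illustrations, and the array is sized 2^24 to match the 6-bit shifts.

```java
// Sketch of scanning the flat counts array and decoding each packed
// index back into its four characters (6 bits per character, offset
// from '0', mirroring the index() encoding).
class PackedCounts {
    static int index(String str, int startPos) {
        return ((str.charAt(startPos + 3) - '0') << 18) +
               ((str.charAt(startPos + 2) - '0') << 12) +
               ((str.charAt(startPos + 1) - '0') << 6) +
               (str.charAt(startPos) - '0');
    }

    // Inverse of index(): pull out each 6-bit field and add '0' back.
    static String decode(int idx) {
        char[] c = new char[4];
        for (int pos = 0; pos < 4; pos++) {
            c[pos] = (char) ('0' + ((idx >> (6 * pos)) & 0x3F));
        }
        return new String(c);
    }

    public static void main(String[] args) {
        int[] counts = new int[1 << 24]; // 6 bits x 4 chars = 24 bits
        String text = "CODE9CODE";
        for (int i = 0; i <= text.length() - 4; i++) {
            counts[index(text, i)]++;
        }
        // This full scan is the expensive iteration the text mentions:
        // its cost is the array size, not the number of distinct words.
        for (int idx = 0; idx < counts.length; idx++) {
            if (counts[idx] > 0) {
                System.out.println(decode(idx) + " -> " + counts[idx]);
            }
        }
    }
}
```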
Well, one potential option is to change from using an immutable wrapper type to a mutable one:
public final class Counter {
    private int value;

    public int getValue() {
        return value;
    }

    public void increment() {
        value++;
    }
}
Then change your code to:
private Map<String, Counter> getTetagramCount(final String text) {
    final Map<String, Counter> cipherTetagrams = new HashMap<String, Counter>(200);
    // Micro-optimization (may well not help) - only take the
    // length and subtract 4 once
    final int lastStart = text.length() - 4;
    for (int i = 0; i <= lastStart; i++) {
        final String tet = text.substring(i, i + 4);
        Counter counter = cipherTetagrams.get(tet);
        if (counter == null) {
            counter = new Counter();
            cipherTetagrams.put(tet, counter);
        }
        counter.increment();
    }
    return cipherTetagrams;
}
That way you only ever "put" a value associated with a word once... after that you increment it in-place.
(You could potentially use AtomicInteger instead of Counter if you wanted to use a built-in type.)
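A sketch of the same loop with the built-in AtomicInteger standing in for the hand-written Counter (single-threaded use assumed; its atomicity is unnecessary here, just convenient):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of the counting loop using AtomicInteger as the mutable
// counter: one map lookup per tetragram, then an in-place increment.
class AtomicCounterExample {
    static Map<String, AtomicInteger> getTetagramCount(final String text) {
        final Map<String, AtomicInteger> map =
                new HashMap<String, AtomicInteger>();
        for (int i = 0; i <= text.length() - 4; i++) {
            final String tet = text.substring(i, i + 4);
            AtomicInteger counter = map.get(tet);
            if (counter == null) {
                counter = new AtomicInteger();
                map.put(tet, counter);
            }
            counter.incrementAndGet();
        }
        return map;
    }

    public static void main(String[] args) {
        Map<String, AtomicInteger> counts = getTetagramCount("ABABAB");
        System.out.println(counts.get("ABAB").get()); // 2
        System.out.println(counts.get("BABA").get()); // 1
    }
}
```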
Big-O optimization aside (if there is any to be had), there is a very simple way to greatly speed up your application: use something other than the default Java APIs, which are very slow when it comes to dealing with a lot of data.
Replace:
Map<String, Counter>
With Trove's (which means you have to add the Trove jar to your project):
TObjectIntHashMap<String>
And:
final Integer count = cipherTetagrams.get(tet);
with:
final int count = cipherTetagrams.get(tet);
Because when you work with a lot of data, using wrappers like Integer instead of primitives (like int) and using the default Java API is the surest way to shoot yourself in the foot.
I haven't even begun to analyze your code, but I noticed that this method does not operate on any member fields and can be made static. Static invocations can be slightly cheaper than instance calls, though the JIT usually makes the difference negligible. I'll look for more issues in a minute...
I'm not sure if this would be faster, but I have a feeling it would be.
private Map<String, Integer> getTetagramCount(final String text) {
    final List<String> list = new ArrayList<String>();
    for (int i = 0; i <= text.length() - 4; i++) {
        list.add(text.substring(i, i + 4));
    }
    Collections.sort(list);
    final Map<String, Integer> map = new HashMap<String, Integer>(list.size());
    String last = null;
    int count = 0;
    for (final String tetagram : list) {
        if (!tetagram.equals(last) && last != null) {
            map.put(last, count);
            count = 0;
        }
        count++;
        last = tetagram;
    }
    if (last != null) {
        map.put(last, count);
    }
    return map;
}
Depending on what you are doing with the Map when you are done, you may not need the conversion to a Map at the end.