
Why might a System.String object not cache its hash code?

A glance at the source code for string.GetHashCode using Reflector reveals the following (for mscorlib.dll version 4.0):

public override unsafe int GetHashCode()
{
    fixed (char* str = ((char*) this))
    {
        char* chPtr = str;
        int num = 0x15051505;
        int num2 = num;
        int* numPtr = (int*) chPtr;
        for (int i = this.Length; i > 0; i -= 4)
        {
            num = (((num << 5) + num) + (num >> 0x1b)) ^ numPtr[0];
            if (i <= 2)
            {
            break;
            }
            num2 = (((num2 << 5) + num2) + (num2 >> 0x1b)) ^ numPtr[1];
            numPtr += 2;
        }
        return (num + (num2 * 0x5d588b65));
    }
}

Now, I realize that the implementation of GetHashCode is not specified and is implementation-dependent, so the question "is GetHashCode implemented in the form of X or Y?" is not really answerable. I'm just curious about a few things:

  1. If Reflector has disassembled the DLL correctly and this is the implementation of GetHashCode (in my environment), am I correct in interpreting this code to indicate that a string object, based on this particular implementation, would not cache its hash code?
  2. Assuming the answer is yes, why would this be? It seems to me that the memory cost would be minimal (one more 32-bit integer, a drop in the pond compared to the size of the string itself) whereas the savings would be significant, especially in cases where, e.g., strings are used as keys in a hashtable-based collection like a Dictionary<string, [...]>. And since the string class is immutable, it isn't like the value returned by GetHashCode will ever even change.

What could I be missing?


UPDATE: In response to Andras Zoltan's closing remark:

There's also the point made in Tim's answer (+1 there). If he's right, and I think he is, then there's no guarantee that a string is actually immutable after construction, therefore to cache the result would be wrong.

Whoa, whoa there! This is an interesting point to make (and yes it's very true), but I really doubt that this was taken into consideration in the implementation of GetHashCode. The statement "therefore to cache the result would be wrong" implies to me that the framework's attitude regarding strings is "Well, they're supposed to be immutable, but really if developers want to get sneaky they're mutable so we'll treat them as such." This is definitely not how the framework views strings. It fully relies on their immutability in so many ways (interning of string literals, assignment of all zero-length strings to string.Empty, etc.) that, basically, if you mutate a string, you're writing code whose behavior is entirely undefined and unpredictable.

I guess my point is that for the author(s) of this implementation to worry, "What if this string instance is modified between calls, even though the class as it is publicly exposed is immutable?" would be like for someone planning a casual outdoor BBQ to think to him-/herself, "What if someone brings an atomic bomb to the party?" Look, if someone brings an atom bomb, party's over.


Obvious potential answer: because that will cost memory.

There's a cost/benefit analysis here:

Cost: 4 bytes for every string (and a quick test on each call to GetHashCode). It also makes the string object's internal state mutable, which obviously means you'd need to be careful about the implementation - unless you always compute the hash code up-front, which costs one computation for every string, regardless of whether you ever hash it at all.

Benefit: Avoid recomputing the hash for string values hashed more than once

I would suggest that in many cases, there are many, many string objects and very few of them are hashed more than once - leading to a net cost. For some cases, obviously that won't be the case.

I don't think I'm in a good position to judge which comes up more often... I would hope that MS has instrumented various real apps. (I'd also hope that Sun did the same for Java, which does cache the hash...)

EDIT: I've just spoken to Eric Lippert about this (NDC is awesome :) and basically it is about the extra memory hit vs the limited benefits.


Firstly - there's no knowing whether caching this result would actually improve Dictionary<string, ...> et al, because they don't necessarily use String.GetHashCode; a Dictionary uses an IEqualityComparer to get the hash code for a string.

And if you follow the likely call chain for the StringComparer class, it ends up in the System.Globalization.CompareInfo class, finally terminating at this method:

[SecurityCritical, SuppressUnmanagedCodeSecurity, DllImport("QCall",
   CharSet=CharSet.Unicode)]
private static extern int InternalGetGlobalizedHashCode(IntPtr handle, string
   localeName, string source, int length, int dwFlags);

There's no knowing whether that method - which is native code - uses some form of internal caching based on the underlying .Net object data structures that we can't get at from inside the .Net runtime.

However, the important thing to note here is that one string can have many different hash codes based on how you choose to interpret the characters. Granted, this implementation is culture-agnostic - which is why it's unsuitable for these culture-aware comparers.

So, whilst the additional memory storage could be a factor, I actually think it's because storing a hash code along with an instance of the string would mislead the caller, and indeed the .Net internal dev team(!), into thinking that the string has only one hash code, when in fact it depends entirely on how you're going to interpret it - as a series of bytes (which most of us do not), or as a series of printable characters.

From a performance point of view, then, if we also accept that the comparers used by Dictionary<,> etc can't be using the internal implementation, not caching this result probably doesn't have much of an impact, because, frankly, how often will this method actually get called in the real world? Most of the time, a string's hash code is likely calculated via some other mechanism.
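The "one string, many hash codes" point is easy to observe with the built-in comparers. A small sketch (the exact hash values vary by runtime and process, so only the equalities are meaningful):

```csharp
using System;

class ComparerHashDemo
{
    static void Main()
    {
        string s = "Hello";

        // The string's own hash code: ordinal, culture-agnostic.
        int ordinal = s.GetHashCode();

        // A comparer-specific hash code, as a Dictionary<string, ...>
        // constructed with this comparer would compute it.
        int ignoreCase = StringComparer.OrdinalIgnoreCase.GetHashCode(s);

        // A case-insensitive comparer must give equal hashes for
        // "Hello" and "HELLO", so it generally cannot agree with the
        // case-sensitive default.
        Console.WriteLine(ignoreCase ==
            StringComparer.OrdinalIgnoreCase.GetHashCode("HELLO")); // True
        Console.WriteLine(ordinal == "HELLO".GetHashCode()); // almost certainly False
    }
}
```

So there is no single "the" hash code for a string that an internal cache could hold; which one applies depends on the comparer in play.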

EDIT

There's also the point made in Tim's answer(+1 there). If he's right, and I think he is, then there's no guarantee that a string is actually immutable after construction, therefore to cache the result would be wrong.

AN ADDITIONAL EDIT(!)

Dan makes the point that strings are meant to be immutable within the .Net sphere, and therefore that string should be free to cache its own hashcode based on this. The problem here is that the .Net framework also provides a legitimate way to change the supposedly immutable string that does not involve privileged reflection or anything else. It's a fundamental problem with strings: a string is a pointer to a buffer that you cannot control. Never mind in the C# world - what about in C++, where vectoring over and modifying memory buffers is commonplace? Just because you ideally shouldn't do it doesn't mean that the framework should expect you not to.

.Net happens to provide this functionality, and therefore if this was a design decision by the .Net team in response to the kind of binary thuggery suggested by Tim, then they were very wise to have taken it into account. Whether they did, or whether it is by fluke, is another matter entirely! :)


I may have made a wrong conclusion here, but isn't it true that while the string is immutable in the context of a .NET String object, it's still possible to change the value?

For instance, if you were so inclined to do this...

String example = "Hello World";

unsafe
{
    fixed (char* strPointer = example)
    {
        strPointer[1] = 'a';
    }
} 

...wouldn't example still represent the same String object, but now with a value that would compute a different value for GetHashCode()? I may be off-base here, but since you could easily (if pointlessly) do this, that would cause some issues as well.
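To make the consequence concrete, here's a self-contained sketch (compile with /unsafe; the string is copied first so we don't corrupt an interned literal shared elsewhere in the process):

```csharp
using System;

class MutatedHashDemo
{
    static unsafe void Main()
    {
        // Copy the literal so the mutation doesn't affect the interned
        // "Hello World" used elsewhere in the process.
        string example = new string("Hello World".ToCharArray());
        int before = example.GetHashCode();

        fixed (char* strPointer = example)
        {
            strPointer[1] = 'a'; // "Hello World" -> "Hallo World", in place
        }

        int after = example.GetHashCode();

        // Same object reference, different hash code (barring a freak
        // collision between the two contents).
        Console.WriteLine(before != after);
    }
}
```

If the string had cached its hash in the constructor, the cached value would now be stale; with the current recompute-on-demand implementation, GetHashCode at least stays consistent with the (mutated) contents.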


One more potential reason for this is that interned strings (specifically those added as shared readonly data by the compiler) can have exactly the same format as any other string. The fact that these strings are loaded into readonly memory means those data pages can be shared easily across processes, but it would not be possible for them to also cache a hash code.

But as others have mentioned, the primary reason for not caching the value is that the additional memory usage is likely to far outweigh the potential savings of hashcode caching. The execution time of GetHashCode is O(N) on the length of the string so the worst case scenario of repeated hashing is well bounded.


Yes, it would cost memory, but more importantly, it would cost memory even if you never used this feature.

Maybe it would be beneficial to have a hashcode-optimized string implementation in the framework.

Anyway, it should be trivial to implement your own:

public sealed class InternedString : IEquatable<InternedString>
{
    public InternedString(string s) => String = string.Intern(s);

    public string String { get; }

    public override bool Equals(object obj) => obj is InternedString other && Equals(other);

    public bool Equals(InternedString other) => String.Equals(other?.String);

    public override int GetHashCode() => RuntimeHelpers.GetHashCode(String);

    public static bool operator ==(InternedString l, InternedString r) =>
        l?.String == r?.String;

    public static bool operator !=(InternedString l, InternedString r) => !(l == r);
}

The idea here is to make sure that each wrapped string is interned, so we can rely on the string references of the same strings inside InternedString to be always the same. This approach optimizes both GetHashCode and Equals calls, making this class a perfect candidate for a Dictionary key.

The drawback is the cost of interning. Using it everywhere is overkill. The typical usage scenario is a Dictionary with a few, but very long, string keys.
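Usage might look like this (assuming the InternedString class above is in scope; RuntimeHelpers lives in System.Runtime.CompilerServices):

```csharp
using System;
using System.Collections.Generic;

class InternedStringDemo
{
    static void Main()
    {
        // Two distinct string instances with equal contents.
        string a = new string("some very long key".ToCharArray());
        string b = new string("some very long key".ToCharArray());
        Console.WriteLine(ReferenceEquals(a, b)); // False

        // After interning, both wrappers hold the same reference, so the
        // reference-based RuntimeHelpers.GetHashCode agrees for both.
        var ka = new InternedString(a);
        var kb = new InternedString(b);
        Console.WriteLine(ka.GetHashCode() == kb.GetHashCode()); // True

        // Dictionary lookups therefore hash and compare by reference.
        var counts = new Dictionary<InternedString, int> { [ka] = 1 };
        Console.WriteLine(counts[kb]); // 1
    }
}
```

The key design point is that string.Intern guarantees one canonical instance per content, which is what makes the reference-based hash stable and O(1).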

UPD:

BTW, I have packaged it, and added a benchmark, check it out.


Any int value is a valid HashCode. This means there is no default int value like -1 or 0 that we can use to indicate that we haven't computed the HashCode yet. So if a string were to cache its HashCode, it would need to do one of the following:

  • Have an int field for the HashCode, plus a bool field to serve as a flag for whether the HashCode has been computed yet, and then only compute the HashCode the first time it's requested (lazy evaluation), or
  • Have an int field for the HashCode, and always compute the HashCode when the string is constructed.

Both choices have a drawback; the first requires yet more additional memory, and the second has the performance cost of computing HashCodes that may never be needed.
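The two options can be sketched with hypothetical caching wrappers (illustrative only - this is not how System.String is implemented):

```csharp
// Option 1: lazy - an extra flag field is needed, because every int
// value is a valid hash code, so no sentinel like 0 or -1 can mean
// "not computed yet".
sealed class LazyHashedString
{
    private readonly string _value;
    private int _hash;          // extra 4 bytes per instance
    private bool _hashComputed; // extra flag field per instance

    public LazyHashedString(string value) => _value = value;

    public override int GetHashCode()
    {
        if (!_hashComputed)
        {
            _hash = _value.GetHashCode(); // computed at most once
            _hashComputed = true;
        }
        return _hash;
    }
}

// Option 2: eager - the hash is computed at construction, whether or
// not it is ever requested.
sealed class EagerHashedString
{
    private readonly string _value;
    private readonly int _hash; // paid for every instance, hashed or not

    public EagerHashedString(string value)
    {
        _value = value;
        _hash = value.GetHashCode(); // O(N) up front for every string
    }

    public override int GetHashCode() => _hash;
}
```

Option 1 pays memory (and a branch per call) for every string; option 2 pays an O(N) hash computation for every string ever constructed, hashed or not.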

Now consider the case of Dictionary<TKey,TValue>. The HashCode used by Dictionary depends upon which comparer is being used. The default comparer will use the object's normal GetHashCode() method. But you could create a Dictionary that uses a case insensitive comparer for example, and the HashCode used by Dictionary will be produced by that comparer, which is likely to produce an entirely different HashCode than String.GetHashCode(). So which HashCode does the string cache? A string might be in two Dictionaries, with each using a different comparer, neither of which uses the normal string GetHashCode. So the string could be caching a HashCode none of the Dictionaries even use.

In the case of Dictionary<TKey,TValue>, there is an even more important reason that having strings cache their HashCodes will likely provide no performance benefit. The internal implementation of Dictionary does the following when a new entry is added:

  • Computes the HashCode of the key using the GetHashCode() method of the equality comparer provided at construction, or the default comparer if none was specified.
  • Strips the sign bit off the HashCode
  • Stores the new entry, which consists of the modified HashCode from above, the key, the value, and the index of the next entry in the list of entries that map to the same bucket.

When the Dictionary does a key lookup, it computes the modified (i.e. positive) HashCode of the key being searched for, gets the bucket that HashCode maps to, then looks through the list of entries in that bucket. To check whether an entry is a match, it first checks whether the modified HashCodes match (if the keys are equal, the HashCodes must be equal too), and only then checks whether the two keys are equal as well. In the case of strings, this algorithm achieves two things: first, it avoids many string comparisons by using a simple integer compare to see whether it's worth doing a string compare at all; and second, it caches the HashCode of every key in the Dictionary. The HashCode of each key is computed only once, when the key/value pair is added to the Dictionary.

(If you're wondering why Dictionary strips the sign bit from the HashCode, it's because it uses a -1 as a marker flag value in the hashCode field for entry slots that are currently empty.)
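The lookup path described above amounts to something like the following. This is a simplified sketch of the pattern, not the actual BCL source:

```csharp
using System.Collections.Generic;

static class DictionarySketch
{
    // Each entry carries the comparer's hash code, cached once at
    // insert time with the sign bit stripped.
    public struct Entry<TKey, TValue>
    {
        public int HashCode;   // comparer's hash & 0x7FFFFFFF
        public int Next;       // index of next entry in the same bucket, or -1
        public TKey Key;
        public TValue Value;
    }

    public static int FindEntry<TKey, TValue>(
        Entry<TKey, TValue>[] entries, int[] buckets,
        IEqualityComparer<TKey> comparer, TKey key)
    {
        int hash = comparer.GetHashCode(key) & 0x7FFFFFFF; // strip sign bit
        for (int i = buckets[hash % buckets.Length]; i >= 0; i = entries[i].Next)
        {
            // Cheap int comparison first; the (possibly expensive) key
            // comparison runs only when the cached hashes already match.
            if (entries[i].HashCode == hash && comparer.Equals(entries[i].Key, key))
                return i;
        }
        return -1;
    }
}
```

So for dictionary workloads, the hash of each key is effectively already cached - just in the dictionary's entry array rather than in the string itself, and under whichever comparer that particular dictionary uses.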
