Creating a hash of a string thats sortable

2022-12-15 22:29 问答作者：

Is there anyway to create hashs of string开发者_如何学JAVAs where the hashes can be sorted and have the same results as if the strings themselves were sorted?

This won't be possible, at least if you allow strings longer than the hash size. You have 256^(max. string size) possible strings mapped to 256^(hash size) hash values, so you'll end up with some of the strings unsorted.

Just imagine the simplest hash: Truncating every string to (hash size) bytes.

Yes. It's called using the entire input string as the hash.

As others have pointed out it's not practical to do exactly what you've asked. You'd have to use the string itself as the hash which would constrain the lengths of strings that could be "hashed" and so on.

The obvious approach to maintaining a "sorted hash" data structure would be to maintain both a sorted list (heap or binary tree, for example) and a hashed mapping of the data. Inserts and removals would be O(log(n)) while retrievals would be O(1). Off hand I'm not sure how often this would be worth the additional complexity and overhead.

If you had a particularly large data set, mostly read-only and such that logarithmic time retrieval was overly expensive then I suppose it might be useful. Note that the cost of updates is actually the sum of the constant time (hash) and the logarithmic time (binary tree or heap) operations. However O(1) + O(log(n)) reduces to the larger of the two terms during asymptotic analysis. (The underlying cost is still there --- relevant to any implementation effort regardless of its theoretical irrelevance).

For a significant range of data set sizes the cost of maintaining this hypothetical hybrid data structure could be estimated as "twice" the cost of maintaining either of the pure ones. (In other words many implementations of a binary tree over can scale to billions of elements (2^~32 or so) in time cost that's comparable to the cost of the typical hash functions). So I'd be hard-pressed to convince myself that such added code complexity and run-time cost (of a hybrid data structure) would actually be of benefit to a given project.

(Note: I saw that Python 3.1.1 added the notion of "ordered" dictionaries ... and this is similar to being sorted, but not quite the same. From what I gather the ordered dictionary preserves the order in which elements were inserted to the collection. I also seem to remember some talk of "views" ... objects in the language which can access keys of a dictionary in some particular manner (sorted, reversed, reverse sorted, ...) at (possibly) lower cost than passing the set of keys through the built-in "sorted()" and "reversed()." I haven't used these nor have a looked at the implementation details. I would guess that one of these "views" would be something like a lazily evaluated index, performing the necessary sorting on call, and storing the results with some sort of flag or trigger (observer pattern or listener) that's reset when the back-end source collection is updated. In that scheme a call to the "view" would update its index; subsequence calls would be able to use those results so long as no insertions nor deletions had been made to the dictionary. Any call to the view subsequent to key changes would incur the cost of updating the view. However this is all pure speculation on my part. I mention it because it might also provide insight into some alternative ways to approach the question).

Not unless there are fewer strings than hashes, and the hashes are perfect. Even then you still have to ensure the hash order is the same as the string order, this is probably not possible unless you know all the strings ahead of time.

No. The hash would have to contain the same amount of information as the string it is replacing. Otherwise, if two strings mapped to the same hash value, how could you possibly sort them?

Another way of thinking about it is this: If I have two strings, "a" and "b", then I hash both of them with this sort preserving hash function and get f(a) and f(b). However, there are an infinite number of strings that are greater than "a" but less than "b". This would require hashing the strings to arbitrary precision Real values (because of cardinality). In the end, you would basically just have the string encoded as a number.

You're essentially asking if you can compress the key strings into smaller keys while preserving their collation order. So it depends on your data. If your strings are composed of only hexadecimal digits, for example, they can be replaced with 4-bit codes.

But for the general case, it can't be done. You'd end up "hashing" each source key into itself.

I stumble upon this, and although everyone is correct with their answers, I needed a solution exactly like this to use in elasticsearch (don't ask why). Sometimes we don't need a perfect solution for all cases, we just need one to work with the constraints that are acceptable. My solution is able to generate a sortable hashcode for the first n chars of the string, I did some preliminary tests and didn't have any collisions. You need to define beforehand the charset that is used and play with n to a deemed acceptable value of the first chars needed to sort and try to maintain the result hash code in the positive interval of the defined type for it to work, in my case, for Java Long type I could go up to 13 chars. Below is my code in Java, hopefully, it will help someone else that needs this.

String charset = "abcdefghijklmnopqrstuvwxyz";

public long orderedHash(final String s, final String charset, final int n) {
  Long hash = 0L;

  if(s.isEmpty() || n == 0)
    return hash;

  Long charIndex = (long)(charset.indexOf(s.charAt(0)));
  if(charIndex == -1)
    return hash;

  for(int i = 1 ; i < n; i++)
    hash += (long)(charIndex * Math.pow(charset.length(), i));

  hash += charIndex + 1 + orderedHash(s.substring(1), charset, n - 1);

  return hash;
}

Examples:

orderedHash("a", charset, 13)              // 1
orderedHash("abc", charset, 13)            // 4110785825426312
orderedHash("b", charset, 13)              // 99246114928149464
orderedHash("google", charset, 13)         // 651008600709057847
orderedHash("stackoverflow", charset, 13)  // 1858969664686174756
orderedHash("stackunderflow", charset, 13) // 1858969712216171093
orderedHash("stackunderflo", charset, 13)  // 1858969712216171093 same, 13 chars limitation 
orderedHash("z", charset, 13)              // 2481152873203736576
orderedHash("zzzzzzzzzzzzz", charset, 13)  // 2580398988131886038
orderedHash("zzzzzzzzzzzzzz", charset, 14) // -4161820175519153195 no good, overflow
orderedHash("ZZZZZZZZZZZZZ", charset, 13)  // 0 no good, not in charset

If more precision is needed, use an unsigned type or a composite one made of two longs for example and compute the hashcode with substrings.

Edit: Although the previously algorithm sufficed for my use I noticed that it was not really ordering correctly the strings if they didn't have a length bigger that the chosen n. With this new algorithm it should be ok now.

继续阅读：hash language-agnostic sorting

Creating a hash of a string thats sortable

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？