
Calculating a "based" data checksum (SHA-1/2, etc.)

I'm not sure exactly how to ask this, but here's what I'm hoping for: given a structure that could contain 5+n keys (that is, 5 keys are mandatory in my system and additional keys are optional), I would like a hashing mechanism that can determine that a 6-key struct with 5 identical keys is a superset of the 5-key struct and carries additional information. Specifically a hashing mechanism, as there are constraints which preclude sending the complete struct over the wire on every request.

For clarification, here's some sample data (this example uses 2+n keys):

---
  name: codebeaker
  occupation: developer

Hashed with SHA-512 and SHA-256, this comes out looking like:

SHA-512
04fe500f2b3e779aba9ecb171224a04d35cc8453eb1521c7e31fd48b56b1cce9
b1e8af775e177e110982bfb16a6ca8652d7d9812ab8a8c316015dc9d6b3b54f7

SHA-256
4833be7086726e7ffd82db206f94f0a4f9fdf7fba00692f626157afed4587c74

When adding an additional key (example below), I would like to be able to deduce that the extended dataset is a superset of the first.

---
  name: codebeaker
  occupation: developer
  telephone: 49 (0) 123 45 67

However, unsurprisingly, with MD5, SHA-n, and every other hashing function I have looked into, there's no way to do this. Example:

SHA-512
2fe2c1f01e39506010ea104581b737f95db6b6f71b1497788afc80a4abe26ab0
fc4913054278af69a89c152406579b7b00c3d4eb881982393a1ace83aeb7b6a2

SHA-256
77c2942e9095e55e13c548e5ef1f874396bfb64f7653e4794d6d91d0d3a168e2

(Obviously) there are no similarities...

Our use case: this data, formatted as a struct, is fed into our system by a third party. Processing the data is hugely expensive, 2-3 seconds per operation, and we can get about 50% of that time back if we know we already have a result from a previous run. However, Bayesian and Levenshtein text-difference algorithms aren't suitable here, as we often see key/value pairs that are acronyms, and other text which can appear similar while being completely unrelated.

What we need is a way to checksum the data (I might be biasing my response here) so that we can determine that B is a superset of A if it contains all the same keys with the same data. However, there is often so much data in the key/value entries of our struct that sending it over the wire every time, only to determine that we have already seen a more complete copy, would be expensive and wasteful.


One idea would be to use a separate hash per key/value pair. The "hash" of the complete struct is then a collection of hashes.
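
A minimal sketch of that in Python (the helper name pair_digests and the null-byte separator are my choices, not anything from the question):

import hashlib

def pair_digests(struct):
    # One SHA-256 digest per key/value pair; "\x00" separates key from
    # value so ("ab", "c") and ("a", "bc") cannot collide.
    return {
        hashlib.sha256(f"{k}\x00{v}".encode("utf-8")).hexdigest()
        for k, v in struct.items()
    }

a = {"name": "codebeaker", "occupation": "developer"}
b = dict(a, telephone="49 (0) 123 45 67")

# b is a superset of a iff a's pair digests are a subset of b's.
assert pair_digests(a) <= pair_digests(b)

Comparing supersets then costs one small digest per pair over the wire instead of the full values.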

If your use case is always five identical keys in the same order, followed by any additional keys, you could use one hash for the mandatory keys and one for the optional keys - but you would then be unable to detect that one struct containing optional keys is a superset of another struct that also contains optional keys.

A slight variation is to use one hash for the required keys and one for the entire struct.
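
A sketch of that variation, assuming the mandatory keys and their order are agreed upon up front (MANDATORY here is a stand-in for your five required keys):

import hashlib

MANDATORY = ("name", "occupation")  # stand-in for the five required keys

def two_hashes(struct):
    # First digest covers only the mandatory keys, in their agreed order;
    # second digest covers the whole struct in a canonical (sorted) order.
    h_req, h_all = hashlib.sha256(), hashlib.sha256()
    for k in MANDATORY:
        h_req.update(f"{k}\x00{struct[k]}\x00".encode("utf-8"))
    for k in sorted(struct):
        h_all.update(f"{k}\x00{struct[k]}\x00".encode("utf-8"))
    return h_req.hexdigest(), h_all.hexdigest()

a = {"name": "codebeaker", "occupation": "developer"}
b = dict(a, telephone="49 (0) 123 45 67")
assert two_hashes(a)[0] == two_hashes(b)[0]  # same required part
assert two_hashes(a)[1] != two_hashes(b)[1]  # but not identical structs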

You could also (depending on your requirements) use smaller checksums for the key/value pairs to quickly discard something as not being the same - but larger hashes would still be needed to determine with more confidence that something is a match.
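
For example (Python again; CRC32 as the cheap checksum is my choice, not something from the question):

import hashlib
import zlib

def cheap(k, v):
    # 32-bit CRC: collisions are easy, so a mismatch proves "different"
    # but a match proves nothing on its own.
    return zlib.crc32(f"{k}\x00{v}".encode("utf-8"))

def strong(k, v):
    return hashlib.sha256(f"{k}\x00{v}".encode("utf-8")).digest()

def same_pair(p, q):
    if cheap(*p) != cheap(*q):
        return False                  # fast, definite rejection
    return strong(*p) == strong(*q)   # confirm with the full hash

print(same_pair(("name", "codebeaker"), ("name", "Codebeaker")))  # False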


Cryptographic hashes are specifically designed with these properties:

  • They are one-way functions. It is practically infeasible to recalculate a specific input for a given hash value, or even to find any input at all that hashes to that value.
  • Though there must be collisions, because the input space is much larger than the fixed output size, it is also practically infeasible to find two different input values that result in the same hash value.
  • Exactly the same input value always hashes to exactly the same hash value.
  • Any small change in the input results in a completely different hash value. Flipping any single input bit changes, on average, 50 percent of the output bits.

Thus a cryptographic hash can be, and actually is, used as a unique identifier for any binary data. Even "name: codebeaker" has a different hash than "name: Codebeaker".
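
Those two strings make a nice demonstration of the avalanche effect: "c" (0x63) and "C" (0x43) differ in a single bit, yet roughly half of the 256 output bits flip:

import hashlib

a = hashlib.sha256(b"name: codebeaker").digest()
b = hashlib.sha256(b"name: Codebeaker").digest()
diff = int.from_bytes(a, "big") ^ int.from_bytes(b, "big")
print(bin(diff).count("1"), "of 256 output bits differ")  # around 128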

If your set of keys is fixed, in a fixed order, always complete, only ever extended by new keys at the end, and each key has only one allowed representation, then you can calculate the hash of the five old keys and compare it to the existing hashes of the current sets.
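
A sketch of that case, assuming such a fixed, agreed key order (FIXED_ORDER stands in for the five mandatory keys):

import hashlib

FIXED_ORDER = ("name", "occupation")  # stand-in for the five fixed keys

def prefix_hash(struct):
    # Digest over the fixed prefix only; extra trailing keys don't affect it.
    h = hashlib.sha256()
    for k in FIXED_ORDER:
        h.update(f"{k}\x00{struct[k]}\x00".encode("utf-8"))
    return h.hexdigest()

old = {"name": "codebeaker", "occupation": "developer"}
new = dict(old, telephone="49 (0) 123 45 67")
assert prefix_hash(old) == prefix_hash(new)  # same five old keys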

If the keys are always unique, but the sets can be mixed, then you can calculate a separate hash for each key, store these per-key hashes for the existing sets in a separate database, and search them there.
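
The lookup side could look like this sketch (a plain dict stands in for the separate database; in practice it would be a table mapping digest to struct id):

import hashlib
from collections import defaultdict

def pair_digest(k, v):
    return hashlib.sha256(f"{k}\x00{v}".encode("utf-8")).hexdigest()

index = defaultdict(set)  # pair digest -> ids of structs containing it

def store(struct_id, struct):
    for k, v in struct.items():
        index[pair_digest(k, v)].add(struct_id)

def supersets_of(struct):
    # Ids of stored structs containing every pair of `struct`.
    hits = [index.get(pair_digest(k, v), set()) for k, v in struct.items()]
    return set.intersection(*hits) if hits else set()

store(1, {"name": "codebeaker", "occupation": "developer",
          "telephone": "49 (0) 123 45 67"})
print(supersets_of({"name": "codebeaker", "occupation": "developer"}))  # {1}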

Beyond this, cryptographic hashes may not be the right tool for the job.

[Edit]

Another approach is to first sort the keys alphabetically and then take the hash value over the sorted set. This identifies your set without needing to care about order. It may be more practical to first take the individual hashes of the single keys, sort those hashes instead, and take the hash over the list of sorted hashes. This still requires unique keys.
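
A sketch of the sort-the-hashes variant:

import hashlib

def set_digest(struct):
    # Hash each pair, sort the digests, hash the concatenation: input
    # order no longer matters, but keys must still be unique.
    pair_hashes = sorted(
        hashlib.sha256(f"{k}\x00{v}".encode("utf-8")).digest()
        for k, v in struct.items()
    )
    return hashlib.sha256(b"".join(pair_hashes)).hexdigest()

a = {"occupation": "developer", "name": "codebeaker"}
b = {"name": "codebeaker", "occupation": "developer"}
assert set_digest(a) == set_digest(b)  # same set, different order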

