
How do I use an MD5 hash (or other binary data) as a key name?

I've been trying to use an MD5 hash as a key name on App Engine, but the code I wrote raises a UnicodeDecodeError:

from google.appengine.ext import db
import hashlib
key = db.Key.from_path('Post', hashlib.md5('thecakeisalie').digest())

I don't want to use hexdigest() as that is not only a kludge, but an inferior one too (base64 would do a better job).


The App Engine Python docs say:

A key_name is stored as a Unicode string (with str values converted as ASCII text).

The key name has to be a string that can be converted to Unicode as ASCII text. You need to change the digest() call to hexdigest(), i.e.:

k = hashlib.md5('thecakeisalie').hexdigest()
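
To see why the raw digest fails while hexdigest() works, here is a minimal sketch that runs without the App Engine SDK. It assumes the key-name conversion behaves like a plain ASCII decode, which is what the quoted documentation suggests; the variable names are only illustrative, and the input is written as a bytes literal so the snippet also runs on Python 3.

```python
import hashlib

digest = hashlib.md5(b'thecakeisalie').digest()   # 16 raw bytes

# Assumption: a str key name is converted roughly like an ASCII decode.
try:
    digest.decode('ascii')
    ascii_ok = True
except UnicodeDecodeError:
    # Raised as soon as a byte >= 0x80 is encountered.
    ascii_ok = False

hex_name = hashlib.md5(b'thecakeisalie').hexdigest()
hex_name_is_ascii = all(ord(c) < 128 for c in hex_name)

print(ascii_ok)            # False
print(hex_name_is_ascii)   # True
print(len(hex_name))       # 32
```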


Decode the bytestring with iso-8859-1:

>>> hashlib.md5('thecakeisalie').digest().decode("iso-8859-1")
u"'\xfc\xce\x84h\xa9\x1e\x8a\x12;\xa5\xb1K\xea\xef\xd6"

This is basically a "NOP" conversion. It creates a unicode object that is the same length as the initial string and can be converted back to a byte string with .encode("iso-8859-1") if you wish.
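
The round trip can be checked directly. This is a small sketch rather than App Engine code; it relies only on the fact that ISO-8859-1 maps every byte value 0-255 to the code point with the same number, so no byte can fail to decode.

```python
import hashlib

digest = hashlib.md5(b'thecakeisalie').digest()

# Every byte becomes exactly one character, so the length is preserved.
text = digest.decode('iso-8859-1')

# Encoding with the same codec gives back the original bytes.
restored = text.encode('iso-8859-1')

print(len(text))            # 16
print(restored == digest)   # True
```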


Let's think about data sizes. The optimal solution here is 16 bytes:

>>> hashlib.md5('thecakeisalie').digest() 
"'\xfc\xce\x84h\xa9\x1e\x8a\x12;\xa5\xb1K\xea\xef\xd6"

>>> len(hashlib.md5('thecakeisalie').digest())
16

The first thing you thought of was hexdigest, but it's not very close to 16 bytes:

>>> hashlib.md5('thecakeisalie').hexdigest() 
'27fcce8468a91e8a123ba5b14beaefd6'

>>> len(hashlib.md5('thecakeisalie').hexdigest())
32

The raw 16-byte digest isn't ASCII-encodable, though, so we have to do something else. The simple thing to do is use the Python representation:

>>> repr(hashlib.md5('thecakeisalie').digest())
'"\'\\xfc\\xce\\x84h\\xa9\\x1e\\x8a\\x12;\\xa5\\xb1K\\xea\\xef\\xd6"'

>>> len(repr(hashlib.md5('thecakeisalie').digest()))
54

We can get rid of a bunch of that by removing the "\x" escapes and the surrounding quotes:

>>> repr(hashlib.md5('thecakeisalie').digest())[1:-1].replace('\\x','')
"'fcce84ha91e8a12;a5b1Keaefd6"

>>> len(repr(hashlib.md5('thecakeisalie').digest())[1:-1].replace('\\x',''))
28

That's pretty good, but base64 does a little better:

>>> import base64
>>> base64.b64encode(hashlib.md5('thecakeisalie').digest())
'J/zOhGipHooSO6WxS+rv1g=='
>>> len(base64.b64encode(hashlib.md5('thecakeisalie').digest()))
24

Overall, base64 is the most space-efficient of these, but I'd just go with hexdigest as it's likely to be the most optimized (time-efficient).
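
The two practical candidates can be compared side by side. This is just a sketch of the size comparison made above; the dictionary names are illustrative, and the repr-based variants are left out because their output differs between Python versions.

```python
import base64
import hashlib

digest = hashlib.md5(b'thecakeisalie').digest()

# Both candidates are plain ASCII text, so they are safe as key names.
candidates = {
    'hexdigest': hashlib.md5(b'thecakeisalie').hexdigest(),
    'base64': base64.b64encode(digest).decode('ascii'),
}

sizes = {name: len(value) for name, value in candidates.items()}

for name in sorted(sizes):
    print(name, sizes[name])
```

Hex always takes 2 characters per byte (32 for MD5), while base64 takes 4 characters per 3 bytes, padded up to a multiple of 4 (24 for MD5).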


Gnibbler's answer gives a length of 16!

>>> hashlib.md5('thecakeisalie').digest().decode("iso-8859-1")
u"'\xfc\xce\x84h\xa9\x1e\x8a\x12;\xa5\xb1K\xea\xef\xd6"
>>> len(hashlib.md5('thecakeisalie').digest().decode("iso-8859-1"))
16


I find using a base64 encoding of the binary data a reasonable solution. Based on your code you could do something like:

import hashlib
import base64
print base64.b64encode(hashlib.md5('thecakeisalie').digest())
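
If the raw digest is needed again later, the base64 text can be decoded back. The helpers below are hypothetical (md5_key_name and digest_from_key_name are not part of any SDK), and the Key construction is left as a comment because it needs the App Engine SDK to run.

```python
import base64
import hashlib


def md5_key_name(data):
    """Hypothetical helper: base64 text of the MD5 digest of `data` (bytes)."""
    return base64.b64encode(hashlib.md5(data).digest()).decode('ascii')


def digest_from_key_name(name):
    """Recover the raw 16-byte digest from a name built by md5_key_name."""
    return base64.b64decode(name)


name = md5_key_name(b'thecakeisalie')
# With the SDK available, this would be used as in the question:
#   key = db.Key.from_path('Post', name)

print(len(name))                        # 24
print(len(digest_from_key_name(name)))  # 16
```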


An entity key in App Engine can have either an ID (a 4 byte integer), or a name (500 byte UTF-8 encoded string).

An MD5 digest is 16 bytes of binary data: too large for an integer, (likely to be) invalid UTF-8. Some form of encoding must be used.

If hexdigest() is too verbose at 32 bytes then try base64 at 24 bytes.

Whatever encoding scheme you use will ultimately be converted to UTF-8 by the datastore, so the following, which at first looks like an optimal encoding...

>>> u = hashlib.md5('thecakeisalie').digest().decode("iso-8859-1")
>>> len(u)
16

...when encoded into its final representation is two bytes longer than the base64 encoding:

>>> s = u.encode('utf-8')
>>> len(s)
26
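
The growth comes from bytes at or above 0x80: each of them becomes a two-byte sequence in UTF-8, while bytes below 0x80 stay single. A short sketch of that arithmetic, assuming (as stated above) that the datastore stores the name as UTF-8; the variable names are illustrative:

```python
import base64
import hashlib

digest = hashlib.md5(b'thecakeisalie').digest()

# Stored size of the iso-8859-1 trick once converted to UTF-8.
latin1_utf8 = digest.decode('iso-8859-1').encode('utf-8')

# Stored size of the base64 form; it is pure ASCII, so UTF-8 adds nothing.
b64_utf8 = base64.b64encode(digest).decode('ascii').encode('utf-8')

# Count the bytes that need two bytes each in UTF-8.
high_bytes = sum(1 for b in bytearray(digest) if b >= 0x80)

print(len(latin1_utf8) == len(digest) + high_bytes)   # True
print(len(b64_utf8))                                  # 24
```

Since the number of high bytes depends on the particular digest, the iso-8859-1 form can land anywhere between 16 and 32 stored bytes, whereas base64 is always 24.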