开发者

String to unique int algorithm

We are trying to implement the following case. We have a invoice table and there is a column which has email address. We 开发者_如何学Pythonwant to somehow generate a unique int value from this email address and store that in a separate column. This will be used as a FK and indexed. So what I am looking for is an algorithm for generating ints from strings (please note that the email string should always output the same int so each email address as a unique int representation). We can use a bigint as well


Simplest solution is to put the email address into its own table along with an identity/auto_increment type column. Then you can simply carry around that identify field (a standard int), and you don't run into any issues with potential hash collisions, and no hashing overhead.


It seems a simple hashcode (MD5, SHA1, ...) should fit your needs; depending on your RDBMS, you might be able to use built-in packages (e.g. Oracle's dbms_crypto) or have to compute them externally.

Some things to keep in mind:

  • convert everything to lower/uppercase before computing the hashcode (so USER@DOMAIN.COM gets the same hashcode as user@domain.com)

  • apparently, you have a denormalized schema. It would make more sense to have a separate customer table containing the E-Mail adress; invoice should then contain only a foreign key customer_fk


MD5 - gives you a 128-bit integer. (Admittedly, this is bigger than the int datatype in most languages, but you won't get near guaranteed uniqueness with with just 32-bits.)


I don't know if you can get away with a 64-bit int: the max length of an email address is 254 characters and, in this case where you need to preserve the uniqueness of each, hashing will not do it.

So it seems you are stuck with having to get over this 254-character hurdle. My approach (always the brute force approach for me) would be to take the alphabet of allowable characters in an email address, map those to 6-bit values, and use the map to pack them into a series of words.

Take a look at rfc3696 which deals with email addresses in a way that's actually comprehensible.

Sorry to be of so little help.

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜