String to unique int algorithm
We are trying to implement the following case. We have a invoice table and there is a column which has email address. We 开发者_如何学Pythonwant to somehow generate a unique int value from this email address and store that in a separate column. This will be used as a FK and indexed. So what I am looking for is an algorithm for generating ints from strings (please note that the email string should always output the same int so each email address as a unique int representation). We can use a bigint as well
Simplest solution is to put the email address into its own table along with an identity/auto_increment type column. Then you can simply carry around that identify field (a standard int), and you don't run into any issues with potential hash collisions, and no hashing overhead.
It seems a simple hashcode (MD5, SHA1, ...) should fit your needs; depending on your RDBMS, you might be able to use built-in packages (e.g. Oracle's dbms_crypto) or have to compute them externally.
Some things to keep in mind:
convert everything to lower/uppercase before computing the hashcode (so USER@DOMAIN.COM gets the same hashcode as user@domain.com)
apparently, you have a denormalized schema. It would make more sense to have a separate customer table containing the E-Mail adress; invoice should then contain only a foreign key customer_fk
MD5 - gives you a 128-bit integer. (Admittedly, this is bigger than the int
datatype in most languages, but you won't get near guaranteed uniqueness with with just 32-bits.)
I don't know if you can get away with a 64-bit int: the max length of an email address is 254 characters and, in this case where you need to preserve the uniqueness of each, hashing will not do it.
So it seems you are stuck with having to get over this 254-character hurdle. My approach (always the brute force approach for me) would be to take the alphabet of allowable characters in an email address, map those to 6-bit values, and use the map to pack them into a series of words.
Take a look at rfc3696 which deals with email addresses in a way that's actually comprehensible.
Sorry to be of so little help.
精彩评论