
Algorithms: random unique string

I need to generate a string that meets the following requirements:

  1. it should be a unique string;
  2. string length should be 8 characters;
  3. it should contain 2 digits;
  4. all symbols (non-digital characters) should be upper case.

I will store the strings in a database after generation (they will be assigned to other entities).

My intention is to do something like this:

  1. Generate 2 random values from 0 to 9; they will be used for digits in the string;
  2. generate 6 random values from 0 to 25 and add them to 64; they will be used as 6 symbols;
  3. concatenate everything into one string;
  4. check if the string already exists in the database; if it does, repeat.

My concern with regard to that algorithm is that it doesn't guarantee a result in finite time (if there are already A LOT of values in the database).

Question: could you please give advice on how to improve this algorithm to be more deterministic?

Thanks.


  1. it should be a unique string;
  2. string length should be 8 characters;
  3. it should contain 2 digits;
  4. all symbols (non-digital characters) should be upper case.

Assuming:

  • requirements #2 and #3 are exact (exactly 8 chars, exactly 2 digits) and not a minimum
  • the "symbols" in requirement #4 are the 26 capital letters A through Z
  • you would like an evenly-distributed random string

Then your proposed method has two issues. One is that the letters A - Z are ASCII 65 - 90, not 64 - 89. The other is that it doesn't distribute the numbers evenly within the possible string space. That can be remedied by doing the following:

  1. Generate two different integers between 0 and 7, and sort them.
  2. Generate 2 random numbers from 0 to 9.
  3. Generate 6 random letters from A to Z.
  4. Use the two different integers in step #1 as positions, and put the 2 numbers in those positions.
  5. Put the 6 random letters in the remaining positions.
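
The five steps above can be sketched in Python roughly as follows. This is a minimal illustration, not a definitive implementation: the function name `random_code` is made up for this example, and the `secrets` module is an assumed choice of random source (any good generator would do).

```python
import secrets
from string import ascii_uppercase

_rng = secrets.SystemRandom()  # assumption: OS-backed random source


def random_code() -> str:
    """Build one 8-character code with exactly 2 digits and 6 capital letters."""
    # Step 1: two different positions in 0..7, sorted.
    digit_positions = sorted(_rng.sample(range(8), 2))
    # Step 2: two random digits.
    digits = [secrets.choice("0123456789") for _ in range(2)]
    # Step 3: six random letters A..Z.
    letters = [secrets.choice(ascii_uppercase) for _ in range(6)]
    # Steps 4 and 5: digits go to the chosen positions, letters fill the rest.
    out = []
    for i in range(8):
        out.append(digits.pop(0) if i in digit_positions else letters.pop(0))
    return "".join(out)
```

Uniqueness still has to be checked against the database; this only makes every valid string equally likely.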

There are 28 possibilities for the two different integers ((8*8 - 8 duplicates) / 2 orderings), 26^6 possibilities for the letters, and 100 possibilities for the numbers, the total number of valid combinations being Ncomb = 864964172800, or about 8.65 x 10^11.
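
As a quick check of that arithmetic, the count can be reproduced in a few lines of Python (the variable names are chosen here for illustration only):

```python
from math import comb

positions = comb(8, 2)   # ways to choose 2 digit positions out of 8 -> 28
letters = 26 ** 6        # 6 letters, 26 choices each -> 308915776
digits = 10 ** 2         # 2 digits, 10 choices each -> 100

n_comb = positions * letters * digits
print(n_comb)            # 864964172800
```

The result lies between 2**39 (549755813888) and 2**40 (1099511627776).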


edit: If you want to avoid the database for storage, but still guarantee both uniqueness of strings and have them be cryptographically secure, your best bet is a cryptographically random bijection from a counter between 0 and Nmax <= Ncomb to a subset of the space of possible output strings. (Bijection meaning there is a one-to-one correspondence between the output string and the input counter.)

This is possible with Feistel networks, which are commonly used in hash functions and symmetric cryptography (DES, for example; AES is not a Feistel cipher). You'd probably want to choose Nmax = 2^39, which is the largest power of 2 <= Ncomb, and use a 39-bit Feistel network (split into unequal halves of 19 and 20 bits, since 39 is odd) with a constant key that you keep secret. You then plug your counter into the Feistel network, and out comes another 39-bit number X, which you then transform into the corresponding string as follows:

  1. Repeat the following step 6 times:
  2. Take X mod 26, generate a capital letter, and set X = X / 26.
  3. Take X mod 100 to generate your two digits, and set X = X / 100.
  4. X will now be between 0 and 17 inclusive (2^39 / 26^6 / 100 = 17.796...). Map this number to two unique digit positions (probably easiest using a lookup table, since we're only talking about 28 possibilities). If you had more, use Floyd's algorithm for generating a unique permutation, and use the variable-base technique of mod + integer divide instead of generating a random number.
  5. Follow the random approach above, but use the numbers generated by this algorithm instead.

Alternatively, use 40-bit numbers, and if the output of your Feistel network is >= Ncomb, then increment the counter and try again. This covers the entire string space at the cost of rejecting invalid numbers and having to re-execute the algorithm. (But you don't need a database to do this.)
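
A rough sketch of the 40-bit variant is shown below. Everything specific in it is an assumption made for illustration: the round function (HMAC-SHA256 truncated to 20 bits), the round count of 8, and the names `feistel40`, `decode` and `next_code`. It demonstrates the bijection and the mixed-radix decoding only; it is not a reviewed cipher design.

```python
import hashlib
import hmac
from itertools import combinations

N_COMB = 28 * 26**6 * 100          # number of valid strings
HALF_BITS = 20                     # 40-bit block = two 20-bit halves
HALF_MASK = (1 << HALF_BITS) - 1
POSITIONS = list(combinations(range(8), 2))  # lookup table: 28 position pairs


def round_value(key: bytes, round_no: int, half: int) -> int:
    """Keyed round function (assumed): HMAC-SHA256 truncated to 20 bits."""
    msg = bytes([round_no]) + half.to_bytes(3, "big")
    digest = hmac.new(key, msg, hashlib.sha256).digest()
    return int.from_bytes(digest[:3], "big") & HALF_MASK


def feistel40(key: bytes, value: int, rounds: int = 8) -> int:
    """Permute a 40-bit integer; a Feistel network is a bijection by construction."""
    left, right = value >> HALF_BITS, value & HALF_MASK
    for r in range(rounds):
        left, right = right, left ^ round_value(key, r, right)
    return (left << HALF_BITS) | right


def decode(x: int) -> str:
    """Map x in [0, N_COMB) to a string using mod and integer division."""
    letters = []
    for _ in range(6):
        x, rem = divmod(x, 26)
        letters.append(chr(ord("A") + rem))
    x, two_digits = divmod(x, 100)
    digit_positions = POSITIONS[x]          # x is now in 0..27
    digits = iter(f"{two_digits:02d}")
    letter_iter = iter(letters)
    return "".join(
        next(digits) if i in digit_positions else next(letter_iter)
        for i in range(8)
    )


def next_code(key: bytes, counter: int):
    """Return (code, next counter), skipping counters whose image is >= N_COMB."""
    while True:
        x = feistel40(key, counter)
        counter += 1
        if x < N_COMB:
            return decode(x), counter
```

Only the counter and the secret key need to be stored. Each call such as `code, counter = next_code(key, counter)` yields a string that no earlier counter value produced, as long as the counter stays below 2^40.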

But this isn't something to get into unless you know what you're doing.


Are these user passwords? If so, there are a couple of things you need to take into account:

  1. You must avoid 0/O and I/1, which can easily be mistaken for each other.
  2. You must avoid too many consecutive letters, which might spell out a rude word.

As far as 2 is concerned, you can avoid the problem by using LLNLLNLL as your pattern (L = letter, N = number).

If you need 1 million passwords out of a pool of 2.5 billion, you will certainly get clashes in your database, so you have to deal with them gracefully. But a simple retry is enough, if your random number generator is robust.
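
A minimal sketch of this advice, assuming the ambiguous characters are dropped from both alphabets and using an in-memory set as a stand-in for the database. The names `random_code` and `allocate` and the retry limit of 100 are illustrative choices, not part of the original suggestion.

```python
import secrets

LETTERS = "ABCDEFGHJKLMNPQRSTUVWXYZ"  # A-Z without I and O
DIGITS = "23456789"                   # 0-9 without 0 and 1
PATTERN = "LLNLLNLL"                  # L = letter, N = number


def random_code() -> str:
    """Generate one code following PATTERN from the reduced alphabets."""
    return "".join(
        secrets.choice(LETTERS if slot == "L" else DIGITS) for slot in PATTERN
    )


def allocate(used: set, max_tries: int = 100) -> str:
    """Retry on clashes; `used` stands in for a uniqueness check in the database."""
    for _ in range(max_tries):
        code = random_code()
        if code not in used:
            used.add(code)
            return code
    raise RuntimeError("no free code found after max_tries attempts")
```

In a real system the membership test and the insert would typically be a single INSERT guarded by a unique constraint, with the retry triggered by the constraint violation.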


I don't see anything in your requirements that states that the string needs to be random. You could just do something like the following pseudocode:

for letters in ( 'AAAAAA' .. 'ZZZZZZ' ) {
  for numbers in ( 00 .. 99 ) {
    string = letters + numbers
  }
}

This will create unique strings eight characters long, with two digits and six upper-case letters.
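
For reference, the same enumeration could be written as a runnable Python generator along these lines (the name `sequential_codes` is made up for this sketch):

```python
from itertools import islice, product
from string import ascii_uppercase


def sequential_codes():
    """Yield AAAAAA00, AAAAAA01, ..., ZZZZZZ99 in order, without repeats."""
    for letters in product(ascii_uppercase, repeat=6):
        prefix = "".join(letters)
        for number in range(100):
            yield f"{prefix}{number:02d}"


# Example: the first three codes.
print(list(islice(sequential_codes(), 3)))  # ['AAAAAA00', 'AAAAAA01', 'AAAAAA02']
```

Because it is a generator, nothing is held in memory; persisting only the position reached is enough to resume later.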

If you need randomly-generated strings, then you need to keep some kind of record of which strings have been previously generated, so you're going to have to hit a DB (or keep them all in memory, or write them to a textfile) and check against that list.


I think you're safe well into your tens of thousands of such IDs, and even after that you're most likely all right.

Now if you want some determinism, you can always force a password after a certain number of failures. Say after 50 failures, you select a password at random and increment a part of it by 1 until you get a free one.

I'm willing to bet money though that you'll never see the extra functionality kick in during your lifetime :)


Do it the other way around: generate one big random number that you will split up to obtain the individual characters:

 long bigRandom = ...;
 int firstDigit = (int) (bigRandom % 10);
 int secondDigit = (int) ((bigRandom / 10) % 10);

and so on.

Then you only store the random number in your database and not the string. Since there's a one-to-one relationship between the string and the number, this doesn't really make a difference.

However, when you try to insert a new value and it's already in the database, you can easily find the smallest unallocated number greater than the originally generated number, and use that instead of the one you generated.

What you gain from this method is that you're guaranteed to find an available code relatively quickly, even when most codes are already allocated.
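
Here is a hedged Python sketch of the idea. It assumes the two digits occupy the first two positions, that the search wraps around to 0 at the end of the range, and that an in-memory set stands in for the database column; `number_to_code`, `next_free` and `allocate` are names invented for this example.

```python
import secrets

SPACE = 100 * 26**6  # 2 digits followed by 6 letters (assumed layout)


def number_to_code(n: int) -> str:
    """Split one number into characters with mod and integer division."""
    n, first = divmod(n, 10)
    n, second = divmod(n, 10)
    letters = []
    for _ in range(6):
        n, rem = divmod(n, 26)
        letters.append(chr(ord("A") + rem))
    return f"{first}{second}" + "".join(letters)


def next_free(start: int, used: set) -> int:
    """Smallest unallocated number >= start, wrapping around (assumption)."""
    if len(used) >= SPACE:
        raise RuntimeError("all codes are allocated")
    n = start
    while n in used:
        n = (n + 1) % SPACE
    return n


def allocate(used: set) -> int:
    """Pick a random number, fall back to the next free one on a clash."""
    n = next_free(secrets.randbelow(SPACE), used)
    used.add(n)
    return n
```

With a real database the loop in `next_free` would normally be replaced by a query over an indexed column. Note that numbers just after densely used ranges become more likely than others, so the result is no longer uniformly random.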


For one thing, your list of requirements doesn't state that the string necessarily has to be random, so you might consider something like a database index.

If 'random' is a requirement, you can make a few improvements.

  1. Store the string as a number in the database. Not sure how much this improves performance.
  2. Do not store used strings at all. You can employ the 'index' approach above, but convert the integer to a string in a seemingly random fashion (e.g., employing a bit shift). Without much research, nobody will notice the pattern.

E.g., if we have the sequence 1, 2, 3, 4, ... and use a cyclic binary shift right by 1 bit, it'll be turned into 4, 1, 5, 2, ... (assuming we have 3 bits only). It doesn't have to be a shift either; it can be a permutation or any other 'randomization'.
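
The 3-bit example can be reproduced with a small helper like the one below (`rotate_right` is a name chosen for this sketch). It only obscures the sequence; it is not cryptographically secure.

```python
def rotate_right(value: int, bits: int, shift: int = 1) -> int:
    """Cyclically shift the low `bits` bits of value to the right."""
    shift %= bits
    mask = (1 << bits) - 1
    value &= mask
    return ((value >> shift) | (value << (bits - shift))) & mask


print([rotate_right(n, 3) for n in (1, 2, 3, 4)])  # [4, 1, 5, 2]
```

Because a rotation is a bijection on the `bits`-bit range, distinct indexes always map to distinct values, so uniqueness is preserved without storing anything.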


The problem with your approach is clearly that while you have few records, you are very unlikely to get collisions but as your number of records grows the chance will increase until it becomes more likely than not that you'll get a collision. Eventually you will be hitting multiple collisions before you get a 'valid' result. Every time will require a table scan to determine if the code is valid, and the whole thing turns into a mess.

The simplest solution is to precalculate your codes.

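Start with the first code, 00AAAAAA, and increment to generate 00AAAAAB, 00AAAAAC, ... up to 99ZZZZZZ, or stop once you have as many codes as you expect to need. Insert them into a table in random order. When you need a new code, retrieve the top unused record from the table (then mark it as used). The table doesn't have to be huge: the full range is 100 x 26^6 (about 3.1 x 10^10) codes, but for something like the 1 million codes mentioned above, a few million records are plenty.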

  • You don't need to calculate any random numbers and generate strings for each user (already done)
  • You don't need to check whether anything has already been used, just get the next available
  • No chance of getting multiple collisions before finding something usable.

If you ever need more 'codes', just generate some more 'random' strings and append them to the table.
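
A small sketch of the precomputed-table approach using SQLite from Python's standard library. The table and column names, the batch size, and the helper names `build_pool` and `claim_code` are all assumptions for illustration; the codes use the question's 8-character format with the digits first.

```python
import random
import sqlite3
from itertools import islice, product
from string import ascii_uppercase


def ordered_codes():
    """Yield 00AAAAAA, 00AAAAAB, ... in increasing order."""
    for number in range(100):
        for letters in product(ascii_uppercase, repeat=6):
            yield f"{number:02d}" + "".join(letters)


def build_pool(conn, count, seed=None):
    """Precompute `count` codes and insert them in shuffled order."""
    codes = list(islice(ordered_codes(), count))
    random.Random(seed).shuffle(codes)
    conn.execute(
        "CREATE TABLE codes (id INTEGER PRIMARY KEY, "
        "code TEXT UNIQUE, used INTEGER NOT NULL DEFAULT 0)"
    )
    conn.executemany("INSERT INTO codes (code) VALUES (?)", [(c,) for c in codes])
    conn.commit()


def claim_code(conn):
    """Take the next unused code and mark it as used in one transaction."""
    with conn:
        row = conn.execute(
            "SELECT id, code FROM codes WHERE used = 0 ORDER BY id LIMIT 1"
        ).fetchone()
        if row is None:
            raise LookupError("code pool exhausted")
        conn.execute("UPDATE codes SET used = 1 WHERE id = ?", (row[0],))
        return row[1]
```

Usage would look like `conn = sqlite3.connect(":memory:")`, then `build_pool(conn, 1000)` once, then `claim_code(conn)` per request. A multi-user database would additionally need row locking or an equivalent so that two sessions cannot claim the same record.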
