Generate all combinations of a char array inside of a CUDA device kernel

2022-12-12 02:45 问答作者：

I need help please. I started to program a common brute forcer / password guesser with CUDA (2.3 / 3.0beta). I tried different ways to generate all possible plain text "candidates" of a defined ASCII char set.

In this sample code I want to generate all 74^4 possible combinations (and just output the result back to host/stdout).

$ ./combinations
Total number of combinations    : 29986576

Maximum output length   : 4
ASCII charset length    : 74

ASCII charset   : 0x30 - 0x7a
 "0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxy"

CUDA code (compiled with 2.3 and 3.0b - sm_10) - combinaions.cu:

#include <stdio.h>
#include <cuda.h>

__device__ uchar4 charset_global = {0x30, 0x30, 0x30, 0x30};
__shared__ __device__ uchar4 charset[128];

__global__ void combo_kernel(uchar4 * result_d, unsigned int N)
{
 int totalThreads = blockDim.x * gridDim.x ;
 int tasksPerThrea开发者_如何学Pythond = (N % totalThreads) == 0 ? N / totalThreads : N/totalThreads + 1;
 int myThreadIdx = blockIdx.x * blockDim.x + threadIdx.x ;
 int endIdx = myThreadIdx + totalThreads * tasksPerThread ;
 if( endIdx > N) endIdx = N;

 const unsigned int m = 74 + 0x30;

 for(int idx = myThreadIdx ; idx < endIdx ; idx += totalThreads) {
  charset[threadIdx.x].x = charset_global.x;
  charset[threadIdx.x].y = charset_global.y;
  charset[threadIdx.x].z = charset_global.z;
  charset[threadIdx.x].w = charset_global.w;
  __threadfence();

  if(charset[threadIdx.x].x < m) {
   charset[threadIdx.x].x++;

  } else if(charset[threadIdx.x].y < m) {
   charset[threadIdx.x].x = 0x30; // = 0
   charset[threadIdx.x].y++;

  } else if(charset[threadIdx.x].z < m) {
   charset[threadIdx.x].y = 0x30; // = 0
   charset[threadIdx.x].z++;

  } else if(charset[threadIdx.x].w < m) {
   charset[threadIdx.x].z = 0x30;
   charset[threadIdx.x].w++;; // = 0
  }

  charset_global.x = charset[threadIdx.x].x;
  charset_global.y = charset[threadIdx.x].y;
  charset_global.z = charset[threadIdx.x].z;
  charset_global.w = charset[threadIdx.x].w;

  result_d[idx].x = charset_global.x;
  result_d[idx].y = charset_global.y;
  result_d[idx].z = charset_global.z;
  result_d[idx].w = charset_global.w;
 }
}

#define BLOCKS 65535
#define THREADS 128

int main(int argc, char **argv)
{
 const int ascii_chars = 74;
 const int max_len = 4;
 const unsigned int N = pow((float)ascii_chars, max_len);
 size_t size = N * sizeof(uchar4);

 uchar4 *result_d, *result_h;
 result_h = (uchar4 *)malloc(size );
 cudaMalloc((void **)&result_d, size );
 cudaMemset(result_d, 0, size);

 printf("Total number of combinations\t: %d\n\n", N); 
 printf("Maximum output length\t: %d\n", max_len);
 printf("ASCII charset length\t: %d\n\n", ascii_chars);

 printf("ASCII charset\t: 0x30 - 0x%02x\n ", 0x30 + ascii_chars);
 for(int i=0; i < ascii_chars; i++)
  printf("%c",i + 0x30);
 printf("\n\n");

 combo_kernel <<< BLOCKS, THREADS >>> (result_d, N);
 cudaThreadSynchronize();

 printf("CUDA kernel done\n");
 printf("hit key to continue...\n");
 getchar();

 cudaMemcpy(result_h, result_d, size, cudaMemcpyDeviceToHost);

 for (unsigned int i=0; i<N; i++)
  printf("result[%06u]\t%c%c%c%c\n",i, result_h[i].x, result_h[i].y, result_h[i].z, result_h[i].w);

 free(result_h);
 cudaFree(result_d);
}

The code should compile without any problems but the output is not what i expected.

On emulation mode:

CUDA kernel done hit
key to continue...

    result[000000]  1000 
...
    result[000128]  5000

On release mode:

CUDA kernel done hit
key to continue...

    result[000000]  1000 
...
    result[012288]  5000

I also used __threadfence() and or __syncthreads() on different lines of the code also without success...

ps. if possible I want to generate everything inside of the kernel function . I also tried "pre" generating of possible plain text candidates inside host main function and memcpy to device, this works only with a very limited charset size (because of limited device memory).

any idea about the output, why the repeating (even with __threadfence() or __syncthreads()) ?
any other method to generate plain text (candidates) inside CUDA kernel fast :-) (~75^8) ?

thanks a million

greets jan

Incidentally, your loop bound is overly complex. You don't need to do all that work to compute the endIdx, instead you can do the following, making the code simpler.

for(int idx = myThreadIdx ; idx < N ; idx += totalThreads)

Let's see:

When filling your charset array, __syncthreads() will be sufficient as you are not interested in writes to global memory (more on this later)
Your if statements are not correctly resetting your loop iterators:
- In z < m, then both x == m and y == m and must both be set to 0.
- Similar for w
Each thread is responsible for writing one set of 4 characters in charset, but every thread writes the same 4 values. No thread does any independent work.
You are writing each threads results to global memory without atomics, which is unsafe. There is no guarantee that the results won't be immediately clobbered by another thread before reading them back.
You are reading the results of computation back from global memory immediately after writing them to global memory. It's unclear why you are doing this and this is very unsafe.
Finally, there is no reliable way in CUDA to to a synchronization between all blocks, which seems to be what you are hoping for. Calling __threadfence only applies to blocks currently executing on the device, which can be subset of all blocks that should run for a kernel call. Thus it doesn't work as a synchronization primitive.

It's probably easier to calculate initial values of x, y, z and w for each thread. Then each thread can start looping from its initial values until it has performed tasksPerThread iterations. Writing the values out can probably proceed more or less as you have it now.

EDIT: Here is a simple test program to demonstrate the logic errors in your loop iteration:

int m = 2;
int x = 0, y = 0, z = 0, w = 0;

for (int i = 0; i < m * m * m * m; i++)
{
    printf("x: %d y: %d z: %d w: %d\n", x, y, z, w);
    if(x < m) {
        x++;
    } else if(y < m) {
        x = 0; // = 0
        y++;
    } else if(z < m) {
        y = 0; // = 0
        z++;
    } else if(w < m) {
        z = 0;
        w++;; // = 0
    }
}

The output of which is this:

x: 0 y: 0 z: 0 w: 0
x: 1 y: 0 z: 0 w: 0
x: 2 y: 0 z: 0 w: 0
x: 0 y: 1 z: 0 w: 0
x: 1 y: 1 z: 0 w: 0
x: 2 y: 1 z: 0 w: 0
x: 0 y: 2 z: 0 w: 0
x: 1 y: 2 z: 0 w: 0
x: 2 y: 2 z: 0 w: 0
x: 2 y: 0 z: 1 w: 0
x: 0 y: 1 z: 1 w: 0
x: 1 y: 1 z: 1 w: 0
x: 2 y: 1 z: 1 w: 0
x: 0 y: 2 z: 1 w: 0
x: 1 y: 2 z: 1 w: 0
x: 2 y: 2 z: 1 w: 0

继续阅读：character-encoding combinations cryptography

Generate all combinations of a char array inside of a CUDA device kernel

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？