Thread Index in CUDA
I'm running the following code on my 8000 series device (which supports CUDA):
#include <stdio.h>
__global__ void testSet(int * MyBlock)
{
    unsigned int ThreadIDX= threadIdx.x+blockDim.x*blockIdx.x;
    MyBlock[ThreadIDX]=ThreadIDX;
}
int main()
{
    int * MyInts;
    int Result[1024];
    cudaMalloc( (void**) &MyInts,sizeof(int)*1024);
    testSet<<<2,512>>>(MyInts);
    cudaMemcpy(Result,MyInts,sizeof(int)*1024,cudaMemcpyDeviceToHost);
    for(unsigned int t=0; t<1024/8;t++) {
        printf("Results: %d %d %d %d %d %d %d %d\n",
               Result[t], Result[t+1],Result[t+2],
               Result[t+3],Result[t+4],Result[t+5],
               Result[t+6],Result[t+7]);
    }
    return 0;
}
And I get this output:
Results: 0 1 2 3 4 5 6 7
Results: 1 2 3 4 5 6 7 8
Results: 2 3 4 5 6 7 8 9
Results: 3 4 5 6 7 8 9 10
Results: 4 5 6 7 8 9 10 11
Results: 5 6 7 8 9 10 11 12
Results: 6 7 8 9 10 11 12 13
Results: 7 8 9 10 11 12 13 14
Results: 8 9 10 11 12 13 14 15
Results: 9 10 11 12 13 14 15 16
Results: 10 11 12 13 14 15 16 17
Results: 11 12 13 14 15 16 17 18
Results: 12 13 14 15 16 17 18 19
Results: 13 14 15 16 17 18 19 20
Results: 14 15 16 17 18 19 20 21
Results: 15 16 17 18 19 20 21 22
Results: 16 17 18 19 20 21 22 23
Results: 17 18 19 20 21 22 23 24
Results: 18 19 20 21 22 23 24 25
Results: 19 20 21 22 23 24 25 26
Results: 20 21 22 23 24 25 26 27
Results: 21 22 23 24 25 26 27 28
Results: 22 23 24 25 26 27 28 29
Results: 23 24 25 26 27 28 29 30
Results: 24 25 26 27 28 29 30 31
Results: 25 26 27 28 29 30 31 32
Results: 26 27 28 29 30 31 32 33
Results: 27 28 29 30 31 32 33 34
Results: 28 29 30 31 32 33 34 35
Results: 29 30 31 32 33 34 35 36
Results: 30 31 32 33 34 35 36 37
Results: 31 32 33 34 35 36 37 38
Results: 32 33 34 35 36 37 38 39
Results: 33 34 35 36 37 38 39 40
Results: 34 35 36 37 38 39 40 41
Results: 35 36 37 38 39 40 41 42
Results: 36 37 38 39 40 41 42 43
Results: 37 38 39 40 41 42 43 44
Results: 38 39 40 41 42 43 44 45
Results: 39 40 41 42 43 44 45 46
Results: 40 41 42 43 44 45 46 47
Results: 41 42 43 44 45 46 47 48
Results: 42 43 44 45 46 47 48 49
Results: 43 44 45 46 47 48 49 50
Results: 44 45 46 47 48 49 50 51
Results: 45 46 47 48 49 50 51 52
Results: 46 47 48 49 50 51 52 53
Results: 47 48 49 50 51 52 53 54
Results: 48 49 50 51 52 53 54 55
Results: 49 50 51 52 53 54 55 56
Results: 50 51 52 53 54 55 56 57
Results: 51 52 53 54 55 56 57 58
Results: 52 53 54 55 56 57 58 59
Results: 53 54 55 56 57 58 59 60
Results: 54 55 56 57 58 59 60 61
Results: 55 56 57 58 59 60 61 62
Results: 56 57 58 59 60 61 62 63
Results: 57 58 59 60 61 62 63 64
Results: 58 59 60 61 62 63 64 65
Results: 59 60 61 62 63 64 65 66
Results: 60 61 62 63 64 65 66 67
Results: 61 62 63 64 65 66 67 68
Results: 62 63 64 65 66 67 68 69
Results: 63 64 65 66 67 68 69 70
Results: 64 65 66 67 68 69 70 71
Results: 65 66 67 68 69 70 71 72
Results: 66 67 68 69 70 71 72 73
Results: 67 68 69 70 71 72 73 74
Results: 68 69 70 71 72 73 74 75
Results: 69 70 71 72 73 74 75 76
Results: 70 71 72 73 74 75 76 77
Results: 71 72 73 74 75 76 77 78
Results: 72 73 74 75 76 77 78 79
Results: 73 74 75 76 77 78 79 80
Results: 74 75 76 77 78 79 80 81
Results: 75 76 77 78 79 80 81 82
Results: 76 77 78 79 80 81 82 83
Results: 77 78 79 80 81 82 83 84
Results: 78 79 80 81 82 83 84 85
Results: 79 80 81 82 83 84 85 86
Results: 80 81 82 83 84 85 86 87
Results: 81 82 83 84 85 86 87 88
Results: 82 83 84 85 86 87 88 89
Results: 83 84 85 86 87 88 89 90
Results: 84 85 86 87 88 89 90 91
Results: 85 86 87 88 89 90 91 92
Results: 86 87 88 89 90 91 92 93
Results: 87 88 89 90 91 92 93 94
Results: 88 89 90 91 92 93 94 95
Results: 89 90 91 92 93 94 95 96
Results: 90 91 92 93 94 95 96 97
Results: 91 92 93 94 95 96 97 98
Results: 92 93 94 95 96 97 98 99
Results: 93 94 95 96 97 98 99 100
Results: 94 95 96 97 98 99 100 101
Results: 95 96 97 98 99 100 101 102
Results: 96 97 98 99 100 101 102 103
Results: 97 98 99 100 101 102 103 104
Results: 98 99 100 101 102 103 104 105
Results: 99 100 101 102 103 104 105 106
Results: 100 101 102 103 104 105 106 107
Results: 101 102 103 104 105 106 107 108
Results: 102 103 104 105 106 107 108 109
Results: 103 104 105 106 107 108 109 110
Results: 104 105 106 107 108 109 110 111
Results: 105 106 107 108 109 110 111 112
Results: 106 107 108 109 110 111 112 113
Results: 107 108 109 110 111 112 113 114
Results: 108 109 110 111 112 113 114 115
Results: 109 110 111 112 113 114 115 116
Results: 110 111 112 113 114 115 116 117
Results: 111 112 113 114 115 116 117 118
Results: 112 113 114 115 116 117 118 119
Results: 113 114 115 116 117 118 119 120
Results: 114 115 116 117 118 119 120 121
Results: 115 116 117 118 119 120 121 122
Results: 116 117 118 119 120 121 122 123
Results: 117 118 119 120 121 122 123 124
Results: 118 119 120 121 122 123 124 125
Results: 119 120 121 122 123 124 125 126
Results: 120 121 122 123 124 125 126 127
Results: 121 122 123 124 125 126 127 128
Results: 122 123 124 125 126 127 128 129
Results: 123 124 125 126 127 128 129 130
Results: 124 125 126 127 128 129 130 131
Results: 125 126 127 128 129 130 131 132
Results: 126 127 128 129 130 131 132 133
Results: 127 128 129 130 131 132 133 134
Shouldn't I expect 0..1023 to be printed?
Am I misunderstanding something? I read the intro sections of the NVIDIA CUDA Programming Guide, and I thought this is how things worked.
Of course, I've run into plenty of annoying bugs and design limitations so far (for example, the 8000 series' lack of double-precision support), as well as errors that CUDA causes with iomanip manipulators (setw, setprecision) if you write "std::..." instead of a general "using namespace std;".
So I guess I expect some wackiness, but I'm desperate to figure out what the heck is going on here.
Change:
for(unsigned int t=0; t<1024/8;t++) {
to:
for(unsigned int t=0; t<1024; t+=8) {
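You have 2 x 512 = 1024 threads, whose indices range from 0..1023. Each thread writes its own index to the corresponding location in MyBlock, so you should expect an array whose values equal its indices. The kernel is fine; the problem is the printing loop. It advances t by 1 per iteration but prints Result[t] through Result[t+7], so consecutive rows overlap by seven elements, and after 1024/8 = 128 rows it has shown only Result[0] through Result[134]. Stepping t by 8 up to 1024 prints each element exactly once.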