CUDA kernel: per-thread array length based on threadIdx
This is part of the pseudocode I am implementing in CUDA as part of an image reconstruction algorithm:
for each xbin (0 -> detectorXDim/2 - 1):
    for each ybin (0 -> detectorYDim - 1):
        rayInit = (xbin*xBinSize + 0.5, ybin*yBinSize + 0.5, -detectordistance)
        rayEnd = beamFocusCoord
        slopeVector = rayEnd - rayInit
        // knowing that r = rayInit + t*slopeVector:
        //   x = rayInit[0] + t*slopeVector[0]
        //   y = rayInit[1] + t*slopeVector[1]
        //   z = rayInit[2] + t*slopeVector[2]
        // to find the ray's xx intersections:
        for each xinteger (xbin+1 -> detectorXDim/2):
            solve t for x = xinteger*xBinSize
            find the corresponding y and z
            add them to the intersections array
        // find the ray's yy intersections (analogous to the xx intersections)
        // find the ray's zz intersections (analogous to the xx intersections)
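For reference, the per-plane intersection step above is only a little arithmetic: solve r = rayInit + t*slopeVector for the t where x crosses a grid plane, then evaluate y and z at that t. A minimal device-side sketch, using float3 names taken from the pseudocode (this is illustrative, not my actual kernel):

__device__ float3 intersectXPlane(float3 rayInit, float3 slopeVector, float xPlane){
    // solve x = rayInit.x + t*slopeVector.x for t at the plane x = xPlane
    float t = (xPlane - rayInit.x) / slopeVector.x;
    // evaluate the remaining coordinates at that t
    return make_float3(xPlane,
                       rayInit.y + t * slopeVector.y,
                       rayInit.z + t * slopeVector.z);
}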
So far, this is what I have come up with:
__global__ void sysmat(int xfocus, int yfocus, int zfocus, int xbin, int xbinsize, int ybin, int ybinsize, int zbin, int projecoes){

    int tx = threadIdx.x, ty = threadIdx.y, tz = threadIdx.z, bx = blockIdx.x, by = blockIdx.y, i, x, y, z;
    int idx = ty + by*blocksize;
    int idy = tx + bx*blocksize;
    float t;

    int slopeVectorx = xfocus - idx*xbinsize + 0.5;
    int slopeVectory = yfocus - idy*ybinsize + 0.5;
    int slopeVectorz = zfocus - zdetector;
    __syncthreads();

    // points where the ray intersects the x axis
    int xint = idx + 1;
    int yint = idy + 1;
    int* intersectionsx[(detectorXDim/2 - xint) + (detectorYDim - yint) + (zfocus)];
    int* intersectionsy[(detectorXDim/2 - xint) + (detectorYDim - yint) + (zfocus)];
    int* intersectionsz[(detectorXDim/2 - xint) + (detectorYDim - yint) + (zfocus)];

    for(; xint < detectorXDim/2; xint++){
        x = xint*xbinsize;
        t = (x - idx)/slopeVectorx;
        y = idy + t*slopeVectory;
        z = z + t*slopeVectorz;
        intersectionsx[xint-1] = x;
        intersectionsy[xint-1] = y;
        intersectionsz[xint-1] = z;
        __syncthreads();
    }
    ...
}
This is just a piece of the code. I know there may be some errors (you can point them out if they are blatantly wrong), but what concerns me most is this:
Each thread (which corresponds to a detector bin) needs three arrays in which to save the points where the ray passing through that thread/bin intersects multiples of the x, y and z axes. Each array's length depends on the thread/bin's position (its index) in the detector and on beamFocusCoord (which is fixed). To do this I wrote the piece of code below, which I am certain cannot work (I confirmed it with a small test kernel, and it returns the error "expression must have constant value"):
    int* intersectionsx[(detectorXDim/2 - xint) + (detectorYDim - yint) + (zfocus)];
    int* intersectionsy[(detectorXDim/2 - xint) + (detectorYDim - yint) + (zfocus)];
    int* intersectionsz[(detectorXDim/2 - xint) + (detectorYDim - yint) + (zfocus)];
So, in the end, I want to know whether there is an alternative to this piece of code, in which an array's length depends on the index of the thread allocating that array.
Thank you in advance ;)
EDIT: Given that each thread has to save an array with the coordinates of the intersections between the ray (which goes from the beam source to the detector) and the xx, yy and zz axes, and that the spatial dimensions are around 1400x3600x60 (I don't have the exact numbers at the moment, but they are very close to the real values), is this problem feasible with CUDA?
For example, thread (0,0) will have 1400 intersections on the x axis, 3600 on the y axis and 60 on the z axis, meaning I would have to create an array of size (1400+3600+60)*sizeof(float), which is around 20 KB per thread.
Since each thread exceeds the 16 KB of local memory, that is out of the question. The other alternative was to allocate those arrays in global memory but, with some more math, we get (1400+3600+60)*4 bytes * number of threads (i.e. 1400*3600), which comes to roughly 100 GB and also exceeds the amount of global memory available :(
So I am running out of ideas to deal with this problem and any help is appreciated.
No.
Every piece of memory in CUDA must be known at kernel-launch time. You can't allocate/deallocate/change anything while the kernel is running. This is true for global memory, shared memory and registers.
The common workaround is to allocate the maximum amount of memory needed beforehand. This can be as simple as allocating the maximum size needed by any one thread, once per thread, or as complex as summing up all of the per-thread sizes into one total and calculating the appropriate per-thread offsets into that array. That is a tradeoff between memory consumption and offset-computation time.
Go for the simple solution if you can, and for the complex one if memory limitations force you to.
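A minimal sketch of the simple variant, assuming the host computes a worst-case per-thread count (maxPerThread) from the detector dimensions and allocates one flat buffer of numThreads * maxPerThread floats before the launch (kernel name and parameters here are illustrative, not taken from the poster's code):

__global__ void storeIntersections(float* buf, int maxPerThread, int threadsPerRow){
    // flat thread id over the 2D grid of detector bins
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    int idy = threadIdx.y + blockIdx.y * blockDim.y;
    int tid = idy * threadsPerRow + idx;

    // this thread's private, fixed-size slice of the pre-allocated global buffer
    float* mine = buf + (size_t)tid * maxPerThread;

    // fill mine[0 .. count-1] with intersection coordinates,
    // keeping count <= maxPerThread
    mine[0] = 0.0f; // placeholder write so the sketch is self-contained
}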
Why are you not using textures? Using a 2D or 3D texture would make this problem much easier. The GPU is designed to do very fast floating-point interpolation, and CUDA includes excellent support for it. The literature has examples of projection reconstruction on the GPU, e.g. "Accelerating simultaneous algebraic reconstruction technique with motion compensation using CUDA-enabled GPU", and textures are an integral part of those algorithms. Your own manual coordinate calculations can only be slower and more error prone than what the GPU provides, unless you need something weird like sinc interpolation.
1400x3600x60 is a little big for a single 3D texture, but you could break your problem up into 2D slices, 3D sub-volumes, or hierarchical multi-resolution reconstruction. These have all been used by other researchers. Just search PubMed.
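To illustrate the idea, here is a rough sketch of sampling a volume along a ray with hardware trilinear interpolation, using the texture object API available in newer CUDA versions. It assumes a cudaTextureObject_t already built over a cudaArray holding the volume with cudaFilterModeLinear; the kernel and parameter names are illustrative:

__global__ void sampleAlongRay(cudaTextureObject_t volumeTex, float3 rayInit, float3 slopeVector, int nSamples, float* out){
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if(i >= nSamples) return;

    // step parametrically from rayInit towards rayInit + slopeVector
    float t = (float)i / (float)(nSamples - 1);

    // tex3D returns the trilinearly interpolated voxel value at a
    // (possibly fractional) coordinate, so no manual intersection math is needed
    out[i] = tex3D<float>(volumeTex,
                          rayInit.x + t * slopeVector.x,
                          rayInit.y + t * slopeVector.y,
                          rayInit.z + t * slopeVector.z);
}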