
Parallel For Loop Inside Kernel

Hüseyin Tuğrul BÜYÜKIŞIK edited this page Jul 7, 2023 · 3 revisions

Inside the kernel code string, users can spread a loop over the 256 GPU threads assigned to each object (or to each lower-energy candidate):

parallelFor(N,{ int i = loopId; /* do something with i */ });

This is equivalent to the following guarded loop, executed by every thread of the workgroup:

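{
    // ITERS: the loop count given as the first argument (N in the call above)
    const int nIter = (ITERS / WorkGroupThreads) + 1;
    for(int iGPGPU=0;iGPGPU<nIter;iGPGPU++)
    {
        // each thread takes every WorkGroupThreads-th index, starting at its own threadId
        const int loopId = threadId + WorkGroupThreads * iGPGPU;
        if(loopId < ITERS)
        {
            // code block
        }
    }
}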
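The index arithmetic above can be checked on the CPU. The following is a minimal sketch in plain C, not part of the library: `simulate_parallel_for`, `iters`, `wg_threads`, and `visits` are hypothetical names chosen for illustration, and the workgroup threads are simulated one after another instead of running in parallel. It counts how often each loop index is executed, so a caller can verify that every index below `iters` is visited exactly once.

```c
#include <assert.h>

/* CPU-side simulation (illustrative only) of how the expanded loop
 * distributes `iters` loop indices over `wg_threads` workgroup threads.
 * visits[i] receives the number of times loop index i is executed.
 * Returns the number of passes (nIter) that each thread makes. */
int simulate_parallel_for(int iters, int wg_threads, int *visits)
{
    assert(iters >= 0 && wg_threads > 0);

    for (int i = 0; i < iters; i++) {
        visits[i] = 0;
    }

    /* same formula as in the expanded macro */
    const int n_iter = (iters / wg_threads) + 1;

    /* the outer loop stands in for the threads of one workgroup */
    for (int thread_id = 0; thread_id < wg_threads; thread_id++) {
        for (int pass = 0; pass < n_iter; pass++) {
            const int loop_id = thread_id + wg_threads * pass;
            if (loop_id < iters) {
                visits[loop_id]++; /* stands in for the user's code block */
            }
        }
    }
    return n_iter;
}
```

With `iters = 1000` and `wg_threads = 256`, each simulated thread makes 4 passes and every index is visited once. With `iters = 512`, each thread makes 3 passes and the last pass does no work, because the `+ 1` in the formula adds one empty pass whenever the loop count is an exact multiple of the workgroup size.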

A second version adds a barrier that synchronizes the workgroup threads after each pass of the loop:

parallelForWithBarrier(N,{ int i = loopId; /* do something with i */ });

This is equivalent to the following code:

{
    const int nIter = (ITERS / WorkGroupThreads) + 1;
    for(int iGPGPU=0;iGPGPU<nIter;iGPGPU++)
    {
        const int loopId = threadId + WorkGroupThreads * iGPGPU;
        if(loopId < ITERS)
        {
            // code block
        }
        // outside the if, so every thread of the workgroup reaches the barrier on every pass
        barrier(CLK_LOCAL_MEM_FENCE);
    }
}
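The placement of the barrier matters: it sits after the guarded block, not inside it, so threads whose `loopId` is out of range still reach it. The sketch below is a plain C simulation under the same assumptions as any CPU model of this loop (threads run one after another, and `count_barriers`, `barriers`, and `bodies` are hypothetical names, not library functions). It records, per simulated thread, how many times the code block runs and how many times the barrier is reached.

```c
#include <assert.h>

/* CPU-side simulation (illustrative only) of the barrier variant.
 * For each simulated thread t:
 *   bodies[t]   = number of times the guarded code block runs
 *   barriers[t] = number of times the barrier is reached
 * Both arrays must hold at least wg_threads elements. */
void count_barriers(int iters, int wg_threads, int *barriers, int *bodies)
{
    assert(iters >= 0 && wg_threads > 0);

    /* same formula as in the expanded macro */
    const int n_iter = (iters / wg_threads) + 1;

    for (int thread_id = 0; thread_id < wg_threads; thread_id++) {
        barriers[thread_id] = 0;
        bodies[thread_id] = 0;
        for (int pass = 0; pass < n_iter; pass++) {
            const int loop_id = thread_id + wg_threads * pass;
            if (loop_id < iters) {
                bodies[thread_id]++;   /* guarded code block */
            }
            barriers[thread_id]++;     /* barrier is outside the guard */
        }
    }
}
```

For `iters = 10` and `wg_threads = 4`, threads 0 and 1 run the code block three times and threads 2 and 3 run it twice, yet all four reach the barrier three times. Equal barrier counts across the workgroup are what OpenCL requires of a barrier inside a loop.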

The neural network uses parallelFor to compute 8 neurons at once.
