Parallel For Loop Inside Kernel
Hüseyin Tuğrul BÜYÜKIŞIK edited this page Jul 7, 2023 · 3 revisions
Users can easily harness the power of 256 GPU threads per object (or per lower-energy candidate) inside the kernel code string:
parallelFor(N,{ int i = loopId; /* do something with i */ });
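This is equivalent to the following conditional execution on all threads of the workgroup, where ITERS stands for the iteration count passed as the first argument (N above) and the code block is the second argument: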
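{
    // Number of passes each thread makes; the +1 covers a partial last pass.
    const int nIter = (ITERS / WorkGroupThreads) + 1;
    for(int iGPGPU=0;iGPGPU<nIter;iGPGPU++)
    {
        // Strided index: in each pass, thread t handles element t + WorkGroupThreads * pass.
        const int loopId = threadId + WorkGroupThreads * iGPGPU;
        if(loopId < ITERS) // skip indices past the end
        {
            // code block
        }
    }
}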
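To check that this strided pattern touches every index exactly once, the expansion can be replayed on the CPU. The following is a minimal sketch, not part of the library: the function name `emulateParallelFor` and the sequential loop over `threadId` are assumptions made for illustration, and the workgroup size is passed as a plain parameter.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical CPU-side replay of the parallelFor expansion (illustration only).
// Returns, for each index in [0, iters), how many times the loop body ran for it.
std::vector<int> emulateParallelFor(int iters, int workGroupThreads)
{
    assert(iters >= 0 && workGroupThreads > 0);
    std::vector<int> visits(static_cast<std::size_t>(iters), 0);

    // Same pass count as the kernel-side expansion.
    const int nIter = (iters / workGroupThreads) + 1;

    // On the GPU these threads run concurrently; here they are replayed one by one.
    for (int threadId = 0; threadId < workGroupThreads; threadId++)
    {
        for (int iGPGPU = 0; iGPGPU < nIter; iGPGPU++)
        {
            const int loopId = threadId + workGroupThreads * iGPGPU;
            if (loopId < iters)
            {
                visits[static_cast<std::size_t>(loopId)]++; // stands in for the code block
            }
        }
    }
    return visits;
}
```

Calling `emulateParallelFor(1000, 256)` yields a vector of 1000 ones: each index is handled by exactly one thread in exactly one pass. When the iteration count is an exact multiple of the workgroup size, the extra pass from the `+ 1` is filtered out by the `loopId < iters` guard.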
There is also a version with a barrier for synchronizing the workgroup threads:
parallelForWithBarrier(N,{ int i = loopId; /* do something with i */ });
This is equivalent to the following code:
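{
    // Number of passes each thread makes; the +1 covers a partial last pass.
    const int nIter = (ITERS / WorkGroupThreads) + 1;
    for(int iGPGPU=0;iGPGPU<nIter;iGPGPU++)
    {
        const int loopId = threadId + WorkGroupThreads * iGPGPU;
        if(loopId < ITERS) // skip indices past the end
        {
            // code block
        }
        // Outside the if, so every thread reaches the barrier in every pass.
        barrier(CLK_LOCAL_MEM_FENCE);
    }
}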
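The barrier matters when one pass reads values that other threads wrote in the previous pass. The sketch below replays that ordering on the CPU under stated assumptions: the function name `emulateParallelForWithBarrier` and the neighbour-read pattern are hypothetical, chosen only for illustration, and the buffer is assumed to live in local memory on the GPU, since `CLK_LOCAL_MEM_FENCE` orders local memory accesses. A sequential loop cannot reproduce a real race; it only shows the pass-by-pass order that the barrier guarantees.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical CPU-side replay of parallelForWithBarrier (illustration only).
// Each element in pass k reads an element written by a different thread in pass k-1.
std::vector<int> emulateParallelForWithBarrier(int iters, int workGroupThreads)
{
    assert(iters >= 0 && workGroupThreads > 0);
    std::vector<int> data(static_cast<std::size_t>(iters), 0);
    const int nIter = (iters / workGroupThreads) + 1;

    // The pass loop is outermost: all threads finish pass k before any starts pass k+1,
    // which is the ordering the barrier enforces on the GPU.
    for (int iGPGPU = 0; iGPGPU < nIter; iGPGPU++)
    {
        for (int threadId = 0; threadId < workGroupThreads; threadId++)
        {
            const int loopId = threadId + workGroupThreads * iGPGPU;
            if (loopId < iters)
            {
                if (iGPGPU == 0)
                {
                    data[static_cast<std::size_t>(loopId)] = 1;
                }
                else
                {
                    // Neighbouring thread's element from the previous (complete) pass.
                    const int src = workGroupThreads * (iGPGPU - 1)
                                  + (threadId + 1) % workGroupThreads;
                    data[static_cast<std::size_t>(loopId)] =
                        data[static_cast<std::size_t>(src)] + 1;
                }
            }
        }
        // barrier(CLK_LOCAL_MEM_FENCE) would sit here in the kernel.
    }
    return data;
}
```

With this pattern, element `i` ends up holding `i / workGroupThreads + 1`, the number of the pass that produced it. On a real GPU, dropping the barrier would let a thread read its neighbour's slot before the neighbour had written it.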
The neural network uses parallelFor to compute 8 neurons at once.