CPU multithreading can be easily accomplished by adding !$acc directives to loops and adding the -ta=multicore command line option. Since the host is the compute device in this mode, no host-device memory transfers are needed, so no update device (and maybe not even declare create) directives are required, and this should be a relatively simple task. It would also require requesting multiple cores per task, but SLURM already supports this. This feature would be particularly useful for simulations that use unified memory on GH200 and MI300A chips, where pre_process and post_process can take a significant amount of time if run on only one core. It would also potentially be useful for problems that involve STLs, which require a ray tracing step in pre_process, and when derived quantities like vorticity or Q-criterion are needed in post_process. I know this works with NVHPC, but I haven't tried it with CCE yet.
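A minimal sketch of the idea, not code from this repository: the program below computes the vorticity of a 2D solid-body rotation field with a single !$acc parallel loop. The program name, array names, and grid size are all made up for illustration. With NVHPC it could be built with something like `nvfortran -acc -ta=multicore vort.f90` (newer NVHPC releases spell this `-acc=multicore`). Compilers without OpenACC enabled treat the directive as a comment and run the loop serially.

```fortran
! Hypothetical stand-alone example; none of these names come from the code base.
program vorticity_multicore
    implicit none
    integer, parameter :: n = 512
    real(kind(0d0)), allocatable :: u(:, :), v(:, :), omega(:, :)
    real(kind(0d0)) :: dx, x, y
    integer :: i, j

    allocate (u(n, n), v(n, n), omega(n, n))
    dx = 1d0/real(n - 1, kind(0d0))

    ! Solid-body rotation: u = -y, v = x, so dv/dx - du/dy = 2 everywhere
    do j = 1, n
        do i = 1, n
            x = (i - 1)*dx
            y = (j - 1)*dx
            u(i, j) = -y
            v(i, j) = x
        end do
    end do

    omega = 0d0

    ! On the multicore target the arrays already live in host memory,
    ! so no data or update directives are used here.
    !$acc parallel loop collapse(2)
    do j = 2, n - 1
        do i = 2, n - 1
            ! Second-order central differences on the interior points
            omega(i, j) = (v(i + 1, j) - v(i - 1, j))/(2d0*dx) &
                          - (u(i, j + 1) - u(i, j - 1))/(2d0*dx)
        end do
    end do

    ! Should be near round-off, since the exact vorticity is 2
    print *, 'max |omega - 2| on interior:', &
        maxval(abs(omega(2:n - 1, 2:n - 1) - 2d0))
end program vorticity_multicore
```

With NVHPC, the number of cores used can, as far as I know, be limited through the ACC_NUM_CORES environment variable, which would pair naturally with the cores-per-task request in SLURM.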
You can also use all of the cores on a CPU die via MPI, so it's unclear whether OpenACC gives much advantage here, no? Historically, OpenMP has been used for this kind of multithreading, but its advantages over the latest MPI implementations have mostly gone away.
The benefit of using multithreading over MPI is that file_per_process can be used, and domain decomposition doesn't have to be performed twice. OpenACC's multithreading showed decent speedups on the course project I finished recently.
It seems reasonable... did you compare it against MPI?