Add ROCm as alternative to CUDA for plugin use #461

Open: wants to merge 2 commits into base: master

ryanhankins
Contributor

Description of changes:

See the commit messages for more detail. Add a --with-rocm flag to configure.ac to switch between CUDA and ROCm GPU calls, in order to support AMD GPUs. Add code to abstract the CUDA calls so that, when the --with-rocm option is used, the ROCm alternatives are called instead.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

ryanhankins changed the title Merge6 Add ROCm as alternative to CUDA for plugin use. Jun 27, 2024
ryanhankins changed the title Add ROCm as alternative to CUDA for plugin use. Add ROCm as alternative to CUDA for plugin use Jun 27, 2024
ryanhankins force-pushed the merge6 branch 8 times, most recently from 98282b4 to 064fb2c June 27, 2024 18:32
ryanhankins marked this pull request as ready for review June 28, 2024 11:19
ryanhankins requested review from bwbarrett and a team as code owners June 28, 2024 11:19
@liralon
Contributor

liralon commented Jun 28, 2024

@ryanhankins Can you please add some information to the commit message about which platforms you have tested this functionality on?

The nccl_net_ofi_cu* calls map directly to CUDA methods.  Instead of this
mapping, insert indirection via nccl_net_ofi_gpu methods so that the
implementation of the methods depends on CUDA, but the methods
themselves can be called for different underlying frameworks (such as
ROCm).

Signed-off-by: Ryan Hankins <ryan.hankins@hpe.com>
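
The indirection described in this commit message can be pictured with a minimal sketch. Every name below (`example_gpu_get_device_count`, `fake_cuda_get_device_count`) is hypothetical and chosen for illustration; none is taken from the plugin's source, and the vendor call is replaced by a stand-in so the sketch builds without a GPU SDK.

```c
#include <assert.h>
#include <stddef.h>

/* Framework-neutral entry point (hypothetical name). Callers use only
 * this declaration and never see a CUDA or ROCm symbol. */
int example_gpu_get_device_count(int *count);

/* Stand-in for a vendor call so the sketch builds without any GPU SDK.
 * A real backend file would call into the CUDA library here. */
static int fake_cuda_get_device_count(int *count)
{
    *count = 2;   /* pretend two devices are present */
    return 0;     /* 0 means success in this sketch */
}

/* "CUDA" implementation of the neutral entry point. A ROCm build would
 * link a different file that defines the same function. */
int example_gpu_get_device_count(int *count)
{
    assert(count != NULL);
    return fake_cuda_get_device_count(count);
}
```

With this shape, supporting another framework should only require a second definition of the neutral function; the call sites would stay unchanged.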
ROCm provides a CUDA-like interface for working with AMD GPUs.
Provide a compile-time option to build with ROCm instead of CUDA.

1. Add --with-rocm= flag to ./configure.
2. Make all CUDA calls "gpu" calls, which are independent of the
   underlying framework.
3. Switch between _rocm and _cuda files at compile time to make the
   appropriate calls.
4. When building for RCCL (AMD's NCCL), generate a rccl-net.so-named
   plugin for binary compatibility.

Tested on:

1. HPE Cray EX with EX235A BardPeak GPUs + 200 Gb Slingshot adapters.
2. HPE Cray EX with NVIDIA A100 SXM4 80GB GPUs + 200 Gb Slingshot
   adapters.

Signed-off-by: Ryan Hankins <ryan.hankins@hpe.com>
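
Items 3 and 4 of the commit message above (selecting a backend at compile time and naming the plugin accordingly) can be sketched as follows. This is an illustration under assumptions, not the PR's mechanism: the commit switches between separate _rocm and _cuda files, whereas this sketch uses a single preprocessor macro to stay self-contained. The macro `EXAMPLE_HAVE_ROCM` and both function names are hypothetical, and "nccl-net" is assumed here as the name stem for the CUDA build.

```c
/* Hypothetical macro: assume the build system defines it when the
 * project is configured with --with-rocm. The real macro name is not
 * given in the PR text. */
#ifdef EXAMPLE_HAVE_ROCM
#define EXAMPLE_GPU_BACKEND "rocm"
#define EXAMPLE_PLUGIN_STEM "rccl-net"   /* name stem for RCCL builds */
#else
#define EXAMPLE_GPU_BACKEND "cuda"
#define EXAMPLE_PLUGIN_STEM "nccl-net"   /* assumed stem for CUDA builds */
#endif

/* Report which backend this translation unit was compiled for. */
const char *example_gpu_backend(void)
{
    return EXAMPLE_GPU_BACKEND;
}

/* Report the plugin name stem that matches the selected backend. */
const char *example_plugin_stem(void)
{
    return EXAMPLE_PLUGIN_STEM;
}
```

Compiling with `-DEXAMPLE_HAVE_ROCM` selects the ROCm branch; without it, the CUDA branch is used.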