AWS OFI NCCL v1.9.2

@AmedeoSapio released this 17 Jun 23:33
· 380 commits to master since this release

This release is intended only for use on AWS P* instances. A general release that supports other Libfabric networks will be made in the near future. This bugfix release requires Libfabric v1.18.0 or later and supports NCCL 2.21.5-1, while maintaining backward compatibility with older NCCL versions (NCCL v2.4.8 and later).

Bug Fixes:

  • Improved tuner model to make better decisions on P5 instances.
  • Added support in the RDMA protocol for truncation when the size passed to the isend call is greater than the size in the corresponding irecv.
  • Fixed bug that prevented the tuner from getting loaded with NCCL 2.19 and 2.20.
  • Fixed the logging statement that reports whether a domain is created per thread or per process.
  • Updated plugin to not advertise global MR support, to avoid a performance regression in user-registered buffers.

The plugin has been tested with the following Libfabric providers using the tests bundled in the source code and the nccl-tests suite:

  • efa

Checksum (sha512) for the release tarball:

1e344f38baa1080c04d2c99a1390f51e2a9ce2a57d69c7494061bf4e5da5a4310328bafc323cb36f43b5fcd0d330bd1bd5eec257596de2125aa5c38096b78a01  aws-ofi-nccl-1.9.2-aws.tar.gz
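
The checksum above can be compared against a downloaded tarball before building. The following is a minimal sketch in Python using only the standard library; the helper names `sha512_of_file` and `verify_tarball` are illustrative and not part of the plugin, and the sketch assumes the tarball has been saved locally under the filename listed above.

```python
import hashlib

# Expected digest, copied from the checksum line in these release notes.
EXPECTED_SHA512 = "1e344f38baa1080c04d2c99a1390f51e2a9ce2a57d69c7494061bf4e5da5a4310328bafc323cb36f43b5fcd0d330bd1bd5eec257596de2125aa5c38096b78a01"


def sha512_of_file(path, chunk_size=1024 * 1024):
    """Return the hex SHA-512 digest of a file, reading it in chunks."""
    digest = hashlib.sha512()
    with open(path, "rb") as f:
        # Read fixed-size chunks so large tarballs are not loaded into memory at once.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_tarball(path, expected=EXPECTED_SHA512):
    """Return True if the file's SHA-512 digest matches the expected hex string."""
    return sha512_of_file(path) == expected.strip().lower()
```

Calling `verify_tarball("aws-ofi-nccl-1.9.2-aws.tar.gz")` from the download directory returns `True` only when the local file matches the published digest; a `False` result suggests a corrupted or altered download.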