AWS OFI NCCL v1.9.2
AmedeoSapio released this 17 Jun 23:33
This release is intended only for use on AWS P* instances. A general release that supports other Libfabric networks will be made in the near future. This is a bugfix release which requires Libfabric v1.18.0 or later and supports NCCL 2.21.5-1 while maintaining backward compatibility with older NCCL versions (NCCL v2.4.8 and later).
Bug Fixes:
- Improved tuner model to make better decisions on P5 instances.
- Added support in the RDMA protocol for truncation when the size passed to the isend call is greater than the size in the corresponding irecv.
- Fixed bug that prevented the tuner from getting loaded with NCCL 2.19 and 2.20.
- Fixed the logging statement that reports whether a domain is created per thread or per process.
- Updated plugin to not advertise global MR support, to avoid a performance regression in user-registered buffers.
The plugin has been tested with the following Libfabric providers using the tests bundled in the source code and the nccl-tests suite:
- efa
Checksum (sha512) for the release tarball:
1e344f38baa1080c04d2c99a1390f51e2a9ce2a57d69c7494061bf4e5da5a4310328bafc323cb36f43b5fcd0d330bd1bd5eec257596de2125aa5c38096b78a01 aws-ofi-nccl-1.9.2-aws.tar.gz
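To check a downloaded tarball against the checksum above, something like the following sketch should work. It uses only the Python standard library. The helper names (`sha512_of_file`, `matches_release_checksum`) are illustrative and not part of the plugin, and the script assumes the tarball path is passed as the first command-line argument.

```python
import hashlib
import hmac
import sys

# SHA-512 digest published in these release notes for aws-ofi-nccl-1.9.2-aws.tar.gz.
EXPECTED_SHA512 = (
    "1e344f38baa1080c04d2c99a1390f51e2a9ce2a57d69c7494061bf4e5da5a4310328bafc"
    "323cb36f43b5fcd0d330bd1bd5eec257596de2125aa5c38096b78a01"
)


def sha512_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-512 digest of the file at `path`, read in chunks."""
    digest = hashlib.sha512()
    with open(path, "rb") as f:
        # Read fixed-size chunks so large tarballs are not loaded into memory at once.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_release_checksum(path, expected=EXPECTED_SHA512):
    """Return True if the file's SHA-512 digest equals `expected` (case-insensitive)."""
    return hmac.compare_digest(sha512_of_file(path), expected.lower())


if __name__ == "__main__":
    # Hypothetical usage: python verify.py aws-ofi-nccl-1.9.2-aws.tar.gz
    tarball = sys.argv[1]
    ok = matches_release_checksum(tarball)
    print("checksum OK" if ok else "checksum MISMATCH")
    sys.exit(0 if ok else 1)
```

On systems with GNU coreutils, running `sha512sum aws-ofi-nccl-1.9.2-aws.tar.gz` and comparing the output with the line above achieves the same thing.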