Releases: aws/aws-ofi-nccl
AWS OFI NCCL v1.13.2
v1.13.2-aws (2024-12-06)
This release is intended only for use on AWS P* instances. A general release that supports other libfabric networks may be made in the near future.
With this release, building with platform-aws requires libfabric 1.22.0amzn4.0 or greater. AWS customers are generally advised to track the latest available EFA Installer for performance improvements and bug fixes.
The 1.13.x release series supports NCCL 2.23.4-1 while maintaining backward compatibility with older NCCL versions (NCCL v2.17.1 and later).
Bug Fixes:
- Tuner Improvements:
  - Fixed algorithm selection for larger ranks and message sizes.
  - Re-calibrated the tuner for AllGather and ReduceScatter regions for 0x7 bitmask on P5en, optimizing performance for larger messages.
  - Added tuner support for AllGather and ReduceScatter regions for 0x0 bitmask on P5en.
- Resolved a performance issue by preventing the eager protocol when RDMA writes are in flight, improving small AllReduce collective performance.
Note: dmabuf support is now turned off by default. Users can enable it explicitly using OFI_NCCL_DISABLE_DMABUF=0 if needed.
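As a minimal sketch of the note above, the following Python helper builds an environment with OFI_NCCL_DISABLE_DMABUF set before launching a job that loads the plugin. The helper name `plugin_env` and the launch command in the comment are hypothetical; only the variable name and its 0/1 meaning come from these release notes.

```python
import os


def plugin_env(enable_dmabuf, base_env=None):
    """Return a copy of the environment with OFI_NCCL_DISABLE_DMABUF set.

    Per the release notes, "0" enables dmabuf and "1" disables it.
    The input mapping is never modified.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["OFI_NCCL_DISABLE_DMABUF"] = "0" if enable_dmabuf else "1"
    return env


if __name__ == "__main__":
    env = plugin_env(enable_dmabuf=True)
    print("OFI_NCCL_DISABLE_DMABUF=" + env["OFI_NCCL_DISABLE_DMABUF"])
    # Pass `env` to whatever process loads the plugin, for example:
    # subprocess.run(["./my_nccl_job"], env=env, check=True)  # hypothetical command
```

Building a copy rather than mutating os.environ keeps the setting scoped to the launched job.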
Checksum (sha512) for the release tarball:
4c0ac3144f178062fda9e86b50bb1784822e8fdbdffadf41cdbb30839456c4e912254ff12a5b0a8c63abbe910597fd14211a42572a451d10e01932100013971e aws-ofi-nccl-1.13.2-aws.tar.gz
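One possible way to check a downloaded tarball against the sha512 line above is sketched below in Python. The function names are illustrative and not part of the project; the sketch assumes the tarball sits in the current directory under the file name given in the checksum line.

```python
import hashlib
from pathlib import Path


def sha512_hex(path, chunk_size=1 << 20):
    """Return the hex SHA-512 digest of the file at `path`, read in chunks."""
    digest = hashlib.sha512()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def parse_checksum_line(line):
    """Split a '<hex digest> <filename>' line into (digest, filename)."""
    digest, filename = line.split(maxsplit=1)
    return digest.lower(), filename.strip()


def verify_tarball(path, checksum_line):
    """Return True if the file's SHA-512 matches the digest in the line."""
    expected, _ = parse_checksum_line(checksum_line)
    return sha512_hex(path) == expected


if __name__ == "__main__":
    # Checksum line copied from the v1.13.2-aws notes above.
    line = (
        "4c0ac3144f178062fda9e86b50bb1784822e8fdbdffadf41cdbb30839456c4e9"
        "12254ff12a5b0a8c63abbe910597fd14211a42572a451d10e01932100013971e"
        " aws-ofi-nccl-1.13.2-aws.tar.gz"
    )
    _, name = parse_checksum_line(line)
    if Path(name).exists():
        print("OK" if verify_tarball(name, line) else "MISMATCH")
    else:
        print(name + " not found; download it first")
```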
AWS OFI NCCL v1.13.1
(2024-11-26)
This release is intended only for use on AWS P* instances. A general release that supports other libfabric networks may be made in the near future.
With this release, building with platform-aws requires libfabric 1.22.0amzn4.0 or greater. AWS customers are generally advised to track the latest available EFA Installer for performance improvements and bug fixes.
The 1.13.x release series supports NCCL 2.23.4-1 while maintaining backward compatibility with older NCCL versions (NCCL v2.17.1 and later).
Supported Distributions
- Amazon Linux 2
- Amazon Linux 2023
- Ubuntu 20.04 LTS, 22.04 LTS.
For releases before v1.6.0, we generally created releases from two separate
branches, an AWS-specific branch and a general release branch. With v1.6.0, we
have unified the code into a single branch, and made the AWS-specific parts a
compile-time option. When a feature (or entire release) only supports one of
the two variants, we note that in the release notes.
What's Changed
This release contains no functional changes compared to v1.13.0-aws. It merely updates the version set in AC_INIT to include the -aws suffix to match the tag name and ensure generated artifacts are named correctly.
Checksum (sha512) for the release tarball:
b71afd2e7776b77392c91abb818fa011e415f31fa9061556cd725d7a52eb4101b45a10fe91284ec7cff06a9653456e95ae70a472affb32f68e01b1ce5e49ff83 aws-ofi-nccl-1.13.1-aws.tar.gz
v1.13.0-aws (2024-11-18)
This release is intended only for use on AWS P* instances. A general release that supports other libfabric networks may be made in the near future.
With this release, building with platform-aws requires libfabric 1.22.0amzn4.0 or greater. AWS customers are generally advised to track the latest available EFA Installer for performance improvements and bug fixes.
The 1.13.x release series supports NCCL 2.23.4-1 while maintaining backward compatibility with older NCCL versions (NCCL v2.17.1 and later).
New features:
- AWS P5en platform support was added.
- Support was added for the NCCL v3 tuner API. The tuner now supports multiple platforms and multiple collectives.
- Scheduling improvements were made to the plugin RDMA protocol. In multirail configurations, this is expected to balance traffic more optimally.
- dmabuf memory registration support was added. Users facing problems with dmabuf may disable dmabuf with OFI_NCCL_DISABLE_DMABUF=1.
Breaking changes:
- As mentioned above, building with support for platform-aws now requires libfabric version 1.22.0amzn4.0 or greater.
- Under CUDA, the plugin now statically links the CUDA runtime by default. Packagers preferring to dynamically link CUDA may pass --enable-cudart-dynamic at configure time to disable this.
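Packagers who want to check an installed libfabric version string against the 1.22.0amzn4.0 minimum before building could use something like the Python sketch below. The parsing and ordering rules are assumptions made for illustration: the upstream part is compared first, then the amzn suffix, and a version without a suffix sorts below the same upstream version with one. The project's configure script may apply a different check.

```python
import re

# Assumed format: "<upstream version>" optionally followed by "amzn<vendor version>".
_VERSION_RE = re.compile(r"^(\d+(?:\.\d+)*)(?:amzn(\d+(?:\.\d+)*))?$")


def parse_libfabric_version(text):
    """Parse e.g. '1.22.0amzn4.0' into ((1, 22, 0), (4, 0))."""
    match = _VERSION_RE.match(text.strip())
    if match is None:
        raise ValueError(f"unrecognized version string: {text!r}")
    upstream = tuple(int(part) for part in match.group(1).split("."))
    vendor = ()
    if match.group(2):
        vendor = tuple(int(part) for part in match.group(2).split("."))
    return upstream, vendor


# Minimum stated in the release notes for platform-aws builds.
MINIMUM = parse_libfabric_version("1.22.0amzn4.0")


def meets_minimum(text):
    """Return True if `text` is at least the minimum under the assumed ordering."""
    return parse_libfabric_version(text) >= MINIMUM


if __name__ == "__main__":
    for candidate in ("1.22.0amzn4.0", "1.22.0amzn3.0", "1.21.0amzn9.0"):
        print(candidate, meets_minimum(candidate))
```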
Supported Distributions
- Amazon Linux 2
- Amazon Linux 2023
- Ubuntu 20.04 LTS, 22.04 LTS.
For releases before v1.6.0, we generally created releases from two separate
branches, an AWS-specific branch and a general release branch. With v1.6.0, we
have unified the code into a single branch, and made the AWS-specific parts a
compile-time option. When a feature (or entire release) only supports one of
the two variants, we note that in the release notes.
What's Changed
- ci: build oldest working EFA installer and latest by @aws-nslick in #522
- api: fail when using connect/accept_v4 with RDMA protocol by @rauteric in #529
- rdma: write topo file only for multi-rail platforms by @rauteric in #532
- dist: set TAR_OPTIONS to remove ownership info by @rauteric in #523
- Revert ".ci/aws: Add trainium tests to CI" by @a-szegel in #535
- nvidia: Change default network name to "Libfabric" by @bwbarrett in #530
- tuner: support tuner v3 API by @AmedeoSapio in #524
- init: Avoid hang by forcing SENDRECV in case of neuron v4 API usage by @maxtmann in #537
- Fix naming of array in nccl_net_ofi_plugin_init by @ryanhankins in #539
- Revert "param: increase CQ read count to 16 for performance" by @maxtmann in #538
- .ci/aws: Add g4dn testing to PR CI by @a-szegel in #527
- .ci/aws: Make failures happen in correct stage by @a-szegel in #528
- platform: Set RDMA protocol as default for trn1/trn1n platforms by @maxtmann in #540
- Expose each libfabric NIC as one NIC device to the user in case of non-NVIDIA platforms by @maxtmann in #544
- ci: cache efa installer by @aws-nslick in #545
- ci: fix efa installer caching by @aws-nslick in #546
- fix(rdma): endpont_per_comm: NULL ptr bug by @rauteric in #551
- tuner: Enable tuner init msg on INFO logs by @arunkarthik-akkart in #549
- .ci/aws: Decrease NCCL_TEST iterations to 5 by @a-szegel in #550
- fix(tree): use correct __cplusplus guards by @aws-nslick in #554
- Separate endpoint for control messages by @rajachan in #543
- fix(tree): add spaces around PRIu64 by @aws-nslick in #555
- feat(tree): add static_assert shim macro by @aws-nslick in #556
- fix(aws): align declaration and init order by @aws-nslick in #557
- fix(rdma): fi_{send,write}data: do arithmetic on uintptr by @aws-nslick in #558
- fix(tuner): don't choose NVLSTree if nRanks==nNodes by @AmedeoSapio in #583
- rdma: Eliminate unnecessary ctrl message waits in eager protocol by @rauteric in #553
- fix(tracing): use header-only nvtx3 by @aws-nslick in #590
- chore(.github/workflows): constrain push triggers to known branches by @aws-nslick in #582
- feat(build): better --enable-debug defaults by @aws-nslick in #596
- fix(freelist): use uintptr_t for pointer arithmetic by @aws-nslick in #560
- Fix: access domain from ep during mr on device by @maxtmann in #602
- Feature/v6 rma ops by @maxtmann in #541
- platform: trn1 default protocol send receive by @hunnorth in #603
- fix(tree): import libfabric's container_of macro by @aws-nslick in #605
- fix(valgrind): fix autotools mistake by @aws-nslick in #607
- feat(ci/github): use docker instead of codebuild by @aws-nslick in #608
- CI updates by @rajachan in #612
- util: Use FI_ENOPROTOOPT to check for a provider's support for option by @rajachan in #613
- Fix log format string behavior by @bwbarrett in #615
- Improve protocol selection logic by @bwbarrett in #610
- .ci/aws: Unpin al2 p3dn ami by @a-szegel in #552
- .ci/aws: re-Add trainium tests to CI by @a-szegel in #619
- fix(m4): set redzone size to 0 by @rauteric in #616
- Fully destroy endpoints when refcount is 0 by @bwbarrett in #617
- feat: add DMA-BUF support by @aws-nslick in #618
- Improve end of process cleanup and reporting by @bwbarrett in #620
- fix(rdma): stop setting FI_ORDER_NONE by @aws-nslick in #621
- fix(tree): use empty brace initializers for zero-initialization by @aws-nslick in #594
- fix(build): ensure -pthread is passed by @aws-nslick in #623
- fix(build): add missing AC_PROG_RANLIB by @aws-nslick in #622
- fix(ci): prefer ecr to dockerhub by @aws-nslick in #628
- feat(build): disable semantic interposition by @aws-nslick in #624
- fix(init): fix sendrecv fallback logic by @aws-nslick in #629
- fix: rdma: inverted print statement by @aws-nslick in #630
- rdma: Use get_device_from_ep() accessor by @bwbarrett in #626
- Combined -Wextra -Werror Commits by @aws-nslick in #627
- Add platform data settings for TRN2N by @maxtmann in #638
- tuner: add regions for AllGather/ReduceScatter in the one rank per node case by @AmedeoSapio in #641
- fix(rdma): send periodic control messages to sync sender/receiver by @rauteric in #640
- feat(build): add -fanalyzer when --enable-werror by @aws-nslick in #632
- Add Multiplexed-round-robin scheduler by @arunkarthik-akkart in #604
- fix : Fix flexible array member allocation by @arunkarthik-akkart in #649
- Revert "neuron: Disable rdma eager messages by default" by @maxtmann in #650
- .ci/aws: All CI use ami with EFA Installer by @a-szegel in #648
- separate out 3rd-party headers by @aws-nslick in #634
- Add a proper endpoint interface by @bwbarrett in #654
- feat(ci): add workflow_dispatch to distcheck by @aws-nslick in #658
- Fix use of uninitialized lock by @bwbarrett in #659
- aws: Skip the WRITE_IN_ORDER_ALIGNED_128_BYTES check for P5en by @rajachan in #625
- rdma: remove "request completed with error" message by @rauteric in #660
- rdma: do local RDMA read on all NIC rails for flush() by @taeilum00 in https://...
AWS OFI NCCL v1.12.1
All users of v1.12.0-aws are strongly recommended to take this fix when using EFA Installer >= 1.35.0.
Bug fixes:
- The platform-aws VF sorting code produces significant performance regressions or
crashes when used atop the latest EFA driver releases. This sorting code has been
reverted, which mitigates the problem. (adb47dc)
The plugin has been tested with the following libfabric providers using tests bundled in the source code and nccl-tests suite:
- efa
Digests:
3722e0790b98e65d04f143fe8484fd0a05dcec6419eeac1cdcad5e49f6c7cf8e aws-ofi-nccl-1.12.1-aws.tar.gz
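The digest above is not labeled with an algorithm. Its 64 hexadecimal digits are consistent with SHA-256, but that is an inference and not something these notes state. The Python sketch below picks the hash by digest length under that assumption; the function names are illustrative.

```python
import hashlib

# Assumed mapping from hex digest length to hashlib algorithm name.
_ALGORITHM_BY_HEX_LENGTH = {64: "sha256", 128: "sha512"}


def guess_algorithm(hex_digest):
    """Guess the hash algorithm from the length of a hex digest."""
    try:
        return _ALGORITHM_BY_HEX_LENGTH[len(hex_digest)]
    except KeyError:
        raise ValueError(f"unexpected digest length: {len(hex_digest)}") from None


def file_matches(path, hex_digest, chunk_size=1 << 20):
    """Return True if the file at `path` hashes to `hex_digest`."""
    hasher = hashlib.new(guess_algorithm(hex_digest))
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest() == hex_digest.lower()
```

A call such as `file_matches("aws-ofi-nccl-1.12.1-aws.tar.gz", digest)` would then return True only when the downloaded file matches the published digest.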
AWS OFI NCCL v1.11.1
All users of v1.11.0-aws are strongly recommended to take this fix when using EFA Installer >= 1.35.0.
Bug fixes:
- The platform-aws VF sorting code produces significant performance regressions or
crashes when used atop the latest EFA driver releases. This sorting code has been
reverted, which mitigates the problem. (84b7cfa)
The plugin has been tested with the following libfabric providers using tests bundled in the source code and nccl-tests suite:
- efa
Digests:
6d95eff619208e30d11044068c3781c1c079b180a683d422ce9f6a96ebeadb80 aws-ofi-nccl-1.11.1-aws.tar.gz
AWS OFI NCCL v1.12.0
This release is intended only for use on AWS P* instances. A general release that supports other Libfabric networks will be made in the near future. This release requires Libfabric v1.18.0 or later and supports NCCL 2.23.4-1 while maintaining backward compatibility with older NCCL versions (NCCL v2.17.1 and later).
New Features:
- Support for tuner v3 APIs
- Support for AllGather and ReduceScatter in the tuner
- Support for PAT algorithm in the tuner
Bug fixes:
- Fixed NULL pointer access in the endpoint per communicator path
- Replaced the NVLSTree option in the tuner with RING if nRanks==nNodes
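The NVLSTree fix above amounts to a small fallback rule. The Python sketch below only illustrates that rule and is not the plugin's tuner code; the function name and the string labels for algorithms are hypothetical.

```python
def adjust_algorithm(algorithm, n_ranks, n_nodes):
    """Fall back from NVLSTree to Ring when there is one rank per node.

    Illustrative only: the algorithm labels are hypothetical strings,
    not identifiers taken from the plugin or from NCCL.
    """
    if algorithm == "NVLSTREE" and n_ranks == n_nodes:
        return "RING"
    return algorithm


if __name__ == "__main__":
    # One rank per node: NVLSTree is replaced with Ring.
    print(adjust_algorithm("NVLSTREE", n_ranks=16, n_nodes=16))
    # Several ranks per node: the original choice is kept.
    print(adjust_algorithm("NVLSTREE", n_ranks=128, n_nodes=16))
```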
The plugin has been tested with the following libfabric providers using tests bundled in the source code and nccl-tests suite:
- efa
Checksum (sha512) for the release tarball:
7d9e41ce04253a32a13542e7f4c2d20c2a5a43cdfb575fe153954c5faed8cf85eb08dab76ee0f883109f7610bb43cb8b703fe2f1e98b8f02bbfa866dd1c268e1 aws-ofi-nccl-1.12.0-aws.tar.gz
AWS OFI NCCL v1.11.0
This release is intended only for use on AWS P* instances. A general release that supports other Libfabric networks will be made in the near future. This release requires Libfabric v1.18.0 or later and supports NCCL 2.22.3-1 while maintaining backward compatibility with older NCCL versions (NCCL v2.17.1 and later).
New Features:
- Autogenerate topology file on P5 by default, with detected topology, instead of using a static file
- Support for AWS P5e instance type
Bug fixes:
- Fixed segfault for platform-aws builds for instance types not explicitly configured
- Fixed failure in mr cache in SENDRECV protocol for providers that don't require memory registration
- Re-enabled WRITE_IN_ORDER_ALIGNED_128_BYTES setting and check on P5.
- Added check to cause an error when using old blocking connect_v4/accept_v4 interfaces with RDMA protocol. The previous release changed connection establishment such that these interfaces cause deadlock.
The plugin has been tested with the following libfabric providers using tests bundled in the source code and nccl-tests suite:
- efa
Checksum (sha512) for the release tarball:
17063f1e10a885fe6cd48e275c9a0d5748b73d04d6514103a5e9a0f28dff604c1766f8a85a55e89ad5691830c54199936d88442d28c65180c2f79be939f0b208 aws-ofi-nccl-1.11.0-aws.tar.gz
AWS OFI NCCL v1.10.0
This release is intended only for use on AWS P* instances. A general release that supports other Libfabric networks will be made in the near future. This release requires Libfabric v1.18.0 or later and supports NCCL 2.21.5-1 while maintaining backward compatibility with older NCCL versions (NCCL v2.17.1 and later).
New Features:
- Replaced the model-based tuner with one based on regions derived from experimental evaluations.
- Changed properties reported to NCCL to signal that registered MRs are global, in order to support user buffer registrations.
- Added the option to use different endpoints for receive communicators connected to the same source endpoint, while using a shared completion queue.
- Updated plugin to use the zero-copy path in the EFA provider for fi_send/fi_recv operations.
- Shrank the control message to 32 bytes to fit in inline data for EFA.
Bug Fixes:
- Disabled Libfabric shared memory when possible.
- Disabled RDMA eager messages on Neuron by default for better performance.
- Ensured plugin's multi-rail protocol consistently sorts rails in order of VF index for better performance.
The plugin has been tested with the following libfabric providers using tests bundled in the source code and nccl-tests suite:
- efa
Checksum (sha512) for the release tarball:
fa296339a7e40fa420e2934c3a44f9a18ad3a9d798b7f129b35f46892f76532b70996fe36f309e3dedd2823ed9a819a4578f7c8241d8549805c49811b38ae14f aws-ofi-nccl-1.10.0-aws.tar.gz
AWS OFI NCCL v1.9.2
This release is intended only for use on AWS P* instances. A general release that supports other Libfabric networks will be made in the near future. This is a bugfix release which requires Libfabric v1.18.0 or later and supports NCCL 2.21.5-1 while maintaining backward compatibility with older NCCL versions (NCCL v2.4.8 and later).
Bug Fixes:
- Improved tuner model to make better decisions on P5 instances.
- Added support in the RDMA protocol for truncation when the size passed to the isend call is greater than the size passed to the corresponding irecv.
- Fixed bug that prevented the tuner from getting loaded with NCCL 2.19 and 2.20.
- Fixed the logging statement that reports whether a domain is created per thread or per process.
- Updated plugin to not advertise global MR support, to avoid a performance regression in user-registered buffers.
The plugin has been tested with the following libfabric providers using tests bundled in the source code and nccl-tests suite:
- efa
Checksum (sha512) for the release tarball:
1e344f38baa1080c04d2c99a1390f51e2a9ce2a57d69c7494061bf4e5da5a4310328bafc323cb36f43b5fcd0d330bd1bd5eec257596de2125aa5c38096b78a01 aws-ofi-nccl-1.9.2-aws.tar.gz
AWS OFI NCCL v1.9.1
This release is intended only for use on AWS P* instances. A general release that supports other Libfabric networks will be made in the near future. This is a bugfix release which requires Libfabric v1.18.0 or later and supports NCCL 2.21.5-1 while maintaining backward compatibility with older NCCL versions (NCCL v2.4.8 and later).
Bug Fixes:
- Fix release distribution generation to include missing headers introduced in v1.9.0. This fixes issue #382.
- Restrict libcuda link-time dependency to builds with testing enabled
- Build fixes to explicitly link against libm and libpthread used by the plugin
The plugin has been tested with the following libfabric providers using tests bundled in the source code and nccl-tests suite:
- efa
Checksum (sha512) for the release tarball:
77e44dcdb77e6b25cae882d2124b6d9a2a66f2b85321ae827ec7e3fd88bacd214a537a2490a578af44b7457cc655b2e382fc148b6ed8594a68a30d145f3ce70e aws-ofi-nccl-1.9.1-aws.tar.gz