v1.13.0-aws
(2024-11-18)
This release is intended only for use on AWS P* instances. A general release that supports other libfabric networks may be made in the near future.
With this release, building with platform-aws requires libfabric version 1.22.0amzn4.0 or greater. AWS customers are generally advised to track the latest available EFA Installer for performance improvements and bug fixes.
The 1.13.x release series supports NCCL 2.23.4-1 while maintaining backward compatibility with older NCCL versions (NCCL v2.17.1 and later).
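The version floors stated above can be checked in a deployment script before building or launching. The following is a minimal sketch, not the plugin's own configure logic: the helper names are hypothetical, and the ordering rule (compare the upstream triple first, then the amzn suffix, treating a missing suffix as 0.0) is an assumption about how such version strings should be compared.

```python
import re

# Minimums taken from the release notes above.
MIN_LIBFABRIC = "1.22.0amzn4.0"
MIN_NCCL = "2.17.1"

# Accepts strings such as "1.22.0amzn4.0", "2.23.4-1" or "v2.17.1".
# The optional "-N" packaging suffix is ignored for comparison (assumption).
_VERSION_RE = re.compile(
    r"^v?(\d+)\.(\d+)\.(\d+)(?:amzn(\d+)\.(\d+))?(?:-\d+)?$"
)


def parse_version(text):
    """Return (major, minor, patch, amzn_major, amzn_minor) as integers."""
    match = _VERSION_RE.match(text.strip())
    if match is None:
        raise ValueError(f"unrecognized version string: {text!r}")
    # A missing amzn suffix is treated as 0.0 (assumption).
    return tuple(int(g) if g is not None else 0 for g in match.groups())


def meets_minimum(found, minimum):
    """True if `found` is at least `minimum` under tuple ordering."""
    return parse_version(found) >= parse_version(minimum)
```

For example, `meets_minimum("2.23.4-1", MIN_NCCL)` accepts the NCCL version this series targets, while `meets_minimum("1.21.0amzn1.0", MIN_LIBFABRIC)` rejects a libfabric build older than the stated floor.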
New features:
- AWS P5en platform support was added.
- Support was added for the NCCL v3 tuner API. The tuner now supports multiple
  platforms and supports multiple collectives.
- Scheduling improvements were made to the plugin RDMA protocol. In multirail
  configurations, this is expected to balance traffic more optimally.
- dmabuf memory registration support was added. Users facing problems with
  dmabuf may disable dmabuf with OFI_NCCL_DISABLE_DMABUF=1.
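For jobs launched from a script, the dmabuf opt-out above amounts to setting one environment variable in the child process. A minimal sketch follows; the helper name is hypothetical, and only the variable name and value come from these notes.

```python
import os


def env_with_dmabuf_disabled(base_env=None):
    """Return a copy of the environment with dmabuf registration disabled.

    Only OFI_NCCL_DISABLE_DMABUF=1 comes from the release notes; the
    helper itself is an illustrative assumption.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["OFI_NCCL_DISABLE_DMABUF"] = "1"
    return env
```

The returned mapping can be passed as the `env` argument of `subprocess.run` when starting the NCCL application, which leaves the parent process environment unchanged.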
Breaking changes:
- As mentioned above, building with support for platform-aws now requires
  libfabric version 1.22.0amzn4.0 or greater.
- Under CUDA, the plugin now statically links the CUDA runtime by default.
  Packagers preferring to dynamically link CUDA may pass
  --enable-cudart-dynamic at configure time to disable this.
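Packaging scripts that need to switch between the two CUDA runtime linking modes could assemble the configure command along these lines. This is a sketch under assumptions: the helper name is hypothetical and the script path `./configure` is assumed from the usual autotools layout; only the `--enable-cudart-dynamic` flag comes from these notes.

```python
def configure_argv(dynamic_cudart=False, extra_args=()):
    """Build an argv list for the plugin's configure script.

    With dynamic_cudart=False nothing is added, so the release default
    (static CUDA runtime) applies. "./configure" is an assumed path.
    """
    argv = ["./configure", *extra_args]
    if dynamic_cudart:
        argv.append("--enable-cudart-dynamic")
    return argv
```

The resulting list can be handed to `subprocess.run(..., check=True)` from the source tree; any other configure options a packager needs go in `extra_args`.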
Supported Distributions
- Amazon Linux 2
- Amazon Linux 2023
- Ubuntu 20.04 LTS, 22.04 LTS.
For releases before v1.6.0, we generally created releases from two separate
branches, an AWS-specific branch and a general release branch. With v1.6.0, we
have unified the code into a single branch, and made the AWS-specific parts a
compile-time option. When a feature (or entire release) only supports one of
the two variants, we note that in the release notes.
What's Changed
- ci: build oldest working EFA installer and latest by @aws-nslick in #522
- api: fail when using connect/accept_v4 with RDMA protocol by @rauteric in #529
- rdma: write topo file only for multi-rail platforms by @rauteric in #532
- dist: set TAR_OPTIONS to remove ownership info by @rauteric in #523
- Revert ".ci/aws: Add trainium tests to CI" by @a-szegel in #535
- nvidia: Change default network name to "Libfabric" by @bwbarrett in #530
- tuner: support tuner v3 API by @AmedeoSapio in #524
- init: Avoid hang by forcing SENDRECV in case of neuron v4 API usage by @maxtmann in #537
- Fix naming of array in nccl_net_ofi_plugin_init by @ryanhankins in #539
- Revert "param: increase CQ read count to 16 for performance" by @maxtmann in #538
- .ci/aws: Add g4dn testing to PR CI by @a-szegel in #527
- .ci/aws: Make failures happen in correct stage by @a-szegel in #528
- platform: Set RDMA protocol as default for trn1/trn1n platforms by @maxtmann in #540
- Expose each libfabric NIC as one NIC device to the user in case of non-NVIDIA platforms by @maxtmann in #544
- ci: cache efa installer by @aws-nslick in #545
- ci: fix efa installer caching by @aws-nslick in #546
- fix(rdma): endpont_per_comm: NULL ptr bug by @rauteric in #551
- tuner: Enable tuner init msg on INFO logs by @arunkarthik-akkart in #549
- .ci/aws: Decrease NCCL_TEST iterations to 5 by @a-szegel in #550
- fix(tree): use correct __cplusplus guards by @aws-nslick in #554
- Separate endpoint for control messages by @rajachan in #543
- fix(tree): add spaces around PRIu64 by @aws-nslick in #555
- feat(tree): add static_assert shim macro by @aws-nslick in #556
- fix(aws): align declaration and init order by @aws-nslick in #557
- fix(rdma): fi_{send,write}data: do arithmetic on uintptr by @aws-nslick in #558
- fix(tuner): don't choose NVLSTree if nRanks==nNodes by @AmedeoSapio in #583
- rdma: Eliminate unnecessary ctrl message waits in eager protocol by @rauteric in #553
- fix(tracing): use header-only nvtx3 by @aws-nslick in #590
- chore(.github/workflows): constrain push triggers to known branches by @aws-nslick in #582
- feat(build): better --enable-debug defaults by @aws-nslick in #596
- fix(freelist): use uintptr_t for pointer arithmetic by @aws-nslick in #560
- Fix: access domain from ep during mr on device by @maxtmann in #602
- Feature/v6 rma ops by @maxtmann in #541
- platform: trn1 default protocol send receive by @hunnorth in #603
- fix(tree): import libfabric's container_of macro by @aws-nslick in #605
- fix(valgrind): fix autotools mistake by @aws-nslick in #607
- feat(ci/github): use docker instead of codebuild by @aws-nslick in #608
- CI updates by @rajachan in #612
- util: Use FI_ENOPROTOOPT to check for a provider's support for option by @rajachan in #613
- Fix log format string behavior by @bwbarrett in #615
- Improve protocol selection logic by @bwbarrett in #610
- .ci/aws: Unpin al2 p3dn ami by @a-szegel in #552
- .ci/aws: re-Add trainium tests to CI by @a-szegel in #619
- fix(m4): set redzone size to 0 by @rauteric in #616
- Fully destroy endpoints when refcount is 0 by @bwbarrett in #617
- feat: add DMA-BUF support by @aws-nslick in #618
- Improve end of process cleanup and reporting by @bwbarrett in #620
- fix(rdma): stop setting FI_ORDER_NONE by @aws-nslick in #621
- fix(tree): use empty brace initializers for zero-initialization by @aws-nslick in #594
- fix(build): ensure -pthread is passed by @aws-nslick in #623
- fix(build): add missing AC_PROG_RANLIB by @aws-nslick in #622
- fix(ci): prefer ecr to dockerhub by @aws-nslick in #628
- feat(build): disable semantic interposition by @aws-nslick in #624
- fix(init): fix sendrecv fallback logic by @aws-nslick in #629
- fix: rdma: inverted print statement by @aws-nslick in #630
- rdma: Use get_device_from_ep() accessor by @bwbarrett in #626
- Combined -Wextra -Werror Commits by @aws-nslick in #627
- Add platform data settings for TRN2N by @maxtmann in #638
- tuner: add regions for AllGather/ReduceScatter in the one rank per node case by @AmedeoSapio in #641
- fix(rdma): send periodic control messages to sync sender/receiver by @rauteric in #640
- feat(build): add -fanalyzer when --enable-werror by @aws-nslick in #632
- Add Multiplexed-round-robin scheduler by @arunkarthik-akkart in #604
- fix : Fix flexible array member allocation by @arunkarthik-akkart in #649
- Revert "neuron: Disable rdma eager messages by default" by @maxtmann in #650
- .ci/aws: All CI use ami with EFA Installer by @a-szegel in #648
- separate out 3rd-party headers by @aws-nslick in #634
- Add a proper endpoint interface by @bwbarrett in #654
- feat(ci): add workflow_dispatch to distcheck by @aws-nslick in #658
- Fix use of uninitialized lock by @bwbarrett in #659
- aws: Skip the WRITE_IN_ORDER_ALIGNED_128_BYTES check for P5en by @rajachan in #625
- rdma: remove "request completed with error" message by @rauteric in #660
- rdma: do local RDMA read on all NIC rails for flush() by @taeilum00 in #652
- Fix abort when cache is disabled. by @bwbarrett in #662
- feat: Make tuner platform specific by @arunkarthik-akkart in #657
- Couple of accessor function / code cleanups by @bwbarrett in #661
- rdma: Revert commits eliminating eager waits by @rauteric in #664
- Cleanups from adding a domain interface by @bwbarrett in #670
- fix: Change multiplexer scheduler to use two rails instead of three by @arunkarthik-akkart in #669
- Add p5en platform_data and update default latency for undefined platforms by @rajachan in #672
- Fix a number of duplicate definition names by @bwbarrett in #667
- .ci/aws: Move p5 capacity to CGK by @sunkuamzn in #680
- Fix device sorting on aws platforms by @bwbarrett in #679
- rdma: add option to round robin the ctrl msg, and use shared CQs for control and data endpoints by @AmedeoSapio in #673
- Add option to abort() on error by @bwbarrett in #683
- Reduce repetitive INFO printing by @bwbarrett in #684
- aws: Override libfabric link_attr for certain platforms by @rajachan in #686
- Switch CI to persistent clusters with containers by @sunkuamzn in #687
- cuda: build flag for dynamically or statically linking cudart by @aws-nslick in #688
- Add platform data settings for TRN2 by @hunnorth in #693
- .ci/Jenkins: General Cleanup and Remove Region/CI From CI by @a-szegel in #694
- tuner: add model base tuner and refactor for co-exist by @taeilum00 in #692
- defaults: make dmabuf opt-in by @aws-nslick in #695
- .ci/aws: Improve CI Speed by @a-szegel in #701
- fix: ep release in endpoint per comm by @AmedeoSapio in #706
- rdma: Set FI_MORE when posting receive buffers by @bwbarrett in #705
- rdma: Set LOW_LATENCY traffic class for control by @bwbarrett in #702
- reenable dmabuf by default by @aws-nslick in #703
- MR: Enforce page-aligned buffer registration for iovec and add corresponding test case by @mozarhua in #685
- core: Leave endpoint created during init by @bwbarrett in #710
- feat: Region-based tuner support for P5en by @arunkarthik-akkart in #704
- fix: Fallback to internal tuner on NCCL-2.21.5 for PAT by @arunkarthik-akkart in #714
- release: v1.13.x aws by @aws-nslick in #712
New Contributors
- @arunkarthik-akkart made their first contribution in #549
- @hunnorth made their first contribution in #603
- @mozarhua made their first contribution in #685
Full Changelog: v1.12.1-aws...v1.13.0-aws