
Releases: aws/aws-ofi-nccl

AWS OFI NCCL v1.13.2

11 Dec 17:58
v1.13.2-aws

v1.13.2-aws (2024-12-06)

This release is intended only for use on AWS P* instances. A general release that supports other libfabric networks may be made in the near future.

With this release, building with platform-aws requires libfabric version 1.22.0amzn4.0 or greater. We generally recommend that AWS customers track the latest available EFA Installer for performance improvements and bug fixes.

The 1.13.x release series supports NCCL 2.23.4-1 while maintaining backward compatibility with older NCCL versions (NCCL v2.17.1 and later).
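
The compatibility floor stated above can be checked mechanically before deploying the plugin. The following is a minimal sketch, not part of the plugin: the helper names `parse_nccl_version` and `is_supported` are hypothetical, and the version string format (`major.minor.patch` with an optional `-N` suffix) is assumed from the versions quoted in these notes.

```python
import re

# Oldest NCCL version the 1.13.x notes list as compatible.
MIN_NCCL = (2, 17, 1)

def parse_nccl_version(text):
    """Parse strings such as '2.23.4-1' or 'v2.17.1' into (major, minor, patch)."""
    match = re.fullmatch(r"v?(\d+)\.(\d+)\.(\d+)(?:-\d+)?", text.strip())
    if match is None:
        raise ValueError(f"unrecognized NCCL version: {text!r}")
    return tuple(int(part) for part in match.groups())

def is_supported(text):
    """Return True when the given NCCL version is at or above the stated floor."""
    return parse_nccl_version(text) >= MIN_NCCL
```

For example, `is_supported("2.23.4-1")` is true, while a version below 2.17.1 is rejected.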

Bug Fixes:

  • Tuner Improvements:
    • Fixed algorithm selection for larger ranks and message sizes.
    • Re-calibrated the tuner for AllGather and ReduceScatter regions for 0x7 bitmask on P5en, optimizing performance for larger messages.
    • Added tuner support for AllGather and ReduceScatter regions for 0x0 bitmask on P5en.
  • Resolved a performance issue by preventing the eager protocol when RDMA writes are in flight, improving small AllReduce collective performance.

Note: dmabuf support is now turned off by default. Users can enable it explicitly using OFI_NCCL_DISABLE_DMABUF=0 if needed.
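
As an illustration of the setting above, a launcher script could prepare the environment for a training job as follows. This is a sketch under stated assumptions: only the variable name `OFI_NCCL_DISABLE_DMABUF` and its 0/1 values come from these notes, and the helper `plugin_env` is hypothetical.

```python
import os

def plugin_env(enable_dmabuf, base=None):
    """Return a copy of the environment with the dmabuf switch set explicitly.

    OFI_NCCL_DISABLE_DMABUF=0 enables dmabuf; OFI_NCCL_DISABLE_DMABUF=1 disables it.
    """
    env = dict(os.environ if base is None else base)
    env["OFI_NCCL_DISABLE_DMABUF"] = "0" if enable_dmabuf else "1"
    return env
```

The returned dictionary can be passed as the `env` argument of `subprocess.run` when starting the job, so the setting applies to that process only.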

Checksum (sha512) for the release tarball:

4c0ac3144f178062fda9e86b50bb1784822e8fdbdffadf41cdbb30839456c4e912254ff12a5b0a8c63abbe910597fd14211a42572a451d10e01932100013971e  aws-ofi-nccl-1.13.2-aws.tar.gz
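
A downloaded tarball can be checked against the published value with a few lines of Python. This is a minimal sketch using the standard `hashlib` module; the helper names `sha512_of` and `matches_checksum` are hypothetical, and the file is assumed to be in the current directory.

```python
import hashlib

def sha512_of(path, chunk_size=1 << 20):
    """Compute the SHA-512 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha512()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def matches_checksum(path, expected_hex):
    """Compare a file's SHA-512 digest with the published hex string."""
    return sha512_of(path) == expected_hex.strip().lower()
```

Calling `matches_checksum("aws-ofi-nccl-1.13.2-aws.tar.gz", published)` with the checksum string above as `published` should return `True` for an intact download.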

AWS OFI NCCL v1.13.1

26 Nov 23:10
v1.13.1-aws

v1.13.1-aws (2024-11-26)

This release is intended only for use on AWS P* instances. A general release that supports other libfabric networks may be made in the near future.

With this release, building with platform-aws requires libfabric version 1.22.0amzn4.0 or greater. We generally recommend that AWS customers track the latest available EFA Installer for performance improvements and bug fixes.

The 1.13.x release series supports NCCL 2.23.4-1 while maintaining backward compatibility with older NCCL versions (NCCL v2.17.1 and later).

Supported Distributions

  • Amazon Linux 2
  • Amazon Linux 2023
  • Ubuntu 20.04 LTS and 22.04 LTS

For releases before v1.6.0, we generally created releases from two separate
branches, an AWS-specific branch and a general release branch. With v1.6.0, we
have unified the code into a single branch, and made the AWS-specific parts a
compile-time option. When a feature (or entire release) only supports one of
the two variants, we note that in the release notes.

What's Changed

This release contains no functional changes compared to v1.13.0-aws. This release merely updates the version set in AC_INIT to include the -aws suffix to match the tag name and ensure generated artifacts are named correctly.

Checksum (sha512) for the release tarball:

b71afd2e7776b77392c91abb818fa011e415f31fa9061556cd725d7a52eb4101b45a10fe91284ec7cff06a9653456e95ae70a472affb32f68e01b1ce5e49ff83  aws-ofi-nccl-1.13.1-aws.tar.gz

v1.13.0-aws

19 Nov 05:37
v1.13.0-aws
cf7606e

v1.13.0-aws (2024-11-18)

This release is intended only for use on AWS P* instances. A general release that supports other libfabric networks may be made in the near future.

With this release, building with platform-aws requires libfabric version 1.22.0amzn4.0 or greater. We generally recommend that AWS customers track the latest available EFA Installer for performance improvements and bug fixes.

The 1.13.x release series supports NCCL 2.23.4-1 while maintaining backward compatibility with older NCCL versions (NCCL v2.17.1 and later).

New features:

  • AWS P5en platform support was added.

  • Support was added for the NCCL v3 tuner API. The tuner now supports multiple
    platforms and multiple collectives.

  • Scheduling improvements were made to the plugin RDMA protocol. In multirail
    configurations, this is expected to improve traffic balancing.

  • dmabuf memory registration support was added. Users facing problems with
    dmabuf may disable dmabuf with OFI_NCCL_DISABLE_DMABUF=1.

Breaking changes:

  • As mentioned above, building with support for platform-aws now requires
    libfabric version 1.22.0amzn4.0 or greater.

  • Under CUDA, the plugin now statically links the CUDA runtime by default.
    Packagers preferring to dynamically link CUDA may pass
    --enable-cudart-dynamic at configure time to disable this.
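
For packagers who script their builds, the linking choice above might be wired in as shown below. This is only a sketch: the flag `--enable-cudart-dynamic` comes from these notes, while the helper `configure_args` and the `./configure` entry point (assumed from the autoconf-based build) are illustrative.

```python
def configure_args(dynamic_cudart=False, extra=()):
    """Build the argument list for a configure invocation.

    With dynamic_cudart=False nothing is added, which leaves the default
    (static linking of the CUDA runtime) in place.
    """
    args = ["./configure"]
    if dynamic_cudart:
        # Flag named in the release notes for dynamic CUDA runtime linking.
        args.append("--enable-cudart-dynamic")
    args.extend(extra)
    return args
```

The resulting list can be handed to `subprocess.run(..., check=True)` from the source directory.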

Supported Distributions

  • Amazon Linux 2
  • Amazon Linux 2023
  • Ubuntu 20.04 LTS and 22.04 LTS

For releases before v1.6.0, we generally created releases from two separate
branches, an AWS-specific branch and a general release branch. With v1.6.0, we
have unified the code into a single branch, and made the AWS-specific parts a
compile-time option. When a feature (or entire release) only supports one of
the two variants, we note that in the release notes.


AWS OFI NCCL v1.12.1

25 Oct 05:10
v1.12.1-aws
2301579

We strongly recommend that all users of v1.12.0-aws take this fix when using EFA Installer >= 1.35.0.

Bug fixes:

  • The platform-aws VF sorting code produced significant performance regressions or
    crashes when used atop the latest EFA driver releases. This sorting code has been
    reverted, which mitigates the problem. (adb47dc)

The plugin has been tested with the following libfabric providers using the tests bundled in the source code and the nccl-tests suite:

  • efa

Digests:

3722e0790b98e65d04f143fe8484fd0a05dcec6419eeac1cdcad5e49f6c7cf8e  aws-ofi-nccl-1.12.1-aws.tar.gz
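
The digest above is 64 hexadecimal characters long, which matches the length of a SHA-256 digest, whereas other releases on this page publish 128-character SHA-512 checksums. These notes do not name the algorithm, so treating it as SHA-256 is an assumption. A sketch that picks the algorithm from the digest length (the helper names are hypothetical):

```python
import hashlib

# Assumed mapping from hex digest length to algorithm; the notes do not
# state which algorithm produced the 64-character digests.
HEX_LENGTH_TO_ALGORITHM = {64: "sha256", 128: "sha512"}

def guess_algorithm(hex_digest):
    """Infer the hash algorithm from the length of a hex digest."""
    length = len(hex_digest.strip())
    if length not in HEX_LENGTH_TO_ALGORITHM:
        raise ValueError(f"unexpected digest length: {length}")
    return HEX_LENGTH_TO_ALGORITHM[length]

def file_matches(path, hex_digest):
    """Hash the file with the inferred algorithm and compare digests."""
    digest = hashlib.new(guess_algorithm(hex_digest))
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest() == hex_digest.strip().lower()
```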

AWS OFI NCCL v1.11.1

25 Oct 05:10
v1.11.1-aws
2db8375

We strongly recommend that all users of v1.11.0-aws take this fix when using EFA Installer >= 1.35.0.

Bug fixes:

  • The platform-aws VF sorting code produced significant performance regressions or
    crashes when used atop the latest EFA driver releases. This sorting code has been
    reverted, which mitigates the problem. (84b7cfa)

The plugin has been tested with the following libfabric providers using the tests bundled in the source code and the nccl-tests suite:

  • efa

Digests:

6d95eff619208e30d11044068c3781c1c079b180a683d422ce9f6a96ebeadb80  aws-ofi-nccl-1.11.1-aws.tar.gz

AWS OFI NCCL v1.12.0

08 Oct 01:41

This release is intended only for use on AWS P* instances. A general release that supports other Libfabric networks will be made in the near future. This release requires Libfabric v1.18.0 or later and supports NCCL 2.23.4-1 while maintaining backward compatibility with older NCCL versions (NCCL v2.17.1 and later).

New Features:

  • Support for tuner v3 APIs
  • Support for AllGather and ReduceScatter in the tuner
  • Support for PAT algorithm in the tuner

Bug fixes:

  • Fixed NULL pointer access in the endpoint per communicator path
  • Replaced the NVLSTree option in the tuner with RING if nRanks==nNodes
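
The second fix above amounts to a simple substitution rule. The sketch below illustrates the rule as described in this note and is not the plugin's actual tuner code: the function `adjust_algorithm` and the string labels are hypothetical.

```python
def adjust_algorithm(candidate, n_ranks, n_nodes):
    """Swap NVLSTree for RING when every node hosts exactly one rank.

    Illustrative only: mirrors the rule stated in the release note.
    """
    if candidate == "NVLSTree" and n_ranks == n_nodes:
        return "RING"
    return candidate
```

With 16 ranks on 16 nodes, `adjust_algorithm("NVLSTree", 16, 16)` yields `"RING"`; with 64 ranks on 8 nodes the candidate is returned unchanged.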

The plugin has been tested with the following libfabric providers using the tests bundled in the source code and the nccl-tests suite:

  • efa

Checksum (sha512) for the release tarball:

7d9e41ce04253a32a13542e7f4c2d20c2a5a43cdfb575fe153954c5faed8cf85eb08dab76ee0f883109f7610bb43cb8b703fe2f1e98b8f02bbfa866dd1c268e1  aws-ofi-nccl-1.12.0-aws.tar.gz

AWS OFI NCCL v1.11.0

19 Aug 20:28
v1.11.0-aws

This release is intended only for use on AWS P* instances. A general release that supports other Libfabric networks will be made in the near future. This release requires Libfabric v1.18.0 or later and supports NCCL 2.22.3-1 while maintaining backward compatibility with older NCCL versions (NCCL v2.17.1 and later).

New Features:

  • Autogenerate topology file on P5 by default, with detected topology, instead of using a static file
  • Support for AWS P5e instance type

Bug fixes:

  • Fixed segfault for platform-aws builds for instance types not explicitly configured
  • Fixed failure in mr cache in SENDRECV protocol for providers that don't require memory registration
  • Re-enabled WRITE_IN_ORDER_ALIGNED_128_BYTES setting and check on P5.
  • Added check to cause an error when using old blocking connect_v4/accept_v4 interfaces with RDMA protocol. The previous release changed connection establishment such that these interfaces cause deadlock.

The plugin has been tested with the following libfabric providers using the tests bundled in the source code and the nccl-tests suite:

  • efa

Checksum (sha512) for the release tarball:

17063f1e10a885fe6cd48e275c9a0d5748b73d04d6514103a5e9a0f28dff604c1766f8a85a55e89ad5691830c54199936d88442d28c65180c2f79be939f0b208  aws-ofi-nccl-1.11.0-aws.tar.gz

AWS OFI NCCL v1.10.0

06 Aug 21:38

This release is intended only for use on AWS P* instances. A general release that supports other Libfabric networks will be made in the near future. This release requires Libfabric v1.18.0 or later and supports NCCL 2.21.5-1 while maintaining backward compatibility with older NCCL versions (NCCL v2.17.1 and later).

New Features:

  • Replaced the model-based tuner with one based on regions derived from experimental evaluations.
  • Changed properties reported to NCCL to signal that registered MRs are global, in order to support user buffer registrations.
  • Added the option to use different endpoints for receive communicators connected to the same source endpoint, while using a shared completion queue.
  • Updated plugin to use the zero-copy path in the EFA provider for fi_send/fi_recv operations.
  • Shrank the control message to 32 bytes to fit in inline data for EFA.

Bug Fixes:

  • Disabled Libfabric shared memory when possible.
  • Disabled RDMA eager messages on Neuron by default for better performance.
  • Ensured plugin's multi-rail protocol consistently sorts rails in order of VF index for better performance.

The plugin has been tested with the following libfabric providers using the tests bundled in the source code and the nccl-tests suite:

  • efa

Checksum (sha512) for the release tarball:

fa296339a7e40fa420e2934c3a44f9a18ad3a9d798b7f129b35f46892f76532b70996fe36f309e3dedd2823ed9a819a4578f7c8241d8549805c49811b38ae14f  aws-ofi-nccl-1.10.0-aws.tar.gz

AWS OFI NCCL v1.9.2

17 Jun 23:33

This release is intended only for use on AWS P* instances. A general release that supports other Libfabric networks will be made in the near future. This is a bugfix release which requires Libfabric v1.18.0 or later and supports NCCL 2.21.5-1 while maintaining backward compatibility with older NCCL versions (NCCL v2.4.8 and later).

Bug Fixes:

  • Improved tuner model to make better decisions on P5 instances.
  • Added support in the RDMA protocol for truncation when the size in the isend call is greater than the size in the corresponding irecv.
  • Fixed bug that prevented the tuner from getting loaded with NCCL 2.19 and 2.20.
  • Fixed the logging statement that reports whether a domain is created per thread or per process.
  • Updated plugin to not advertise global MR support, to avoid a performance regression in user-registered buffers.

The plugin has been tested with the following libfabric providers using the tests bundled in the source code and the nccl-tests suite:

  • efa

Checksum (sha512) for the release tarball:

1e344f38baa1080c04d2c99a1390f51e2a9ce2a57d69c7494061bf4e5da5a4310328bafc323cb36f43b5fcd0d330bd1bd5eec257596de2125aa5c38096b78a01  aws-ofi-nccl-1.9.2-aws.tar.gz

AWS OFI NCCL v1.9.1

15 Apr 21:45
v1.9.1-aws

This release is intended only for use on AWS P* instances. A general release that supports other Libfabric networks will be made in the near future. This is a bugfix release which requires Libfabric v1.18.0 or later and supports NCCL 2.21.5-1 while maintaining backward compatibility with older NCCL versions (NCCL v2.4.8 and later).

Bug Fixes:

  • Fix release distribution generation to include missing headers introduced in v1.9.0. This fixes issue #382.
  • Restrict libcuda link-time dependency to builds with testing enabled
  • Build fixes to explicitly link against libm and libpthread used by the plugin

The plugin has been tested with the following libfabric providers using the tests bundled in the source code and the nccl-tests suite:

  • efa

Checksum (sha512) for the release tarball:

77e44dcdb77e6b25cae882d2124b6d9a2a66f2b85321ae827ec7e3fd88bacd214a537a2490a578af44b7457cc655b2e382fc148b6ed8594a68a30d145f3ce70e  aws-ofi-nccl-1.9.1-aws.tar.gz