tree: cleanup "gdr_support" variable #711

Draft · wants to merge 1 commit into base: master
Conversation

aws-nslick
Contributor

In the default case, we lazily create all fabric resources at communicator-creation time, so that they end up owned by the correct thread and/or resident on the correct CPU socket and memory domain.

Previously, there was an ugly dependency chain in our init: while the large majority of the provider properties we care about can be extracted from fi_getinfo responses, some can only be effectively queried by attempting mutations against an existing endpoint/domain/etc. and seeing whether they fail. A further subset of these properties needs to be exposed back by nccl-net-ofi to NCCL at getProperties time, prior to communicator instantiation.

To work around this, late in init we pick a device, instantiate it, query the attributes we need for getProperties, and then tear it all down. This is expensive, delays our init, and exposes us to bugs from incomplete teardown.

The sole case in the codebase where this is necessary today is detecting GDR support for FI_HMEM_CUDA. With dmabuf now the default, it is relatively safe to skip the call and optimistically assume support when both CUDA properties are true and FI_HMEM is available in the provider.
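The optimistic check described above can be sketched as a pure predicate over the two CUDA device properties and the provider's capability bits. This is a minimal illustration, not the plugin's actual code: `HYPOTHETICAL_FI_HMEM` stands in for libfabric's real `FI_HMEM` capability bit, and the function name is invented.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for libfabric's FI_HMEM capability bit;
 * value and name are illustrative only. */
#define HYPOTHETICAL_FI_HMEM (1ULL << 0)

/* Sketch of the optimistic decision: advertise GDR support when the
 * CUDA device reports GPUDirect-RDMA and dmabuf support AND the provider
 * advertises HMEM, without ever instantiating and tearing down an
 * endpoint during init. */
static bool gdr_support(bool cuda_gdr_attr, bool cuda_dmabuf_attr,
                        uint64_t provider_caps)
{
    return cuda_gdr_attr && cuda_dmabuf_attr &&
           (provider_caps & HYPOTHETICAL_FI_HMEM) != 0;
}
```

The point of the change is that every input to this predicate is available before any endpoint exists, so the probe-and-teardown dance during init can be dropped.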

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@aws-nslick aws-nslick requested review from bwbarrett and a team as code owners November 18, 2024 03:22
@aws-nslick aws-nslick marked this pull request as draft November 18, 2024 03:22
Signed-off-by: Nicholas Sielicki <nslick@amazon.com>
@rauteric
Contributor

I'm skeptical this does the right thing on platforms like P3dn, where the flow is:

  1. Libfabric returns FI_HMEM support in getinfo
  2. We try to set FI_OPT_CUDA_API_PERMITTED = false and it fails
  3. We conclude we don't have GDR.
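The three-step flow above can be sketched as a predicate over what the probe actually observes. This is an illustrative stand-in: `FI_OPT_CUDA_API_PERMITTED` is a real libfabric endpoint option, but here the fi_setopt attempt is represented only by its return code, and `HYPOTHETICAL_ENOTSUP` is an invented error value.

```c
#include <stdbool.h>

/* Invented negative-errno-style value standing in for a provider
 * rejecting the option (libfabric would return -FI_EOPNOTSUPP). */
#define HYPOTHETICAL_ENOTSUP (-95)

/* Sketch of the P3dn-style probe flow: getinfo may advertise FI_HMEM
 * (step 1), but GDR is only concluded if setting
 * FI_OPT_CUDA_API_PERMITTED = false on a real endpoint succeeds
 * (steps 2-3). */
static bool gdr_from_probe(bool getinfo_hmem, int setopt_rc)
{
    if (!getinfo_hmem)
        return false;       /* step 1 failed: no HMEM at all */
    return setopt_rc == 0;  /* steps 2-3: probe failure means no GDR */
}
```

The concern is that the optimistic check keeps step 1 but drops steps 2 and 3, so a platform where the probe would have failed could now wrongly advertise GDR.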

@aws-nslick
Contributor Author

aws-nslick commented Dec 13, 2024

@rauteric I'd expect that CU_DEVICE_ATTRIBUTE_GPU_DIRECT_RDMA_SUPPORTED returns false there, short-circuiting the whole thing and ensuring we never attempt to advertise GDR support back to NCCL at all, no?

edit: simple enough to test, will look at this again when I get a chance.


@bwbarrett bwbarrett left a comment


We should tease apart the changes on cleaning up the support_gdr discovery and the code around the gdrflush cuda operation. There's really no need to couple those, and it really confuses the patch.

Today, Eric is right: we wanted to make sure we can disable CUDA in Libfabric, and if we can't, we disable GDR. But I think we can simplify the code a ton by keying only off HMEM, having an environment variable to disable GDR support, and just erroring if we create an endpoint and can't disable CUDA (when the disable-GDR variable is not set). Not being able to disable CUDA is a super edge case, and let's simplify the code by making the user deal with it. In that case, initialization flags plus the env var determine the support_gdr flag, and we don't need the endpoint creation during init. Make sense?
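The proposal above splits the decision into two places: a cheap init-time decision keyed off HMEM and an opt-out variable, and a hard error at endpoint creation if the advertised support turns out to be unachievable. A minimal sketch of that shape, assuming a hypothetical `OFI_NCCL_DISABLE_GDR` environment variable name:

```c
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* At init: key support_gdr only off provider HMEM support plus a
 * user opt-out; no endpoint is created here. The env var name is a
 * placeholder, not the plugin's actual knob. */
static bool decide_gdr_at_init(bool provider_has_hmem)
{
    const char *disable = getenv("OFI_NCCL_DISABLE_GDR");
    if (disable != NULL && strcmp(disable, "0") != 0)
        return false;             /* user explicitly opted out */
    return provider_has_hmem;     /* otherwise key only off HMEM */
}

/* Later, at endpoint creation: if GDR was advertised but disabling the
 * CUDA API on the endpoint fails, return a hard error instead of
 * silently downgrading -- the edge-case user must set the opt-out. */
static int check_endpoint(bool gdr_enabled, int setopt_rc)
{
    if (gdr_enabled && setopt_rc != 0)
        return -1;                /* fail loudly on this edge case */
    return 0;
}
```

The trade-off is deliberate: the rare platform that cannot disable CUDA now fails with a clear error and an escape hatch, in exchange for dropping the probe endpoint from every init.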
