tree: cleanup "gdr_support" variable #711

Draft · wants to merge 1 commit into base: master
Conversation

aws-nslick
Contributor

In the default case, we lazily create all fabric resources at communicator-creation time, so that they end up owned by the correct thread and/or resident on the correct CPU socket and memory domain.

Previously, there was an ugly dependency chain in our init: while the large majority of the provider properties we care about can be extracted from fi_getinfo responses, some can only be effectively queried by attempting mutations against an existing endpoint/domain/etc. and seeing whether they fail. A further subset of these properties needs to be exposed back by nccl-net-ofi to NCCL at getProperties time, prior to communicator instantiation.

To work around this, late in init we pick a device, instantiate it, query the attributes we need for getProperties, and then tear it all down. This is expensive, delays our init, and exposes us to bugs from incomplete teardown.

The sole case in the codebase where this is necessary today is detecting GDR support for FI_HMEM_CUDA. With dmabuf now the default, it is relatively safe to skip the call and optimistically assume support when both CUDA properties are true and FI_HMEM is available in the provider.
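The optimistic check described above can be sketched as a pure predicate over the two CUDA device properties and the provider's capability bits. This is a minimal illustration, not the plugin's actual code: `HYPOTHETICAL_FI_HMEM` stands in for libfabric's real `FI_HMEM` capability bit, and the function name is invented.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for libfabric's FI_HMEM capability bit;
 * value and name are illustrative only. */
#define HYPOTHETICAL_FI_HMEM (1ULL << 0)

/* Sketch of the optimistic decision: advertise GDR support when the
 * CUDA device reports GPUDirect-RDMA and dmabuf support AND the provider
 * advertises HMEM, without ever instantiating and tearing down an
 * endpoint during init. */
static bool gdr_support(bool cuda_gdr_attr, bool cuda_dmabuf_attr,
                        uint64_t provider_caps)
{
    return cuda_gdr_attr && cuda_dmabuf_attr &&
           (provider_caps & HYPOTHETICAL_FI_HMEM) != 0;
}
```

The point of the change is that every input to this predicate is available before any endpoint exists, so the probe-and-teardown dance during init can be dropped.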

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@aws-nslick aws-nslick requested review from bwbarrett and a team as code owners November 18, 2024 03:22
@aws-nslick aws-nslick marked this pull request as draft November 18, 2024 03:22
Signed-off-by: Nicholas Sielicki <nslick@amazon.com>
@rauteric
Contributor

I'm skeptical this does the right thing on platforms like P3dn, where the flow is:

  1. Libfabric returns FI_HMEM support in getinfo
  2. We try to set FI_OPT_CUDA_API_PERMITTED = false and it fails
  3. We conclude we don't have GDR.
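The three-step flow above can be sketched as a predicate over what the probe actually observes. This is an illustrative stand-in: `FI_OPT_CUDA_API_PERMITTED` is a real libfabric endpoint option, but here the fi_setopt attempt is represented only by its return code, and `HYPOTHETICAL_ENOTSUP` is an invented error value.

```c
#include <stdbool.h>

/* Invented negative-errno-style value standing in for a provider
 * rejecting the option (libfabric would return -FI_EOPNOTSUPP). */
#define HYPOTHETICAL_ENOTSUP (-95)

/* Sketch of the P3dn-style probe flow: getinfo may advertise FI_HMEM
 * (step 1), but GDR is only concluded if setting
 * FI_OPT_CUDA_API_PERMITTED = false on a real endpoint succeeds
 * (steps 2-3). */
static bool gdr_from_probe(bool getinfo_hmem, int setopt_rc)
{
    if (!getinfo_hmem)
        return false;       /* step 1 failed: no HMEM at all */
    return setopt_rc == 0;  /* steps 2-3: probe failure means no GDR */
}
```

The concern is that the optimistic check keeps step 1 but drops steps 2 and 3, so a platform where the probe would have failed could now wrongly advertise GDR.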

@aws-nslick
Contributor Author

aws-nslick commented Dec 13, 2024

@rauteric I'd expect that CU_DEVICE_ATTRIBUTE_GPU_DIRECT_RDMA_SUPPORTED returns false there, short-circuiting the whole thing and ensuring we never attempt to advertise GDR support back to NCCL at all, no?

edit: simple enough to test, will look at this again when I get a chance.


@bwbarrett bwbarrett left a comment


We should tease apart the changes on cleaning up the support_gdr discovery and the code around the gdrflush cuda operation. There's really no need to couple those, and it really confuses the patch.

Today, Eric is right: we wanted to make sure we can disable CUDA in Libfabric, and if we can't, we disable GDR. But I think we can simplify the code a ton by keying only off HMEM, having an environment variable to disable GDR support, and just erroring if we create an endpoint and can't disable CUDA (when the disable-GDR variable is not set). Not being able to disable CUDA is a super edge case, and let's simplify the code by making the user deal with it. In that case, initialization flags plus the env var determine the support_gdr flag, and we don't need the endpoint creation during init. Make sense?
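The proposal above splits the decision into two places: a cheap init-time decision keyed off HMEM and an opt-out variable, and a hard error at endpoint creation if the advertised support turns out to be unachievable. A minimal sketch of that shape, assuming a hypothetical `OFI_NCCL_DISABLE_GDR` environment variable name:

```c
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* At init: key support_gdr only off provider HMEM support plus a
 * user opt-out; no endpoint is created here. The env var name is a
 * placeholder, not the plugin's actual knob. */
static bool decide_gdr_at_init(bool provider_has_hmem)
{
    const char *disable = getenv("OFI_NCCL_DISABLE_GDR");
    if (disable != NULL && strcmp(disable, "0") != 0)
        return false;             /* user explicitly opted out */
    return provider_has_hmem;     /* otherwise key only off HMEM */
}

/* Later, at endpoint creation: if GDR was advertised but disabling the
 * CUDA API on the endpoint fails, return a hard error instead of
 * silently downgrading -- the edge-case user must set the opt-out. */
static int check_endpoint(bool gdr_enabled, int setopt_rc)
{
    if (gdr_enabled && setopt_rc != 0)
        return -1;                /* fail loudly on this edge case */
    return 0;
}
```

The trade-off is deliberate: the rare platform that cannot disable CUDA now fails with a clear error and an escape hatch, in exchange for dropping the probe endpoint from every init.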
