tree: cleanup "gdr_support" variable #711
Draft
+218
−324
In the default case, we lazily create all fabric resources at communicator creation time, so that they end up owned by the correct thread and/or resident on the correct CPU socket and memory domain.
Previously, there was an ugly dependency chain in our init: while the large majority of the provider properties we care about can be extracted from fi_getinfo responses, some can only be queried effectively by attempting a mutation against an existing endpoint, domain, etc. and seeing whether it fails. A further subset of these properties needs to be exposed by nccl-net-ofi back to NCCL at getProperties time, before any communicator is instantiated.
To work around this, late in init we pick a device, instantiate it, query the attributes we need for getProperties, and then tear it all down. This is expensive and delays our init, and it exposes us to bugs from incomplete teardown.
The sole case in the codebase today where this is necessary is detecting GDR support for FI_HMEM_CUDA. With dmabuf now the default, it's relatively safe to skip the call and optimistically assume support when both CUDA properties are true and FI_HMEM is available in the provider.
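As a rough illustration of the optimistic check described above, here is a minimal sketch. It is not the code in this PR: the `CudaProps` struct, its two field names, `kCapHmem`, `GdrSupport`, and `assume_gdr_support` are all hypothetical, since the description does not name the two CUDA properties, and real code would test libfabric's `FI_HMEM` bit in the capabilities returned by fi_getinfo rather than a stand-in constant.

```cpp
#include <cstdint>

// Hypothetical stand-in for libfabric's FI_HMEM capability bit. Real code
// would use the FI_HMEM constant from <rdma/fabric.h> against the caps
// reported by fi_getinfo.
constexpr std::uint64_t kCapHmem = 1ULL << 0;

// Hypothetical container for the two CUDA properties the description refers
// to. The field names are assumptions made for this sketch.
struct CudaProps {
    bool gdr_supported;
    bool dmabuf_supported;
};

enum class GdrSupport { Unsupported, Supported };

// Decide GDR support without instantiating a device: assume support only
// when both CUDA properties are true and the provider advertises HMEM.
GdrSupport assume_gdr_support(const CudaProps &cuda, std::uint64_t provider_caps)
{
    const bool provider_has_hmem = (provider_caps & kCapHmem) != 0;
    if (cuda.gdr_supported && cuda.dmabuf_supported && provider_has_hmem) {
        return GdrSupport::Supported;
    }
    return GdrSupport::Unsupported;
}
```

Because the decision depends only on values already known before getProperties, nothing needs to be instantiated and torn down to answer it. The trade-off is that a provider which advertises the capability but fails at registration time would only be caught later.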
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.