Skip to content

Commit

Permalink
[PGNCCL] Rework NCCLComm dtor to avoid clash with CUDA driver shutdown (
Browse files Browse the repository at this point in the history
pytorch#141511)

Making CUDA or NCCL calls in object destruction can be dangerous because CUDA context may have exited before the the destructor, in which case, the CUDA calls would see a "CUDA driver shutting down" error.

this PR does take a destroy call away from NCCLComm dtor, and doesn't add a new one. If users are calling destroy_process_group or abort_process_group as recommended, then we are destroying for them, and otherwise we are OK with letting them possibly leak resources (and get a warning).

Pull Request resolved: pytorch#141511
Approved by: https://github.com/eqy, https://github.com/wconstab
ghstack dependencies: pytorch#141510
  • Loading branch information
kwen2501 authored and pytorchmergebot committed Dec 9, 2024
1 parent 4dbecf3 commit 5d3bc63
Showing 1 changed file with 10 additions and 11 deletions.
21 changes: 10 additions & 11 deletions torch/csrc/distributed/c10d/NCCLUtils.hpp
Original file line number Diff line number Diff line change
Expand Up @@ -219,19 +219,18 @@ class NCCLComm {
NCCLComm() = default;

~NCCLComm() noexcept {
// Add lock in this destructor, as aborted_ needs to be read after memory
// barrier here.
// (kwen2501) Making CUDA/NCCL calls in this destructor can hit CUDA driver
// shutdown error if CUDA context has exited first. Thus, we are not
// destroying or aborting NCCL communicators here. We just detect and warn
// about the risk of memory leak. Normally, a user would have called
// `destroy_process_group` or `abort_process_group`, and such risk would be
// avoided.
LockType lock(mutex_);
if (ncclComm_ && initialized_ && !aborted_) {
at::cuda::OptionalCUDAGuard gpuGuard(deviceIndex_);
#ifdef ENABLE_NCCL_ERROR_CHECKING
// Use ncclCommAbort instead of ncclCommDestroy here since
// ncclCommDestroy could block forever waiting for work to complete on
// the communicator.
C10D_NCCL_ASSERT(::ncclCommAbort(ncclComm_));
#else
C10D_NCCL_ASSERT(::ncclCommDestroy(ncclComm_));
#endif
TORCH_WARN_ONCE(
"WARNING: NCCL communicator hasn't been destroyed. This may cause "
"memory leaks. To avoid the risk, you can call `destroy_process_group` "
"during normal exit or `_abort_process_group` when handling failures.")
}
}

Expand Down

0 comments on commit 5d3bc63

Please sign in to comment.