[PGNCCL] Rework NCCLComm dtor to avoid clash with CUDA driver shutdown (

pytorch#141511) Making CUDA or NCCL calls in object destruction can be dangerous because CUDA context may have exited before the the destructor, in which case, the CUDA calls would see a "CUDA driver shutting down" error. this PR does take a destroy call away from NCCLComm dtor, and doesn't add a new one. If users are calling destroy_process_group or abort_process_group as recommended, then we are destroying for them, and otherwise we are OK with letting them possibly leak resources (and get a warning). Pull Request resolved: pytorch#141511 Approved by: https://github.com/eqy, https://github.com/wconstab ghstack dependencies: pytorch#141510
poolsideai · Dec 9, 2024 · 5d3bc63 · 5d3bc63
1 parent 4dbecf3
commit 5d3bc63
Showing 1 changed file with 10 additions and 11 deletions.
diff --git a/torch/csrc/distributed/c10d/NCCLUtils.hpp b/torch/csrc/distributed/c10d/NCCLUtils.hpp
@@ -219,19 +219,18 @@ class NCCLComm {
   NCCLComm() = default;
 
   ~NCCLComm() noexcept {
-    // Add lock in this destructor, as aborted_ needs to be read after memory
-    // barrier here.
+    // (kwen2501) Making CUDA/NCCL calls in this destructor can hit CUDA driver
+    // shutdown error if CUDA context has exited first. Thus, we are not
+    // destroying or aborting NCCL communicators here. We just detect and warn
+    // about the risk of memory leak. Normally, a user would have called
+    // `destroy_process_group` or `abort_process_group`, and such risk would be
+    // avoided.
     LockType lock(mutex_);
     if (ncclComm_ && initialized_ && !aborted_) {
-      at::cuda::OptionalCUDAGuard gpuGuard(deviceIndex_);
-#ifdef ENABLE_NCCL_ERROR_CHECKING
-      // Use ncclCommAbort instead of ncclCommDestroy here since
-      // ncclCommDestroy could block forever waiting for work to complete on
-      // the communicator.
-      C10D_NCCL_ASSERT(::ncclCommAbort(ncclComm_));
-#else
-      C10D_NCCL_ASSERT(::ncclCommDestroy(ncclComm_));
-#endif
+      TORCH_WARN_ONCE(
+          "WARNING: NCCL communicator hasn't been destroyed. This may cause "
+          "memory leaks. To avoid the risk, you can call `destroy_process_group` "
+          "during normal exit or `_abort_process_group` when handling failures.")
     }
   }