Commit
[PGNCCL] Rework NCCLComm dtor to avoid clash with CUDA driver shutdown (pytorch#141511)

Making CUDA or NCCL calls in object destruction can be dangerous because the CUDA context may have exited before the destructor runs, in which case the CUDA calls would see a "CUDA driver shutting down" error.

This PR takes a destroy call away from the NCCLComm dtor and doesn't add a new one. If users are calling destroy_process_group or abort_process_group as recommended, then we are destroying for them; otherwise we are OK with letting them possibly leak resources (and get a warning).

Pull Request resolved: pytorch#141511
Approved by: https://github.com/eqy, https://github.com/wconstab
ghstack dependencies: pytorch#141510
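
The teardown policy described above can be illustrated with a small, standard-library-only Python sketch. It is not the PGNCCL implementation: `FakeComm`, `release_log`, and both `run_*` helpers are hypothetical names made up for this illustration, and the "driver" is just a list that records release calls. The point it tries to show is that the explicit `destroy()` path is the only one that touches the driver, while the destructor only warns about a possible leak.

```python
import gc
import warnings


class FakeComm:
    """Hypothetical stand-in for a communicator that owns a driver-level handle."""

    def __init__(self, release_log):
        self._release_log = release_log  # records calls to the pretend driver
        self._destroyed = False

    def destroy(self):
        # Explicit teardown: the only place that talks to the "driver".
        if not self._destroyed:
            self._release_log.append("released")
            self._destroyed = True

    def __del__(self):
        # The destructor makes no driver calls; it only reports the leak.
        if not self._destroyed:
            warnings.warn(
                "FakeComm was not destroyed explicitly; resources may leak",
                ResourceWarning,
            )


def run_with_explicit_destroy():
    log = []
    comm = FakeComm(log)
    try:
        pass  # collective work would go here
    finally:
        comm.destroy()  # recommended path: release while the driver is alive
    return log


def run_without_destroy():
    log = []
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        comm = FakeComm(log)
        del comm      # CPython reference counting runs __del__ here
        gc.collect()  # extra nudge for other interpreters
    return log, [str(w.message) for w in caught]
```

In real code the analogous pattern would be to call destroy_process_group (or abort_process_group) before the process exits, rather than relying on object destruction to clean up.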