This repository has been archived by the owner on Mar 23, 2023. It is now read-only.

The error happened when I did multi-node distributed training #180

Open
ShangWeiKuo opened this issue Oct 21, 2022 · 1 comment

Comments

@ShangWeiKuo

🐛 Describe the bug

Excuse me. When I run the command `colossalai run --nproc_per_node 4 --host [host1 ip addr],[host2 ip addr] --master_addr [host1 ip addr] train.py`, I get this message: `Error: failed to run torchrun --nproc_per_node=4 --nnodes=2 --node_rank=1 --rdzv_backend=c10d --rdzv_endpoint=[host1 ip addr]:29500 --rdzv_id=colossalai-default-job train.py on [host2 ip addr]`

What configurations do I have to set in the `train.py` you provided?
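For context (this is not the repository's own `train.py`, just a minimal sketch): a script launched through `colossalai run` is ultimately started by `torchrun`, which exports the rendezvous environment variables (`RANK`, `WORLD_SIZE`, `LOCAL_RANK`, `MASTER_ADDR`, `MASTER_PORT`). A bare `torch.distributed` script that reads them looks like this:

```python
import torch.distributed as dist

def init_distributed() -> tuple[int, int]:
    """Initialize the default process group from the environment
    variables that torchrun exports (RANK, WORLD_SIZE, MASTER_ADDR,
    MASTER_PORT, LOCAL_RANK)."""
    # "gloo" works on CPU; use "nccl" for multi-GPU training.
    dist.init_process_group(backend="gloo")
    return dist.get_rank(), dist.get_world_size()

if __name__ == "__main__":
    rank, world_size = init_distributed()
    print(f"rank {rank} of {world_size} is up")
    dist.destroy_process_group()
```

If this script fails to start on the second node, the launcher itself (SSH reachability, matching paths, matching environments on both hosts) is usually the problem rather than the script.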

Environment

CUDA Version: 11.3
PyTorch Version: 1.12.0
CUDA Version in PyTorch Build: 11.3
PyTorch CUDA Version Match: ✓
CUDA Extension: ✓

@FrankLeeeee
Contributor

Hi, are these two machines connected via passwordless SSH?
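For reference, `colossalai run` launches the job on the remote node over SSH, so the launch node must reach every other host without a password prompt. A typical setup (where `user` and `host2` are placeholders for your actual account and second machine) looks roughly like this:

```shell
# On host1 (the node you run `colossalai run` from):
ssh-keygen -t ed25519          # accept the default path, empty passphrase
ssh-copy-id user@host2         # install the public key on the other node
ssh user@host2 hostname        # should print host2's hostname with no password prompt
```

If the last command still asks for a password, fix that before retrying the distributed launch.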
