This repository has been archived by the owner on Mar 23, 2023. It is now read-only.
I found a runtime error while running the code:
The client socket has failed to connect to any network address of (hcp-bb-03, 52873). The client socket has failed to connect to hcp-bb-03:52873 (errno: 110 - Connection timed out)
The command line used was: colossalai run --nproc_per_node 4 --master_port 29505 train.py
Environment
Thanks for your reply, but the problem is still unsolved.
However, I found the following command could help:
python3.8 -m torch.distributed.launch --nproc_per_node 4 --master_addr localhost --master_port 29500 train.py
With this command, the network starts training and reaches the target accuracy.
The question is, what is the difference between the two commands?
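One visible difference is that the second command passes --master_addr localhost explicitly, while the error from the first command mentions the hostname hcp-bb-03. A minimal sketch for checking whether a given host and port accept TCP connections from this node is shown below. The helper names can_connect and resolve are hypothetical and are not part of colossalai or torch; the sketch only tests plain socket reachability and does not reproduce what either launcher does internally.

```python
import socket


def resolve(host: str) -> list:
    """Return the sorted, de-duplicated IP addresses that `host` resolves to."""
    try:
        infos = socket.getaddrinfo(host, None)
    except socket.gaierror:
        # Name resolution failed: no addresses to report.
        return []
    return sorted({info[4][0] for info in infos})


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts (such as errno 110), refused connections,
        # and name-resolution failures.
        return False
```

For example, calling resolve("hcp-bb-03") and resolve("localhost") on the affected machine shows which addresses each name maps to, and can_connect(host, port) can be run against both names while a training job is listening on the master port. If localhost connects and the hostname times out, that would point at name resolution or firewall settings rather than at train.py, though this is only a hypothesis to test.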