This repository has been archived by the owner on Mar 23, 2023. It is now read-only.

connection failure #207

Open
lhj-git opened this issue Jan 4, 2023 · 2 comments

Comments

@lhj-git

lhj-git commented Jan 4, 2023

🐛 Describe the bug

I hit a runtime error while running the code:
The client socket has failed to connect to any network address of (hcp-bb-03, 52873). The client socket has failed to connect to hcp-bb-03:52873 (errno: 110 - Connection timed out)
Command line used: colossalai run --nproc_per_node 4 --master_port 29505 train.py
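
The error above names the host hcp-bb-03 rather than a loopback address, so one thing worth checking is what that hostname resolves to and whether a TCP connection to it can be opened at all. The following is a minimal diagnostic sketch using only the Python standard library; the helper names `resolve` and `can_connect` are hypothetical and are not part of ColossalAI or PyTorch.

```python
import socket
from typing import List


def resolve(host: str) -> List[str]:
    """Return the distinct IP addresses that host resolves to, or [] on failure."""
    try:
        infos = socket.getaddrinfo(host, None)
    except socket.gaierror:
        return []
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({info[4][0] for info in infos})


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts (errno 110), and resolution errors.
        return False


if __name__ == "__main__":
    # Compare what the machine's own hostname resolves to with the loopback name.
    host = socket.gethostname()
    print(host, "->", resolve(host))
    print("localhost ->", resolve("localhost"))
```

If the hostname resolves to an address that is not reachable from the same machine (for example because of a firewall or a stale hosts entry), `can_connect` would return False for it while a loopback address could still succeed. This is only one possible cause and is not confirmed by the thread.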

Environment

[image: environment screenshot]

@FrankLeeeee
Contributor

Can you try this?

colossalai run --nproc_per_node 4 --master_port 29505 --master_addr 127.0.0.1 train.py

@lhj-git
Author

lhj-git commented Jan 4, 2023

Thanks for your reply, but the problem is still unsolved.
However, I found that the following command works:
python3.8 -m torch.distributed.launch --nproc_per_node 4 --master_addr localhost --master_port 29500 train.py
The network starts training and reaches the target accuracy.
The question is: what is the difference between the two commands?
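
As written, the two commands differ in the launcher (colossalai run versus torch.distributed.launch), the master address (127.0.0.1 versus localhost), and the master port (29505 versus 29500). One way to see what each launcher actually hands to the workers is to print the rendezvous settings from inside the training process. The sketch below assumes that both launchers export the conventional torch.distributed environment variables to each worker; the helper name `rendezvous_env` is hypothetical.

```python
import os

# Environment variables conventionally used by torch.distributed launchers.
# Assumption: both launchers in this thread export them to every worker.
RENDEZVOUS_KEYS = ("MASTER_ADDR", "MASTER_PORT", "RANK", "LOCAL_RANK", "WORLD_SIZE")


def rendezvous_env(environ=None):
    """Collect rendezvous-related variables; missing ones map to None."""
    env = os.environ if environ is None else environ
    return {key: env.get(key) for key in RENDEZVOUS_KEYS}


if __name__ == "__main__":
    # Run this under each launcher and compare the printed values.
    for key, value in rendezvous_env().items():
        print(f"{key}={value}")
```

Calling `rendezvous_env()` at the top of train.py under each command and comparing the output would show whether the address and port the workers see match what was passed on the command line.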
