Questions about requesting nodes and GPUs #558
Comments
@ThomaswellY Can you create an issue on https://github.com/kubeflow/training-operator, since the mpi-operator doesn't support the v1 API?
Or you can consider upgrading to the v2beta API :) To answer some of your questions:
In that case, you might want to set the number of GPUs per worker to 1 (along with …)
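As a rough illustration of the single-GPU-per-worker idea (only a sketch; the image name and resource amounts are copied from the reporter's example further down, and this is not necessarily how the truncated suggestion above was meant to be completed):

Worker:
  replicas: 4                 # one worker pod per GPU (illustrative)
  restartPolicy: OnFailure
  template:
    spec:
      containers:
      - image: 10.252.39.13:5000/deepspeed_ms:v2   # image taken from the example below
        name: deepspeed-mpijob-container
        resources:
          limits:
            cpu: 2
            memory: 8Gi
            nvidia.com/gpu: 1   # one GPU per worker; total GPU request = replicas * this value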
Thanks for your reply~
I have applied the example YAML in this way successfully, but it seems the 4 GPUs are used separately by 4 pods, and what each worker executed was single-GPU training. So it is not distributed training (by which I mean multi-node training with a single GPU per node), and the whole process takes more time than single-GPU training in one pod with "replicas: 1". What confuses me is that the value of "replicas" seems to only serve as a multiplier for "nvidia.com/gpu".
That is correct. @tenzen-y's point is that the v1 implementation is no longer hosted in this repo. The rest of the questions:
@ThomaswellY Also, I would suggest …
Also, it seems that #549 shows that v2beta1 can run DeepSpeed.
@alculquicondor @tenzen-y Thanks for your kind help! Maybe I should use v2beta1 for DeepSpeed.
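For the record, here is roughly what I expect the top of the same job to look like under v2beta1 (only a sketch based on the v2beta1 MPIJob schema; the pod templates would stay roughly as in my v1 YAML quoted below, and #549 above is a better reference for a complete, working DeepSpeed setup):

apiVersion: kubeflow.org/v2beta1   # instead of kubeflow.org/v1
kind: MPIJob
metadata:
  name: cifar
spec:
  slotsPerWorker: 1
  runPolicy:
    cleanPodPolicy: Running
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          containers:
          - name: launcher
            image: 10.252.39.13:5000/deepspeed_ms:v2   # image from the v1 example below
            # mpirun command, env, etc. as in the v1 example below
    Worker:
      replicas: 2
      template:
        spec:
          containers:
          - name: worker
            image: 10.252.39.13:5000/deepspeed_ms:v2
            resources:
              limits:
                nvidia.com/gpu: 2
            # note: v2beta1 launches MPI ranks over SSH, so the worker image generally needs an SSH server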
@ThomaswellY Thank you for the report!
Feel free to open PRs. I'm happy to review them :)
Hi, I have been using mpi-operator for distributed training recently.
The command I use most is “kubectl apply -f <yaml>”. Let me take the following MPIJob YAML as an example:
apiVersion: kubeflow.org/v1
kind: MPIJob
metadata:
  name: cifar
spec:
  slotsPerWorker: 1
  runPolicy:
    cleanPodPolicy: Running
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      restartPolicy: Never
      template:
        spec:
          nodeName:
          containers:
          - image: 10.252.39.13:5000/deepspeed_ms:v2
            name: mpijob-cifar-deepspeed-container
            imagePullPolicy: Always
            command:
            - mpirun
            - --allow-run-as-root
            - python
            - cifar/cifar10_deepspeed.py
            - --epochs=100
            - --deepspeed_mpi
            - --deepspeed
            - --deepspeed_config
            - cifar/ds_config.json
            env:
            - name: OMP_NUM_THREADS
              value: "1"
    Worker:
      replicas: 2
      restartPolicy: OnFailure
      template:
        spec:
          nodeName:
          containers:
          - image: 10.252.39.13:5000/deepspeed_ms:v2
            name: deepspeed-mpijob-container
            resources:
              limits:
                cpu: 2
                memory: 8Gi
                nvidia.com/gpu: 2
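As pasted, this spec requests Worker replicas × nvidia.com/gpu per worker = 2 × 2 = 4 GPUs in total, while the launcher pod requests no GPU. In the runs described below I mainly vary the Worker replicas.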
There are some questions I'm confused about:
* When replicas is set to a larger number, it takes a bit more time for the cifar-launcher pod to complete.
* The logs printed in the cifar-launcher pod (when replicas: 4) looked just like the output of replicas: 1 repeated 4 times.
So does this mean the four pods each separately requested one GPU (from nodes in the k8s cluster, preferentially from the same node if enough GPUs are available) and each printed out the average result, i.e. the whole process had nothing to do with distributed training?
* By the way, when setting "replicas: 3", an error is reported in my case:
train_batch_size is not equal to micro_batch_per_gpu * gradient_acc_step * world_size 64 != 21 * 1 * 3
This did confuse me.
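If I understand the check correctly, DeepSpeed requires train_batch_size = micro_batch_per_gpu * gradient_accumulation_steps * world_size. With train_batch_size 64 (as the message suggests) and world_size 3, 64 is not divisible by 3, so no integer micro-batch size satisfies the equation; the closest it gets is 21 * 1 * 3 = 63. Maybe that is why this particular replica count fails, but I am not sure whether it is the whole story.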
Thanks in advance for your reply~