GPU page: discuss specific restrictions of socket/GPU binding on GPU nodes. #834

Open
wwarriner opened this issue Oct 11, 2024 · 1 comment
Labels
fabric: cheaha Docs related to Cheaha platform

Comments

@wwarriner
Contributor

What would you like to see added?

CPU affinity is set in gres.conf and mirrors the hardware architecture. Affinity means that, on any GPU node, only certain physical cores are associated with each physical GPU, for performance reasons. Cores mapped to a GPU have faster access to that GPU than cores that are not mapped to it. The mapping is part of the physical layout of the devices and cannot be changed. Slurm cannot determine it on its own, so it must be specified in gres.conf.
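To make the mapping concrete, here is a minimal Python sketch that reads gres.conf-style lines and counts the cores bound to each GPU device. The sample lines are hypothetical and are not Cheaha's actual configuration: the type name, device paths, and core ranges are invented for illustration, and the parser assumes each GPU line uses the `Cores=` field with simple ranges.

```python
# Hypothetical gres.conf-style lines; NOT Cheaha's actual configuration.
SAMPLE_GRES_CONF = """
Name=gpu Type=example File=/dev/nvidia0 Cores=0-13
Name=gpu Type=example File=/dev/nvidia1 Cores=14-27
"""


def count_cores(core_spec):
    """Count the cores in a spec such as '0-13' or '0-3,8-11'."""
    total = 0
    for part in core_spec.split(","):
        if "-" in part:
            first, last = part.split("-")
            total += int(last) - int(first) + 1
        else:
            total += 1
    return total


def cores_per_gpu(conf_text):
    """Map each GPU device file to the number of cores bound to it."""
    mapping = {}
    for line in conf_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = dict(item.split("=", 1) for item in line.split())
        if fields.get("Name") == "gpu" and "Cores" in fields:
            mapping[fields["File"]] = count_cores(fields["Cores"])
    return mapping


print(cores_per_gpu(SAMPLE_GRES_CONF))
```

With this made-up sample, each GPU ends up with 14 bound cores, which is the kind of fixed per-GPU core count that the affinity limit comes from.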

In practice, CPU affinity limits the ratio of cores to GPUs when requesting GPUs for jobs. Setting aside QoS, if a researcher requests a single GPU and more cores than shown in the table below, the job may be spread across multiple nodes. If they try to force the higher core count onto a single node with --nodes=1, the job will get stuck in the queue with ReqNodeNotAvail.

The table below ignores QoS limits.

| partition | max cores:gpu from affinity | max cores for 1 gpu |
|-----------|-----------------------------|---------------------|
| pascal*   | 14:1                        | 14                  |
| ampere*   | 64:1                        | 64                  |
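As a rough illustration of how the table could be applied, the sketch below checks whether a request stays within the affinity ratio. The function name `within_affinity_limit` is hypothetical. It assumes the limit scales linearly with the number of GPUs requested, and it ignores both QoS limits and the node's total core count, neither of which appears in the table.

```python
# Core-to-GPU ratios copied from the table above (QoS limits ignored).
MAX_CORES_PER_GPU = {"pascal*": 14, "ampere*": 64}


def within_affinity_limit(partition, gpus, cores):
    """Return True if `cores` does not exceed the affinity ratio for `gpus`.

    Assumption: the limit scales linearly with the GPU count. The node's
    total core count is not known here, so it is not checked.
    """
    if gpus < 1 or cores < 1:
        raise ValueError("gpus and cores must be positive")
    return cores <= MAX_CORES_PER_GPU[partition] * gpus


print(within_affinity_limit("pascal*", gpus=1, cores=14))  # True
print(within_affinity_limit("pascal*", gpus=1, cores=16))  # False
```

A request that fails this check is the kind that would either spill onto a second node or, with --nodes=1, wait in the queue.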
@wwarriner wwarriner added the fabric: cheaha Docs related to Cheaha platform label Oct 11, 2024
@mdefende
Member

This is correct to my understanding. Requesting more than the specified number of cores will cause the job to be spread over 2 different nodes, which means some of the requested resources will be allocated to the job but unavailable to it. Someone can request all of the cores on a single pascal node by requesting at least 2 GPUs. This isn't important for the A100s right now because the per-user QoS already limits any one person to 64 cores.
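The workaround of requesting more GPUs to unlock more cores can be sketched as below. This is only an illustration: the helper names are hypothetical, the ratios are the ones from the table in the issue, and the arithmetic assumes the ratio scales linearly with GPU count. The generated lines use the standard sbatch options --nodes, --gres, and --cpus-per-task, and assume a single-task job. No partition option is emitted because the table gives only wildcard names.

```python
import math

# Core-to-GPU ratios from the table in the issue (QoS limits ignored).
MAX_CORES_PER_GPU = {"pascal*": 14, "ampere*": 64}


def min_gpus_for_cores(partition, cores):
    """Smallest GPU count whose affinity limit covers `cores`.

    Assumption: the limit scales linearly with the number of GPUs.
    """
    if cores < 1:
        raise ValueError("cores must be positive")
    return math.ceil(cores / MAX_CORES_PER_GPU[partition])


def sbatch_directives(partition, cores):
    """Build single-node sbatch directive lines for a single-task job."""
    gpus = min_gpus_for_cores(partition, cores)
    return [
        "#SBATCH --nodes=1",
        f"#SBATCH --gres=gpu:{gpus}",
        f"#SBATCH --cpus-per-task={cores}",
    ]


for line in sbatch_directives("pascal*", 20):
    print(line)
```

For 20 cores on a pascal partition this asks for 2 GPUs, since a single GPU covers at most 14 cores. The extra GPU is allocated whether or not the job uses it.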
