Unable to access nvidia v100 gpu inside pod container #11520
-
(Screenshot: node description showing allocatable GPU in capacity.)
-
Please read the k3s docs on use of the NVIDIA container runtime. Your pod spec does not specify use of the nvidia runtime class.
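For reference, the relevant piece from the k3s docs is a RuntimeClass object that maps the name `nvidia` to the nvidia containerd handler; a minimal sketch, assuming the NVIDIA container runtime is already installed and detected on the GPU node:

```yaml
# Sketch of the RuntimeClass described in the k3s NVIDIA runtime docs;
# the handler name assumes containerd on the node has the "nvidia" runtime configured.
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
handler: nvidia
```

Pods then opt in with `runtimeClassName: nvidia` in their spec, as in the example in the next reply.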
-
Haha, I remember this being a giant pain in the ass. I'll save you (and anyone else) some trouble.

nvidia-smi.yaml:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: nvidia-smi
spec:
  nodeSelector:
    nvidia.com/gpu.present: "true"
  runtimeClassName: nvidia
  restartPolicy: OnFailure
  containers:
    - name: nvidia-smi
      image: nvidia/cuda:12.1.0-base-ubuntu22.04
      command: ['sh', '-c', "nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: "1"
```

This will print the generic nvidia-smi output to the pod's logs. If you see the GPU show up in that pod, it's usually a success. I have it pinned to a specific node that has the GPU installed, so it should succeed for every pod that lands there (all of them -- unless something goes very, very wrong).

FWIW -- the above response is in relation to the original question.
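To run the check (assuming kubectl access to the cluster), apply the manifest and read the pod logs: `kubectl apply -f nvidia-smi.yaml`, then `kubectl logs nvidia-smi` once the pod has completed. If the runtime class and drivers are wired up correctly, the log contains the usual nvidia-smi table with the V100 listed.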
-
I have created a multi-node k3s cluster.
The control plane node runs Ubuntu 20.04 and has no GPU.
The worker node is an NVIDIA DGX with V100 GPUs.
I deployed the NVIDIA GPU Operator, which includes the NVIDIA device plugin DaemonSet.
I created a pod using the .yaml below.
After logging in to the pod's shell I cannot access the GPU; nvidia-smi reports "command not found".
On the worker node everything works fine: the drivers are installed correctly and nvidia-smi works.
Please help with this issue.
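A hypothetical sketch (illustrative only, not the original manifest) of the kind of pod spec that produces this symptom -- it requests nvidia.com/gpu but omits runtimeClassName, which is the gap the replies above point at:

```yaml
# Hypothetical example for illustration only (not the manifest from the question):
# the pod requests a GPU resource but does not set runtimeClassName, so the
# container typically runs under the default runtime and the NVIDIA driver
# binaries (including nvidia-smi) are never mounted into it.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  containers:
    - name: cuda
      image: nvidia/cuda:12.1.0-base-ubuntu22.04
      command: ['sleep', 'infinity']
      resources:
        limits:
          nvidia.com/gpu: "1"
```

Adding `runtimeClassName: nvidia` at the spec level, as in the working example above, is the fix the replies describe.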