Lower Default `terminated-pod-gc-threshold` to Prevent Excessive Accumulation of Failed Pods #760

andy108369 (Collaborator) · 2024-12-20T11:24:20Z

The current default value of terminated-pod-gc-threshold in Kubernetes (12500) lets a very large number of terminated pods pile up in a cluster before the pod garbage collector starts deleting them. Pod failures themselves are expected in some scenarios, for example when a tenant allocates less ephemeral storage than the application needs and the pod is evicted for exceeding its limit. The problem is the accumulation of those failed pods, which adds unnecessary overhead.

Observed Issue:

In scenarios where pods fail due to expected resource constraints (e.g., ephemeral storage limits), the cluster rapidly accumulates far more than 10 failed pods. In the example below, a single-replica deployment left 21 failed pods behind in under 40 minutes.
Failed pods in Error or ContainerStatusUnknown states are retained long after they are useful. They clutter namespaces and increase etcd storage usage, which can degrade API performance and complicate operational management.
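
To get a feel for how quickly this builds up on a given cluster, a tally per namespace can help. The sketch below is only an illustration: the function name count_failed_by_namespace is made up, and it assumes its input is the output of kubectl get pods -A --no-headers, where the namespace is the first column. It reads from stdin, so it can be tried without a cluster.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of kubectl or the Akash tooling):
# tally pods per namespace from `kubectl get pods -A --no-headers` style lines.
# Assumes the first whitespace-separated column is the namespace.
count_failed_by_namespace() {
  awk 'NF { count[$1]++ }
       END { for (ns in count) printf "%d %s\n", count[ns], ns }' |
    sort -k1,1nr -k2,2   # highest count first, then namespace name
}

# Usage (needs access to a cluster):
#   kubectl get pods -A --field-selector status.phase=Failed --no-headers \
#     | count_failed_by_namespace
```

Namespaces at the top of the list are the ones where evicted pods are being recreated most often.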

Real example of the Issue:

root@control-01:~# kubectl -n lease get manifest $ns -o yaml
apiVersion: akash.network/v2beta2
kind: Manifest
metadata:
  creationTimestamp: "2024-12-19T04:05:06Z"
  generation: 1
  labels:
    akash.network: "true"
    akash.network/lease.id.dseq: "19417619"
    akash.network/lease.id.gseq: "1"
    akash.network/lease.id.oseq: "1"
    akash.network/lease.id.owner: REDACTED
    akash.network/lease.id.provider: akash15tl6v6gd0nte0syyxnv57zmmspgju4c3xfmdhk
    akash.network/namespace: 7vmqhnt4rkv5odetj1921ttoh2vu8slrgqp6kuhdpmngm
  name: 7vmqhnt4rkv5odetj1921ttoh2vu8slrgqp6kuhdpmngm
  namespace: lease
  resourceVersion: "81255006"
  uid: 0a5bc2e1-472a-4867-af62-99de137ea3d6
spec:
  group:
    name: dcloud
    services:
    - count: 1
      expose:
      - endpoint_sequence_number: 0
        external_port: 80
        global: true
        http_options:
          max_body_size: 1048576
          next_cases:
          - error
          - timeout
          next_tries: 3
          read_timeout: 60000
          send_timeout: 60000
        port: 80
        proto: TCP
      image: andrey01/falcon7b:0.4
      name: service-1
      resources:
        cpu:
          units: 100
        gpu:
          units: 0
        id: 1
        memory:
          size: "536870912"
        storage:
        - name: default
          size: "1073741824"
  lease_id:
    dseq: "19417619"
    gseq: 1
    oseq: 1
    owner: REDACTED
    provider: akash15tl6v6gd0nte0syyxnv57zmmspgju4c3xfmdhk
root@control-01:~#

root@control-01:~# kubectl -n $ns get pods -o wide
NAME                         READY   STATUS                   RESTARTS   AGE     IP              NODE                   NOMINATED NODE   READINESS GATES
service-1-587f6f88bc-2zlnh   0/1     Error                    0          13m     10.233.73.148   worker-01.hurricane2   <none>           <none>
service-1-587f6f88bc-7dm65   0/1     Error                    0          32m     10.233.73.159   worker-01.hurricane2   <none>           <none>
service-1-587f6f88bc-8hnmm   1/1     Running                  0          42s     10.233.73.152   worker-01.hurricane2   <none>           <none>
service-1-587f6f88bc-8sdmp   0/1     Error                    0          11m     10.233.73.159   worker-01.hurricane2   <none>           <none>
service-1-587f6f88bc-b7z7d   0/1     ContainerStatusUnknown   1          22m     10.233.73.152   worker-01.hurricane2   <none>           <none>
service-1-587f6f88bc-btfsd   0/1     Error                    0          31m     10.233.73.151   worker-01.hurricane2   <none>           <none>
service-1-587f6f88bc-g64rn   0/1     Error                    0          27m     10.233.73.148   worker-01.hurricane2   <none>           <none>
service-1-587f6f88bc-g67nt   0/1     Error                    0          38m     10.233.73.151   worker-01.hurricane2   <none>           <none>
service-1-587f6f88bc-gzgxm   0/1     ContainerStatusUnknown   1          4m20s   10.233.73.159   worker-01.hurricane2   <none>           <none>
service-1-587f6f88bc-h99nw   0/1     Error                    0          25m     10.233.73.159   worker-01.hurricane2   <none>           <none>
service-1-587f6f88bc-hqzfr   0/1     Error                    0          34m     10.233.73.148   worker-01.hurricane2   <none>           <none>
service-1-587f6f88bc-jfnmd   0/1     Error                    0          24m     10.233.73.151   worker-01.hurricane2   <none>           <none>
service-1-587f6f88bc-jzwpl   0/1     ContainerStatusUnknown   1          2m26s   10.233.73.151   worker-01.hurricane2   <none>           <none>
service-1-587f6f88bc-knldh   0/1     Error                    0          20m     10.233.73.148   worker-01.hurricane2   <none>           <none>
service-1-587f6f88bc-kzjrn   0/1     Error                    0          6m3s    10.233.73.148   worker-01.hurricane2   <none>           <none>
service-1-587f6f88bc-m4cdb   0/1     Error                    0          7m48s   10.233.73.152   worker-01.hurricane2   <none>           <none>
service-1-587f6f88bc-pnldm   0/1     Error                    0          29m     10.233.73.152   worker-01.hurricane2   <none>           <none>
service-1-587f6f88bc-rtqtp   0/1     ContainerStatusUnknown   1          9m41s   10.233.73.151   worker-01.hurricane2   <none>           <none>
service-1-587f6f88bc-rzftx   0/1     Error                    0          15m     10.233.73.152   worker-01.hurricane2   <none>           <none>
service-1-587f6f88bc-sl7mt   0/1     ContainerStatusUnknown   1          16m     10.233.73.151   worker-01.hurricane2   <none>           <none>
service-1-587f6f88bc-t48t5   0/1     ContainerStatusUnknown   1          36m     10.233.73.152   worker-01.hurricane2   <none>           <none>
service-1-587f6f88bc-zkflx   0/1     Error                    0          18m     10.233.73.159   worker-01.hurricane2   <none>           <none>

root@control-01:~# kubectl -n $ns get rs
NAME                   DESIRED   CURRENT   READY   AGE
service-1-587f6f88bc   1         1         1       30h
root@control-01:~#

root@control-01:~# kubectl get events -A --sort-by='{.metadata.creationTimestamp}' 
...
7vmqhnt4rkv5odetj1921ttoh2vu8slrgqp6kuhdpmngm   11m         Normal    Pulled                pod/service-1-587f6f88bc-8sdmp                                    Container image "andrey01/falcon7b:0.4" already present on machine
7vmqhnt4rkv5odetj1921ttoh2vu8slrgqp6kuhdpmngm   11m         Normal    Started               pod/service-1-587f6f88bc-8sdmp                                    Started container service-1
7vmqhnt4rkv5odetj1921ttoh2vu8slrgqp6kuhdpmngm   11m         Normal    Created               pod/service-1-587f6f88bc-8sdmp                                    Created container service-1
7vmqhnt4rkv5odetj1921ttoh2vu8slrgqp6kuhdpmngm   10m         Normal    Killing               pod/service-1-587f6f88bc-8sdmp                                    Stopping container service-1
7vmqhnt4rkv5odetj1921ttoh2vu8slrgqp6kuhdpmngm   10m         Warning   Evicted               pod/service-1-587f6f88bc-8sdmp                                    Pod ephemeral local storage usage exceeds the total limit of containers 1073741824.
7vmqhnt4rkv5odetj1921ttoh2vu8slrgqp6kuhdpmngm   10m         Warning   ExceededGracePeriod   pod/service-1-587f6f88bc-8sdmp                                    Container runtime did not kill the pod within specified grace period.
7vmqhnt4rkv5odetj1921ttoh2vu8slrgqp6kuhdpmngm   9m46s       Normal    Created               pod/service-1-587f6f88bc-rtqtp                                    Created container service-1
7vmqhnt4rkv5odetj1921ttoh2vu8slrgqp6kuhdpmngm   9m46s       Normal    Pulled                pod/service-1-587f6f88bc-rtqtp                                    Container image "andrey01/falcon7b:0.4" already present on machine
7vmqhnt4rkv5odetj1921ttoh2vu8slrgqp6kuhdpmngm   9m46s       Normal    Scheduled             pod/service-1-587f6f88bc-rtqtp                                    Successfully assigned 7vmqhnt4rkv5odetj1921ttoh2vu8slrgqp6kuhdpmngm/service-1-587f6f88bc-rtqtp to worker-01.hurricane2
7vmqhnt4rkv5odetj1921ttoh2vu8slrgqp6kuhdpmngm   9m45s       Normal    Started               pod/service-1-587f6f88bc-rtqtp                                    Started container service-1
7vmqhnt4rkv5odetj1921ttoh2vu8slrgqp6kuhdpmngm   8m24s       Normal    Killing               pod/service-1-587f6f88bc-rtqtp                                    Stopping container service-1
7vmqhnt4rkv5odetj1921ttoh2vu8slrgqp6kuhdpmngm   8m24s       Warning   Evicted               pod/service-1-587f6f88bc-rtqtp                                    Pod ephemeral local storage usage exceeds the total limit of containers 1073741824.
7vmqhnt4rkv5odetj1921ttoh2vu8slrgqp6kuhdpmngm   8m14s       Warning   ExceededGracePeriod   pod/service-1-587f6f88bc-rtqtp                                    Container runtime did not kill the pod within specified grace period.
7vmqhnt4rkv5odetj1921ttoh2vu8slrgqp6kuhdpmngm   7m52s       Normal    Scheduled             pod/service-1-587f6f88bc-m4cdb                                    Successfully assigned 7vmqhnt4rkv5odetj1921ttoh2vu8slrgqp6kuhdpmngm/service-1-587f6f88bc-m4cdb to worker-01.hurricane2
7vmqhnt4rkv5odetj1921ttoh2vu8slrgqp6kuhdpmngm   7m52s       Normal    Pulled                pod/service-1-587f6f88bc-m4cdb                                    Container image "andrey01/falcon7b:0.4" already present on machine
7vmqhnt4rkv5odetj1921ttoh2vu8slrgqp6kuhdpmngm   7m52s       Normal    Created               pod/service-1-587f6f88bc-m4cdb                                    Created container service-1
7vmqhnt4rkv5odetj1921ttoh2vu8slrgqp6kuhdpmngm   7m51s       Normal    Started               pod/service-1-587f6f88bc-m4cdb                                    Started container service-1
7vmqhnt4rkv5odetj1921ttoh2vu8slrgqp6kuhdpmngm   6m40s       Warning   Evicted               pod/service-1-587f6f88bc-m4cdb                                    Pod ephemeral local storage usage exceeds the total limit of containers 1073741824.
7vmqhnt4rkv5odetj1921ttoh2vu8slrgqp6kuhdpmngm   6m40s       Normal    Killing               pod/service-1-587f6f88bc-m4cdb                                    Stopping container service-1
7vmqhnt4rkv5odetj1921ttoh2vu8slrgqp6kuhdpmngm   6m30s       Warning   ExceededGracePeriod   pod/service-1-587f6f88bc-m4cdb                                    Container runtime did not kill the pod within specified grace period.
7vmqhnt4rkv5odetj1921ttoh2vu8slrgqp6kuhdpmngm   6m8s        Normal    Created               pod/service-1-587f6f88bc-kzjrn                                    Created container service-1
7vmqhnt4rkv5odetj1921ttoh2vu8slrgqp6kuhdpmngm   6m8s        Normal    Pulled                pod/service-1-587f6f88bc-kzjrn                                    Container image "andrey01/falcon7b:0.4" already present on machine
7vmqhnt4rkv5odetj1921ttoh2vu8slrgqp6kuhdpmngm   6m8s        Normal    Scheduled             pod/service-1-587f6f88bc-kzjrn                                    Successfully assigned 7vmqhnt4rkv5odetj1921ttoh2vu8slrgqp6kuhdpmngm/service-1-587f6f88bc-kzjrn to worker-01.hurricane2
7vmqhnt4rkv5odetj1921ttoh2vu8slrgqp6kuhdpmngm   6m7s        Normal    Started               pod/service-1-587f6f88bc-kzjrn                                    Started container service-1
7vmqhnt4rkv5odetj1921ttoh2vu8slrgqp6kuhdpmngm   4m56s       Warning   Evicted               pod/service-1-587f6f88bc-kzjrn                                    Pod ephemeral local storage usage exceeds the total limit of containers 1073741824.
7vmqhnt4rkv5odetj1921ttoh2vu8slrgqp6kuhdpmngm   4m56s       Normal    Killing               pod/service-1-587f6f88bc-kzjrn                                    Stopping container service-1
7vmqhnt4rkv5odetj1921ttoh2vu8slrgqp6kuhdpmngm   4m46s       Warning   ExceededGracePeriod   pod/service-1-587f6f88bc-kzjrn                                    Container runtime did not kill the pod within specified grace period.
7vmqhnt4rkv5odetj1921ttoh2vu8slrgqp6kuhdpmngm   4m25s       Normal    Scheduled             pod/service-1-587f6f88bc-gzgxm                                    Successfully assigned 7vmqhnt4rkv5odetj1921ttoh2vu8slrgqp6kuhdpmngm/service-1-587f6f88bc-gzgxm to worker-01.hurricane2
7vmqhnt4rkv5odetj1921ttoh2vu8slrgqp6kuhdpmngm   4m24s       Normal    Started               pod/service-1-587f6f88bc-gzgxm                                    Started container service-1
7vmqhnt4rkv5odetj1921ttoh2vu8slrgqp6kuhdpmngm   4m24s       Normal    Created               pod/service-1-587f6f88bc-gzgxm                                    Created container service-1
7vmqhnt4rkv5odetj1921ttoh2vu8slrgqp6kuhdpmngm   4m24s       Normal    Pulled                pod/service-1-587f6f88bc-gzgxm                                    Container image "andrey01/falcon7b:0.4" already present on machine
7vmqhnt4rkv5odetj1921ttoh2vu8slrgqp6kuhdpmngm   3m2s        Normal    Killing               pod/service-1-587f6f88bc-gzgxm                                    Stopping container service-1
7vmqhnt4rkv5odetj1921ttoh2vu8slrgqp6kuhdpmngm   3m2s        Warning   Evicted               pod/service-1-587f6f88bc-gzgxm                                    Pod ephemeral local storage usage exceeds the total limit of containers 1073741824.
7vmqhnt4rkv5odetj1921ttoh2vu8slrgqp6kuhdpmngm   2m52s       Warning   ExceededGracePeriod   pod/service-1-587f6f88bc-gzgxm                                    Container runtime did not kill the pod within specified grace period.
7vmqhnt4rkv5odetj1921ttoh2vu8slrgqp6kuhdpmngm   2m31s       Normal    Scheduled             pod/service-1-587f6f88bc-jzwpl                                    Successfully assigned 7vmqhnt4rkv5odetj1921ttoh2vu8slrgqp6kuhdpmngm/service-1-587f6f88bc-jzwpl to worker-01.hurricane2
7vmqhnt4rkv5odetj1921ttoh2vu8slrgqp6kuhdpmngm   2m30s       Normal    Started               pod/service-1-587f6f88bc-jzwpl                                    Started container service-1
7vmqhnt4rkv5odetj1921ttoh2vu8slrgqp6kuhdpmngm   2m30s       Normal    Pulled                pod/service-1-587f6f88bc-jzwpl                                    Container image "andrey01/falcon7b:0.4" already present on machine
7vmqhnt4rkv5odetj1921ttoh2vu8slrgqp6kuhdpmngm   2m30s       Normal    Created               pod/service-1-587f6f88bc-jzwpl                                    Created container service-1
7vmqhnt4rkv5odetj1921ttoh2vu8slrgqp6kuhdpmngm   79s         Normal    Killing               pod/service-1-587f6f88bc-jzwpl                                    Stopping container service-1
7vmqhnt4rkv5odetj1921ttoh2vu8slrgqp6kuhdpmngm   79s         Warning   Evicted               pod/service-1-587f6f88bc-jzwpl                                    Pod ephemeral local storage usage exceeds the total limit of containers 1073741824.
7vmqhnt4rkv5odetj1921ttoh2vu8slrgqp6kuhdpmngm   69s         Warning   ExceededGracePeriod   pod/service-1-587f6f88bc-jzwpl                                    Container runtime did not kill the pod within specified grace period.
7vmqhnt4rkv5odetj1921ttoh2vu8slrgqp6kuhdpmngm   47s         Normal    Scheduled             pod/service-1-587f6f88bc-8hnmm                                    Successfully assigned 7vmqhnt4rkv5odetj1921ttoh2vu8slrgqp6kuhdpmngm/service-1-587f6f88bc-8hnmm to worker-01.hurricane2
7vmqhnt4rkv5odetj1921ttoh2vu8slrgqp6kuhdpmngm   46s         Normal    Pulled                pod/service-1-587f6f88bc-8hnmm                                    Container image "andrey01/falcon7b:0.4" already present on machine
7vmqhnt4rkv5odetj1921ttoh2vu8slrgqp6kuhdpmngm   46s         Normal    Created               pod/service-1-587f6f88bc-8hnmm                                    Created container service-1
7vmqhnt4rkv5odetj1921ttoh2vu8slrgqp6kuhdpmngm   46s         Normal    Started               pod/service-1-587f6f88bc-8hnmm                                    Started container service-1

Suggested Change:

Lower the default value of terminated-pod-gc-threshold to 10 to:

Prevent the excessive buildup of failed pods in clusters with high pod churn.
Ensure timely garbage collection and improve overall cluster hygiene.
Simplify debugging and resource tracking by maintaining a manageable number of terminated pods.
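
Before changing the value, it may be worth checking what a control plane is actually running with. The sketch below pulls the --terminated-pod-gc-threshold value out of a kube-controller-manager static pod manifest read on stdin, and falls back to the upstream default of 12500 when the flag is not set. The function name is made up, and the manifest path in the usage comment is the usual kubeadm location, which is an assumption and may differ on a given provider.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the terminated-pod-gc-threshold found in a
# kube-controller-manager manifest read from stdin.
# Prints 12500 (the upstream default) when the flag is absent.
gc_threshold_from_manifest() {
  local value
  value="$(sed -n 's/.*--terminated-pod-gc-threshold=\([0-9][0-9]*\).*/\1/p' | head -n 1)"
  echo "${value:-12500}"
}

# Usage on a control-plane node (path is an assumption, typical for kubeadm):
#   gc_threshold_from_manifest < /etc/kubernetes/manifests/kube-controller-manager.yaml
```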

Rationale:

This adjustment ensures that the garbage collection mechanism aligns with real-world scenarios where pod failures are anticipated but should not result in unbounded accumulation of terminated pods. A lower threshold strikes a balance between retaining recent pod history for troubleshooting and maintaining an efficient cluster state.

Would appreciate feedback from the community on the feasibility and implications of this proposed change.

andy108369 (Collaborator, Author) · 2024-12-20T11:36:05Z

I have a hunch that this sort of issue may have contributed to (not caused, but contributed to) the following issues:

communication issues with the RPC node, eventually causing lease removal: "Akash Provider incorrectly removes deployments during RPC communication issues; improve resilience to intermittent RPC failures" (support#17);
and potentially "kube-builder: ClusterParams() returned result of unexpected type (%!s(<nil>)) on send-manifest (on tx deployment update)" (support#152), where the deployment would get removed upon send-manifest. Provider v0.6.5-rc6 fixes this issue (I left a comment in the issue back then).

I saw kubectl slow down on the Hurricane provider, and at times fail to respond at all, as failed pods kept accumulating, until I cleared them by running `kubectl delete pods -A --field-selector status.phase=Failed`.

I also noticed that upon send-manifest earlier today (I was updating the image tag), the provider reported that it had updated the deployment, but no new replicaset/pod was spawned. Only the second send-manifest attempt actually updated the deployment.

For example, there were 879 failed pods on the provider before I cleared them:

https://gist.github.com/andy108369/b277e5b27fd9d18f089bc47914c5b2b8

root@control-01:~# jobs
[1]+  Running                 kubectl delete pods -A --field-selector status.phase=Failed &
root@control-01:~# ns=7vmqhnt4rkv5odetj1921ttoh2vu8slrgqp6kuhdpmngm
root@control-01:~# kubectl -n $ns get pods -o wide
Error from server (Timeout): the server was unable to return a response in the time allotted, but may still be processing the request (get pods)
root@control-01:~# pod "service-1-587f6f88bc-5s6dg" deleted
pod "service-1-587f6f88bc-5sfzc" deleted
pod "service-1-587f6f88bc-5vfs4" deleted
Error from server (Timeout): Timeout: request did not complete within requested timeout - context deadline exceeded
Error from server (Timeout): Timeout: request did not complete within requested timeout - context deadline exceeded
The connection to the server 127.0.0.1:6443 was refused - did you specify the right host or port?
The connection to the server 127.0.0.1:6443 was refused - did you specify the right host or port?
The connection to the server 127.0.0.1:6443 was refused - did you specify the right host or port?
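
Since the cluster-wide delete above coincided with API server timeouts, one option might be to clean up one namespace at a time with a pause in between. This is an untested sketch, not something that was run on the provider: the function name, the KUBECTL override, and the PAUSE_SECONDS default are all made up for illustration, and whether it actually reduces API server load is an assumption.

```shell
#!/usr/bin/env bash
# Hypothetical helper: delete failed pods namespace by namespace,
# sleeping between namespaces. KUBECTL and PAUSE_SECONDS can be overridden.
delete_failed_pods_by_namespace() {
  local kubectl_cmd="${KUBECTL:-kubectl}"
  local pause="${PAUSE_SECONDS:-5}"
  local ns
  # Collect the distinct namespaces that currently hold failed pods.
  "$kubectl_cmd" get pods -A --field-selector status.phase=Failed --no-headers |
    awk 'NF { print $1 }' | sort -u |
    while read -r ns; do
      # --wait=false returns without waiting for each pod to be finalized.
      "$kubectl_cmd" -n "$ns" delete pods \
        --field-selector status.phase=Failed --wait=false </dev/null
      sleep "$pause"
    done
}

# Usage (needs access to a cluster):
#   PAUSE_SECONDS=10 delete_failed_pods_by_namespace
```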

The Kubernetes change itself should be rather simple:

kubespray-based K8s: https://github.com/kubernetes-sigs/kubespray/blob/v2.26.0/roles/kubernetes/control-plane/defaults/main/main.yml#L89
k3s-based K8s: see "terminated-pod-gc-threshold appears to have no effect" (k3s-io/k3s#10448, comment)
