Controller is not launching more pods even though there's a lot of jobs queued #1492
bmbferreira asked this question in Questions · Unanswered · 2 comments, 1 reply
-
Controller logs would help.
-
Hi @toast-gear! I'll try to get them when I see this behaviour happening again. Meanwhile, could this issue be related to this? I think my configuration for the …
-
Hi! I'm having a hard time understanding why the controller is not launching more pods even though I have an enormous queue of jobs waiting to be executed.
Here is the situation: I configured autoscaling based on the recommended workflow_job webhook, setting a minimum of 1 replica and a maximum of 10. The configuration is this:
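(The original configuration block did not come through in this thread. As a stand-in, here is a minimal sketch of a webhook-driven setup matching the description above, assuming the summerwind actions-runner-controller `HorizontalRunnerAutoscaler` API; the resource names and repository are placeholders, not the poster's actual values.)

```yaml
# Hypothetical names throughout; the real configuration was not captured.
apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
  name: example-autoscaler
spec:
  scaleTargetRef:
    name: example-runnerdeployment   # the RunnerDeployment being scaled
  minReplicas: 1
  maxReplicas: 10
  scaleUpTriggers:
    - githubEvent:
        workflowJob: {}              # scale up on workflow_job webhook events
      amount: 1                      # runners added per matching event
      duration: "30m"                # how long the added capacity is kept
```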
What is happening is that we have some jobs that get stuck in an external tool, so their pods sometimes stay running for about 2 hours before completing. But even though these long-running jobs are a problem in themselves, I never see the number of pods reach the maximum number of replicas I configured (20). I get around 3-4 pods running for a couple of hours and an enormous backlog of queued jobs. If I manually delete these long-running jobs, the queue starts to recover.
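(One detail worth flagging about the sketch above, still assuming the summerwind API: the capacity added by each webhook trigger only lasts for its `duration`, after which the reservation expires even if the runner is still busy. If jobs can run for 2 hours against a 30-minute duration, the desired replica count would decay below the maximum while jobs are still queued, which resembles the symptom described here. An illustrative fragment:)

```yaml
scaleUpTriggers:
  - githubEvent:
      workflowJob: {}
    duration: "2h"   # illustrative: longer than the longest expected job
```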
What am I missing? I was expecting the maximum number of replicas to be used even while other jobs run for a long time.
Also, the infrastructure is not the problem: the cluster autoscaler is working fine and launches new nodes when new pods are created. The problem here seems to be the controller, because I don't see new pods starting for the queued jobs.
Thanks in advance for your help!