-
We have a set of Kubernetes clusters set up with agents connecting them to a Teleport cluster. This works and has worked for a long time on all clusters, but recently the agent in one of our new clusters keeps having connection issues and crashing. We get the following logs in the agent:
The readiness probe then constantly fails with 400 and 503 errors. This agent has exactly the same manifest/setup as all our other agents, which do work (in fact they're all deployed from an ArgoCD ApplicationSet), but this one agent is the only one that keeps breaking. The cluster runs on Azure, and we have other Azure clusters that do work. Does anyone have an idea of where to get started with fixing this issue? We've been looking into it for a while but we really cannot figure it out.
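As a starting point for diagnosing this, the pod events and the logs of the previous (crashed) container usually show why the readiness probe is failing. The namespace and pod name below are placeholders; adjust them to your deployment:

```shell
# Show recent events, including readiness probe failures (placeholder names)
kubectl -n teleport describe pod teleport-agent-0

# Logs from the last crashed container instance
kubectl -n teleport logs teleport-agent-0 --previous
```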
Replies: 4 comments 2 replies
-
Any updates on this? I'm having the same trouble.
-
Same sort of failures suddenly started appearing here.
-
This is likely an issue to do with the pod not having valid instance credentials (which are used for sending version information to the control plane) and not being able to use its existing join method to generate new ones, or some kind of internal UUID mismatch. The simplest way to fix this is to delete the pod's state secret, which holds its credentials, then restart the pod so it joins the cluster as a fresh agent. This requires that you have a valid join method configured in the chart values, so update your values and run a `helm upgrade` first if needed. The secret is in the same namespace as the agent pod and is called `<helm-release-name>-0-state`.
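Concretely, the reset described above looks like this. The release name `teleport-agent` and namespace `teleport` are placeholders; substitute your own:

```shell
# Delete the agent's state secret (named <helm-release-name>-0-state)
kubectl -n teleport delete secret teleport-agent-0-state

# Delete the pod so the StatefulSet recreates it; on startup it
# re-joins the Teleport cluster via the configured join method
kubectl -n teleport delete pod teleport-agent-0
```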
-
Can confirm, deleting the secret and the pod for our failing agent worked! The container restarted, the secret was recreated, and it joined just fine. For reference, we are configured to use IAM roles for agents joining the cluster, so this was very easy. Had we still been using the older method (creating a short-lived token on the cluster for the initial join, after which the agents refresh their credentials regularly), this would have been more of a hassle.
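For anyone wanting to switch to a delegated join method before doing the reset, a sketch of the upgrade, assuming the `teleport-kube-agent` chart's `joinParams` values and a pre-created IAM join token named `iam-token` (release name and namespace are placeholders):

```shell
# Point the agent at an IAM join token so it can re-join on its own
helm upgrade teleport-agent teleport/teleport-kube-agent \
  --namespace teleport \
  --reuse-values \
  --set joinParams.method=iam \
  --set joinParams.tokenName=iam-token
```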