If etcd fails to sync config during initial start sequence and k0s restarts, node creates a new cluster rather than joining existing #5149

Closed
4 tasks done
emosbaugh opened this issue Oct 23, 2024 · 17 comments · Fixed by #5151
@emosbaugh
Contributor

emosbaugh commented Oct 23, 2024

Before creating an issue, make sure you've checked the following:

  • You are running the latest released version of k0s
  • Make sure you've searched for existing issues, both open and closed
  • Make sure you've searched for PRs too, a fix might've been merged already
  • You're looking at docs for the released version, "main" branch docs are usually ahead of released versions.

Platform

No response

Version

v1.28.14+k0s.0

Sysinfo

`k0s sysinfo`
➡️ Please replace this text with the output of `k0s sysinfo`. ⬅️

What happened?

When I join many controller nodes in parallel, the Kubernetes API can become unstable for a while. As a result, the initial etcd join fails to sync the etcd config and the k0s process exits.

Oct 18 04:04:39 node-a50a2-04 k0s[3461]: time="2024-10-18 04:04:39" level=info msg="starting Etcd"
Oct 18 04:04:39 node-a50a2-04 k0s[3461]: time="2024-10-18 04:04:39" level=info msg="Starting etcd"
Oct 18 04:04:50 node-a50a2-04 k0s[3461]: Error: failed to start controller node components: failed to sync etcd config: unexpected response status when trying to join etcd cluster: 500 Internal Server Error
Oct 18 04:04:50 node-a50a2-04 systemd[1]: k0scontroller.service: Main process exited, code=exited, status=1/FAILURE
Oct 18 04:04:50 node-a50a2-04 systemd[1]: k0scontroller.service: Failed with result 'exit-code'.
Oct 18 04:04:50 node-a50a2-04 systemd[1]: k0scontroller.service: Consumed 3.172s CPU time.
Oct 18 04:05:00 node-a50a2-04 systemd[1]: k0scontroller.service: Scheduled restart job, restart counter is at 1.
Oct 18 04:05:00 node-a50a2-04 systemd[1]: Stopped k0scontroller.service - k0s - Zero Friction Kubernetes.
Oct 18 04:05:00 node-a50a2-04 systemd[1]: k0scontroller.service: Consumed 3.172s CPU time.
Oct 18 04:05:00 node-a50a2-04 systemd[1]: Started k0scontroller.service - k0s - Zero Friction Kubernetes.

When k0s starts back up, it appears to create a new cluster rather than join the existing one.

Oct 18 04:05:02 node-a50a2-04 k0s[3512]: time="2024-10-18 04:05:02" level=info msg="starting Etcd"
Oct 18 04:05:02 node-a50a2-04 k0s[3512]: time="2024-10-18 04:05:02" level=info msg="Starting etcd"
Oct 18 04:05:03 node-a50a2-04 k0s[3512]: time="2024-10-18 04:05:03" level=info msg="Starting to supervise" component=etcd
Oct 18 04:05:03 node-a50a2-04 k0s[3512]: time="2024-10-18 04:05:03" level=info msg="Started successfully, go nuts pid 3537" component=etcd
...
Oct 18 04:05:03 node-a50a2-04 k0s[3512]: time="2024-10-18 04:05:03" level=info msg="{\"level\":\"info\",\"ts\":\"2024-10-18T04:05:03.727336Z\",\"caller\":\"etcdmain/etcd.go:73\",\"msg\":\"Running: \",\"args\":[\"/var/lib/embedded-cluster/k0s/bin/etcd\",\"--tls-min-version=TLS1.2\",\"--data-dir=/var/lib/embedded-cluster/k0s/etcd\",\"--name=node-a50a2-04\",\"--key-file=/var/lib/embedded-cluster/k0s/pki/etcd/server.key\",\"--peer-trusted-ca-file=/var/lib/embedded-cluster/k0s/pki/etcd/ca.crt\",\"--peer-key-file=/var/lib/embedded-cluster/k0s/pki/etcd/peer.key\",\"--peer-cert-file=/var/lib/embedded-cluster/k0s/pki/etcd/peer.crt\",\"--listen-client-urls=https://127.0.0.1:2379\",\"--listen-peer-urls=https://10.0.0.6:2380\",\"--log-level=info\",\"--auth-token=jwt,pub-key=/var/lib/embedded-cluster/k0s/pki/etcd/jwt.pub,priv-key=/var/lib/embedded-cluster/k0s/pki/etcd/jwt.key,sign-method=RS512,ttl=10m\",\"--cipher-suites=TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305_SHA256,TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305_SHA256\",\"--advertise-client-urls=https://127.0.0.1:2379\",\"--client-cert-auth=true\",\"--peer-client-cert-auth=true\",\"--enable-pprof=false\",\"--initial-advertise-peer-urls=https://10.0.0.6:2380\",\"--cert-file=/var/lib/embedded-cluster/k0s/pki/etcd/server.crt\",\"--trusted-ca-file=/var/lib/embedded-cluster/k0s/pki/etcd/ca.crt\"]}" component=etcd stream=stderr
...
Oct 18 04:05:03 node-a50a2-04 k0s[3512]: time="2024-10-18 04:05:03" level=info msg="{\"level\":\"info\",\"ts\":\"2024-10-18T04:05:03.839004Z\",\"caller\":\"etcdserver/server.go:738\",\"msg\":\"started as single-node; fast-forwarding election ticks\",\"local-member-id\":\"40e0ff5ee27c98d0\",\"forward-ticks\":9,\"forward-duration\":\"900ms\",\"election-ticks\":10,\"election-timeout\":\"1s\"}" component=etcd stream=stderr

This seems to be caused by a bad assumption in the following function:

// If we've got CA in place we assume the node has already joined previously
func (c *command) needToJoin(nodeConfig *v1beta1.ClusterConfig) bool {
	if file.Exists(filepath.Join(c.K0sVars.CertRootDir, "ca.key")) &&
		file.Exists(filepath.Join(c.K0sVars.CertRootDir, "ca.crt")) {
		return false
	}
	if nodeConfig.Spec.Storage.Type == v1beta1.EtcdStorageType && !nodeConfig.Spec.Storage.Etcd.IsExternalClusterUsed() {
		return !file.Exists(filepath.Join(c.K0sVars.EtcdDataDir, "member", "snap", "db"))
	}
	return true
}
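A minimal sketch of one way the check could be reordered so that, for embedded etcd, the etcd database file decides the outcome and the CA files are only consulted for other storage types. This is an illustration of the idea discussed in this issue, not necessarily what #5151 implements. The `joinPaths` type, the `internalEtcd` flag and the injected `exists` function are hypothetical stand-ins for `c.K0sVars`, the storage type check and `file.Exists`.

```go
package main

import (
	"fmt"
	"path/filepath"
)

// joinPaths is a hypothetical stand-in for the two K0sVars fields the check uses.
type joinPaths struct {
	CertRootDir string
	EtcdDataDir string
}

// needToJoin reports whether the node still has to join an existing cluster.
// exists is injected so the decision can be exercised without touching disk.
func needToJoin(p joinPaths, internalEtcd bool, exists func(string) bool) bool {
	if internalEtcd {
		// For embedded etcd, only the presence of the etcd database counts
		// as evidence of an earlier successful join.
		return !exists(filepath.Join(p.EtcdDataDir, "member", "snap", "db"))
	}
	// For other storage types, fall back to the CA files as before.
	if exists(filepath.Join(p.CertRootDir, "ca.key")) &&
		exists(filepath.Join(p.CertRootDir, "ca.crt")) {
		return false
	}
	return true
}

func main() {
	p := joinPaths{CertRootDir: "pki", EtcdDataDir: "etcd"}
	// State after a failed first join: the CA files exist, etcd never started.
	onDisk := map[string]bool{
		filepath.Join("pki", "ca.key"): true,
		filepath.Join("pki", "ca.crt"): true,
	}
	exists := func(path string) bool { return onDisk[path] }
	fmt.Println(needToJoin(p, true, exists)) // prints "true": the node must still join
}
```

With the original ordering, the same on-disk state returns false, which is the case reported above.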

Steps to reproduce

  1. Join many controller nodes in parallel

Expected behavior

When a node is joined, it joins the existing cluster.

Actual behavior

In some circumstances, joining a node creates a new cluster instead.

Screenshots and logs

k0scontroller-logs.txt
k0scontroller-logs.txt
k0scontroller-logs.txt
k0scontroller-logs.txt
k0scontroller-logs.txt

Additional context

No response

@emosbaugh emosbaugh added the bug Something isn't working label Oct 23, 2024
@twz123
Member

twz123 commented Oct 23, 2024

What would be a better way to determine if an existing cluster should be joined? I could imagine that k0s could delete the certs if joining the cluster fails, too...

@twz123
Member

twz123 commented Oct 23, 2024

Also: k0s could retry 5xx responses in a back-off loop...

@twz123
Member

twz123 commented Oct 23, 2024

@emosbaugh regarding #5151 and #5149 (comment): Would it make more sense to introduce a special marker file in the k0s data dir that k0s writes as soon as the join process is finished, instead of trying to check several places?
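The marker file idea could look roughly like the sketch below. The file name `.k0s-joined` and the helper names are assumptions made up for illustration; nothing here is taken from k0s itself.

```go
package main

import (
	"errors"
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
)

// joinMarker is a hypothetical file name, not one that k0s defines.
const joinMarker = ".k0s-joined"

func markerPath(dataDir string) string { return filepath.Join(dataDir, joinMarker) }

// markJoined records that the join process finished successfully.
func markJoined(dataDir string) error {
	return os.WriteFile(markerPath(dataDir), []byte("joined\n"), 0o600)
}

// hasJoined reports whether a previous start completed the join.
func hasJoined(dataDir string) (bool, error) {
	_, err := os.Stat(markerPath(dataDir))
	if errors.Is(err, fs.ErrNotExist) {
		return false, nil
	}
	return err == nil, err
}

// forgetJoin removes the marker so the next start joins again.
func forgetJoin(dataDir string) error {
	err := os.Remove(markerPath(dataDir))
	if errors.Is(err, fs.ErrNotExist) {
		return nil
	}
	return err
}

func main() {
	dir, err := os.MkdirTemp("", "k0s-data-")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(dir)

	before, _ := hasJoined(dir)
	if err := markJoined(dir); err != nil {
		panic(err)
	}
	after, _ := hasJoined(dir)
	fmt.Println(before, after) // prints "false true"
}
```

The marker would have to be written only after the join has fully succeeded; writing it earlier would reintroduce the same false positive as the CA file check.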

@emosbaugh
Contributor Author

Also: k0s could retry 5xx responses in a back-off loop...

This already happens today but eventually it gives up in this case

https://github.com/k0sproject/k0s/blob/main/pkg/component/controller/etcd.go#L103-L115
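For readers who do not want to follow the link, a bounded retry of this kind can be sketched as follows. This is not the k0s implementation; the `backoff` type, `retryJoin` and the status-returning `attempt` callback are hypothetical names chosen for the example. It retries on transport errors and 5xx responses, doubles the delay up to a cap, and gives up after a fixed number of attempts, which is the "eventually it gives up" behaviour described above.

```go
package main

import (
	"fmt"
	"time"
)

// backoff bounds the retry loop. All names here are hypothetical.
type backoff struct {
	Initial  time.Duration // delay before the second attempt
	Max      time.Duration // upper bound for the delay
	Attempts int           // total number of attempts
}

// retryJoin calls attempt until it reports a 2xx status, a non-retryable
// status, or the attempts are used up.
func retryJoin(b backoff, attempt func() (status int, err error)) error {
	attempts := b.Attempts
	if attempts < 1 {
		attempts = 1
	}
	delay := b.Initial
	var lastErr error
	for i := 0; i < attempts; i++ {
		status, err := attempt()
		switch {
		case err != nil:
			lastErr = err // transport error: retry
		case status >= 200 && status < 300:
			return nil
		case status >= 500:
			lastErr = fmt.Errorf("unexpected response status: %d", status)
		default:
			// Anything else (for example 4xx) will not fix itself by waiting.
			return fmt.Errorf("join rejected with status %d", status)
		}
		if i < attempts-1 {
			time.Sleep(delay)
			delay *= 2
			if delay > b.Max {
				delay = b.Max
			}
		}
	}
	return fmt.Errorf("giving up after %d attempts: %w", attempts, lastErr)
}

func main() {
	statuses := []int{500, 500, 200}
	i := 0
	err := retryJoin(
		backoff{Initial: 10 * time.Millisecond, Max: 40 * time.Millisecond, Attempts: 5},
		func() (int, error) {
			s := statuses[i]
			i++
			return s, nil
		},
	)
	fmt.Println(i, err) // prints "3 <nil>"
}
```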

@emosbaugh
Contributor Author

@emosbaugh regarding #5151 and #5149 (comment): Would it make more sense to introduce a special marker file in the k0s data dir that k0s writes as soon as the join process is finished, instead of trying to check several places?

That makes sense to me. Is there a directory and path that would be appropriate to store this file?

@emosbaugh
Contributor Author

@emosbaugh regarding #5151 and #5149 (comment): Would it make more sense to introduce a special marker file in the k0s data dir that k0s writes as soon as the join process is finished, instead of trying to check several places?

That makes sense to me. Is there a directory and path that would be appropriate to store this file?

Thinking a bit more about this... I feel like it is better to have a single source of truth, ideally etcd itself. We could use the etcd database file for that, as I have it, or perhaps use the result of syncEtcdConfig to detect whether the current node is already a member of the cluster.

@jnummelin
Member

Not sure I grokked this correctly, but I think the problem is not only on the joining side of the code but also on the join API side of things. What happens is that, with 1 etcd member, you now create join requests for 2 more in parallel. On the etcd side, once we create the new member on node 1 while node 2 has not really joined the cluster yet (its etcd hasn't been started yet), there's no quorum. And at the same time we do the same for node 3. So we basically bork etcd ourselves.
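The quorum arithmetic behind this can be spelled out with a small sketch. It only models the majority rule (more than half of the registered members must be running) and does not talk to etcd; the function names are made up for the example.

```go
package main

import "fmt"

// quorum returns how many running members a cluster of the given size needs.
func quorum(clusterSize int) int { return clusterSize/2 + 1 }

// hasQuorum reports whether enough of the registered members are running.
func hasQuorum(registered, started int) bool { return started >= quorum(registered) }

func main() {
	// One member is running. Further members are registered through the join
	// API before their etcd processes have started.
	for registered := 1; registered <= 3; registered++ {
		fmt.Printf("registered=%d started=1 quorum=%d ok=%v\n",
			registered, quorum(registered), hasQuorum(registered, 1))
	}
}
```

With a single running member, registering even one additional member raises the required quorum to 2, so the cluster has no quorum until the new etcd process actually starts.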

I feel like it is better to have a single source of truth, ideally etcd itself.

I agree with this. And reflecting this to what I wrote above, I think the join API should actually check etcd state more closely to see if we can actually allow another member to join. I mean when we have 1 member up, we can only allow 1 more to be fully joined (member created and actually reached quorum). Only after this we can allow for the next one and so on.

On the join API side of things we'd probably want to use some suitable HTTP status code to tell that "I cannot allow you to join at this time as there's no quorum, try again in a bit". Maybe 503 with some Retry-After header.
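A rough sketch of what such a response could look like, using only the Go standard library. The `member` type, its `Started` flag, the `/join` path and the 5 second Retry-After value are assumptions for illustration; the real join API and the way it inspects etcd membership may differ.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// member is a hypothetical view of an etcd member as seen by the join API.
type member struct {
	Name    string
	Started bool // false while the member is registered but its etcd is not up yet
}

// canAcceptJoin allows a new join only when every registered member has started,
// so joins are effectively admitted one at a time.
func canAcceptJoin(members []member) bool {
	for _, m := range members {
		if !m.Started {
			return false
		}
	}
	return true
}

// joinHandler answers 503 with a Retry-After header while a previous join is
// still in flight. list is injected so the sketch needs no etcd client.
func joinHandler(list func() []member) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		if !canAcceptJoin(list()) {
			w.Header().Set("Retry-After", "5")
			http.Error(w, "a previous member is still joining, try again in a bit",
				http.StatusServiceUnavailable)
			return
		}
		// The actual member registration would happen here.
		w.WriteHeader(http.StatusOK)
	}
}

func main() {
	pending := []member{{Name: "node-1", Started: true}, {Name: "node-2", Started: false}}
	rec := httptest.NewRecorder()
	req := httptest.NewRequest(http.MethodPost, "/join", nil)
	joinHandler(func() []member { return pending })(rec, req)
	fmt.Println(rec.Code, rec.Header().Get("Retry-After")) // prints "503 5"
}
```

A joining node could then treat 503 as "wait and retry" rather than as a hard failure.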

@twz123
Member

twz123 commented Oct 30, 2024

Thinking a bit more about this... I feel like it is better to have a single source of truth, ideally etcd itself. We could use the etcd database file for that, as I have it, or perhaps use the result of syncEtcdConfig to detect whether the current node is already a member of the cluster.

I kinda like the marker file because it's a) super-dumb to implement, b) super-easy to delete, in case somebody wants to enforce a rejoin, and c) backend agnostic. It would also work with, say, kine/NATS types of setups.

@twz123
Member

twz123 commented Oct 30, 2024

On the join API side of things we'd probably want to use some suitable HTTP status code to tell that "I cannot allow you to join at this time as there's no quorum, try again in a bit". Maybe 503 with some Retry-After header.

Agree: #5149 (comment)

@emosbaugh
Contributor Author

emosbaugh commented Oct 30, 2024

I agree that the issue is that the API is unstable when many nodes are joining and there is no quorum. In my scenario the API eventually becomes stable and the node would be able to join. It just does not wait long enough.

When it does give up and restart, it checks the PKI certs to decide whether it has already joined, which is the wrong signal, because it has not. When it sees that they exist, it determines that it does not need to join and starts a new cluster rather than joining the existing one (joinClient is nil). If you run kubectl get node on the node that started its own cluster, it will be 1 of 1 node in a healthy cluster.

Are you suggesting that we continue to retry forever until join is successful? What if in other scenarios the API does not become healthy? What if the process restarts for another reason? It will still exhibit the same behavior of creating a new cluster. Therefore, in my opinion, the real issue here is not one of backoff/retry but the check for "do I need to join?".

@jnummelin
Member

Are you suggesting that we continue to retry forever until join is successful?

I don't think we want to wait forever, but for longer than we currently do for sure.

Therefore, in my opinion, the real issue here is not one of backoff/retry but the check for "do I need to join?".

Absolutely, that is part of the problem, and maybe the most important part. I'm just pointing out that the API side has some issues as well, which we also want to address. By fixing both sides we make it much more robust.

@emosbaugh
Contributor Author

Thinking a bit more about this... I feel like it is better to have a single source of truth, ideally etcd itself. We could use the etcd database file for that, as I have it, or perhaps use the result of syncEtcdConfig to detect whether the current node is already a member of the cluster.

I kinda like the marker file because it's a) super-dumb to implement, b) super-easy to delete, in case somebody wants to enforce a rejoin, and c) backend agnostic. It would also work with, say, kine/NATS types of setups.

@twz123 Although I'm a little unclear on the state of the filesystem after certain errors, I'm concerned that if the marker file were missing and there were an etcd database, then the only way to proceed would be for k0s to delete the db. I'm of the opinion that k0s should not take such a drastic action, and therefore it is probably best to use the db as the source of truth and require the user to intervene manually when in this state.

@twz123
Member

twz123 commented Nov 13, 2024

Although I'm a little unclear on the state of the filesystem after certain errors, I'm concerned that if the marker file were missing and there were an etcd database, then the only way to proceed would be for k0s to delete the db. I'm of the opinion that k0s should not take such a drastic action, and therefore it is probably best to use the db as the source of truth and require the user to intervene manually when in this state.

Fair enough. It's just unfortunate that these storage-specific implementation details leak into the more storage-agnostic parts of k0s. A proper abstraction around that would make it nicer IMO. But let's leave that for the future.

Contributor

The issue is marked as stale since no activity has been recorded in 30 days

@github-actions github-actions bot added the Stale label Dec 13, 2024
@twz123 twz123 removed the Stale label Dec 16, 2024
@emosbaugh
Contributor Author

This is still an issue

@juanluisvaladas
Contributor

juanluisvaladas commented Dec 23, 2024

Hi @emosbaugh, I see #5151 is approved. Is there anything pending besides merging, or is it just that last step? We just merged it, but you're welcome to merge it yourself :)

@emosbaugh
Contributor Author

Nothing else pending. Thanks @juanluisvaladas !


4 participants