
fix: join node creates new cluster when initial etcd sync config fails #5151

Conversation

emosbaugh (Contributor)
Description

Fixes #5149

If the etcd join fails to sync the etcd config and the k0s process exits, the PKI CA files already exist on disk, so on restart etcd creates a new cluster rather than joining the existing one. Rather than checking the PKI dir for embedded etcd, check that the etcd data directory exists, as we do here.

I am open to suggestions here if I am checking the wrong thing, as I cannot test this and am taking a guess at a solution.
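A minimal sketch of the proposed check, assuming a typical etcd layout where a `member` directory is created under the data dir once etcd has actually run; the helper name, the `member` subdirectory, and the `/var/lib/k0s/etcd` path are illustrative assumptions, not k0s's actual code:

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// etcdDataDirExists reports whether the etcd data directory already holds
// member state. PKI CA files can be left behind by a failed join attempt,
// so they are not a reliable signal that this node ever joined a cluster;
// the data directory is only populated once etcd has actually started.
func etcdDataDirExists(dataDir string) (bool, error) {
	entries, err := os.ReadDir(filepath.Join(dataDir, "member"))
	if os.IsNotExist(err) {
		return false, nil // etcd never ran here: safe to (re)try the join flow
	}
	if err != nil {
		return false, err
	}
	return len(entries) > 0, nil
}

func main() {
	ok, err := etcdDataDirExists("/var/lib/k0s/etcd") // assumed default path
	fmt.Println(ok, err)
}
```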

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update

How Has This Been Tested?

  • Manual test
  • Auto test added

Checklist:

  • My code follows the style guidelines of this project
  • My commit messages are signed-off
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • Any dependent changes have been merged and published in downstream modules
  • I have checked my code and corrected any misspellings

@emosbaugh (Contributor Author)
@jnummelin @twz123 I'm a bit stuck on how to proceed with handling this case. Could you take another look at this PR and give me some guidance? Thank you!

@twz123 (Member) left a comment:

Looks simple enough! However, I'd leave out 9706878 for now, since k0s will retry all join errors no matter what caused them. So returning a 503 instead of a 500 is probably not really worth it?

cmd/controller/controller.go (resolved)
```diff
@@ -107,6 +107,8 @@ func (e *Etcd) syncEtcdConfig(ctx context.Context, etcdRequest v1beta1.EtcdReque
 			etcdResponse, err = e.JoinClient.JoinEtcd(ctx, etcdRequest)
 			return err
 		},
+		retry.Delay(1*time.Second),
```
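For context, the hunk above sits inside a retry.Do call from github.com/avast/retry-go. Here's a standalone sketch of that pattern; joinEtcd stands in for e.JoinClient.JoinEtcd, and the Attempts and MaxDelay values are assumptions taken from the discussion below, not necessarily what was merged:

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"

	"github.com/avast/retry-go/v4"
)

// joinEtcd is a stand-in for the real e.JoinClient.JoinEtcd call.
func joinEtcd(ctx context.Context) error {
	return errors.New("etcd config sync failed")
}

func main() {
	ctx := context.Background()
	err := retry.Do(
		func() error { return joinEtcd(ctx) },
		retry.Context(ctx),            // stop retrying once the context is done
		retry.Delay(1*time.Second),    // base delay; the default policy doubles it per attempt
		retry.MaxDelay(1*time.Minute), // cap on any single backoff (assumed value)
		retry.Attempts(20),            // assumed; see the timing discussion below
	)
	fmt.Println("join result:", err)
}
```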
twz123 (Member):

Can you explain, either in the commit message or in a code comment, why the delay was increased? If I'm not mistaken, this will now block for ~17 minutes, whereas before it blocked for only around 100 seconds.

emosbaugh (Contributor Author):

done

twz123 (Member):

I did the calculation again, as I forgot to take into account the max delay. I think it's 5 minutes overall, not 15 ... 👼

emosbaugh (Contributor Author):

I increased the number of attempts to 20, so I think it's 127 seconds plus something like 12 more minutes.
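To make the arithmetic concrete, here's a rough model of retry-go's default exponential backoff (each wait doubles, capped at MaxDelay); the 1-second base, 1-minute cap, and 20 attempts are this thread's numbers and my assumptions, not a quote of the final code:

```go
package main

import (
	"fmt"
	"time"
)

// totalWait models retry-go's default BackOffDelay: the n-th wait is
// delay * 2^n, capped at maxDelay, with attempts-1 waits in total.
func totalWait(attempts int, delay, maxDelay time.Duration) time.Duration {
	var total time.Duration
	d := delay
	for i := 1; i < attempts; i++ {
		if d > maxDelay {
			d = maxDelay
		}
		total += d
		d *= 2
	}
	return total
}

func main() {
	// 1+2+4+8+16+32 = 63s of doubling waits, then 13 waits capped at 60s
	// (780s), so roughly 14 minutes before the final attempt.
	fmt.Println(totalWait(20, 1*time.Second, 1*time.Minute)) // prints 14m3s
}
```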

twz123 (Member):

Okay. Is it likely that joining will succeed after 15 minutes when it didn't after 5? I'm just wondering whether it makes sense to wait that long, as surrounding tooling may have its own, shorter timeouts. How long does k0sctl wait until it aborts the join process? /cc @kke

emosbaugh (Contributor Author):

There is some discussion here related to why the timeout was increased.

It looks like k0sctl waits 2 minutes, but it does not face the issue we are facing because its join is sequential.

emosbaugh (Contributor Author):

Ping @kke @twz123, I'm still waiting on a response to the above.

jnummelin (Member):

So what is the actual timeout now with this implementation? I don't think it makes sense to have it at more than, say, 5 minutes; if the join doesn't work within the first 5 minutes, what would make it work after that?

emosbaugh (Contributor Author):

@jnummelin I've decreased it to ~4 minutes.

@emosbaugh emosbaugh force-pushed the issue-5149-etcd-creates-new-cluster-rather-than-join-if-sync-fails branch from 6f772c3 to 7429cb8 on November 8, 2024 20:31
@emosbaugh emosbaugh marked this pull request as ready for review November 8, 2024 20:32
@emosbaugh emosbaugh requested review from a team as code owners November 8, 2024 20:32
@emosbaugh emosbaugh requested review from twz123 and jnummelin November 8, 2024 20:32
@emosbaugh (Contributor Author):
@twz123 feedback addressed. Can you please take another look? Thanks!

twz123
twz123 previously approved these changes Nov 12, 2024

@emosbaugh emosbaugh added the backport/release-1.29, backport/release-1.30, and backport/release-1.31 labels (PR needs to be backported/cherry-picked to the corresponding release branch) on Nov 18, 2024
Commits, all signed off by Ethan Mosbaugh <ethan@replicated.com>; one of them reverts commit 9706878.
@emosbaugh emosbaugh force-pushed the issue-5149-etcd-creates-new-cluster-rather-than-join-if-sync-fails branch from 75ffbf7 to e6d4e22 on December 17, 2024 17:33
makhov
makhov previously approved these changes Dec 20, 2024
@makhov makhov self-requested a review December 20, 2024 09:42
@jnummelin jnummelin dismissed makhov’s stale review December 20, 2024 09:43

There might still be some corner case that this doesn't cover.

@juanluisvaladas juanluisvaladas merged commit 3fdb4c5 into k0sproject:main Dec 23, 2024
89 checks passed
k0s-bot commented Dec 23, 2024:

Backport failed for release-1.29, because it was unable to cherry-pick the commit(s).

Please cherry-pick the changes locally and resolve any conflicts.

```sh
git fetch origin release-1.29
git worktree add -d .worktree/backport-5151-to-release-1.29 origin/release-1.29
cd .worktree/backport-5151-to-release-1.29
git switch --create backport-5151-to-release-1.29
git cherry-pick -x 4edfec62364dabe19f285ff6c8f4100c4216303b 808d811f545244c332af05286dddab292f5365c7 9ccb6f5997b4ab8697ce74c962d0a54ccc0bb900 917d17b052aa34c12b80911ff323a3871a5f2c3a 8f267f47217974a82789e137a9e271069febaaa1 1037204f9e13a544f9b363c6925c4b0702c3a450 e6d4e2258c89363bb7cd6e8cc0d4a81a6beaefd8
```

k0s-bot commented Dec 23, 2024:

Backport failed for release-1.30, because it was unable to cherry-pick the commit(s).

Please cherry-pick the changes locally and resolve any conflicts.

```sh
git fetch origin release-1.30
git worktree add -d .worktree/backport-5151-to-release-1.30 origin/release-1.30
cd .worktree/backport-5151-to-release-1.30
git switch --create backport-5151-to-release-1.30
git cherry-pick -x 4edfec62364dabe19f285ff6c8f4100c4216303b 808d811f545244c332af05286dddab292f5365c7 9ccb6f5997b4ab8697ce74c962d0a54ccc0bb900 917d17b052aa34c12b80911ff323a3871a5f2c3a 8f267f47217974a82789e137a9e271069febaaa1 1037204f9e13a544f9b363c6925c4b0702c3a450 e6d4e2258c89363bb7cd6e8cc0d4a81a6beaefd8
```

k0s-bot commented Dec 23, 2024

makhov added three commits to makhov/k0s that referenced this pull request on Dec 23, 2024, all signed off by Alexey Makhov <amakhov@mirantis.com>.
makhov added a commit that referenced this pull request Dec 24, 2024
[Backport release-1.29] Backport etcd join workflow fix from #5151
makhov added a commit that referenced this pull request Dec 24, 2024
[Backport release-1.30] Backport etcd join workflow fix from #5151
Labels
backport/release-1.29, backport/release-1.30, backport/release-1.31 (PR needs to be backported/cherry-picked to the corresponding release branch)
6 participants