fix: join node creates new cluster when initial etcd sync config fails #5151
Conversation
Force-pushed from f40047f to 9706878 (compare)
@jnummelin @twz123 I'm a bit stuck at this point with how to proceed with handling this case. Could you take another look at this PR and give me some guidance? Thank you!
Looks simple enough! However, I'd leave out 9706878 for now, since k0s will retry all join errors no matter what caused them. So returning a 503 instead of a 500 is probably not really worth it?
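For reference, the 503-vs-500 distinction being discussed would look roughly like this in a generic net/http handler. This is an illustrative sketch only; the package, handler, helper, and sentinel error names are hypothetical and this is not the reverted commit's actual code.

package joinapi // illustrative package, not k0s's real join API

import (
	"context"
	"errors"
	"net/http"
)

// errEtcdNotReady is a hypothetical sentinel for a transient config sync failure.
var errEtcdNotReady = errors.New("etcd config sync not ready")

// syncEtcdConfig stands in for the real sync; it always fails in this sketch.
func syncEtcdConfig(ctx context.Context) error { return errEtcdNotReady }

// handleEtcdJoin answers a join request, mapping a transient sync failure to
// 503 (retryable) and anything else to 500, even though k0s clients retry
// join errors regardless of the status code.
func handleEtcdJoin(w http.ResponseWriter, r *http.Request) {
	if err := syncEtcdConfig(r.Context()); err != nil {
		if errors.Is(err, errEtcdNotReady) {
			http.Error(w, err.Error(), http.StatusServiceUnavailable) // 503
			return
		}
		http.Error(w, err.Error(), http.StatusInternalServerError) // 500
		return
	}
	w.WriteHeader(http.StatusOK)
}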
@@ -107,6 +107,8 @@ func (e *Etcd) syncEtcdConfig(ctx context.Context, etcdRequest v1beta1.EtcdReque
		etcdResponse, err = e.JoinClient.JoinEtcd(ctx, etcdRequest)
		return err
	},
	retry.Delay(1*time.Second),
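For context, a sketch of how a retry.Do call around the etcd join might be configured with such options, following the avast/retry-go API. The import path/version, the helper name, and the concrete option values are assumptions for illustration, not the PR's final settings.

package etcdjoin // illustrative package name

import (
	"context"
	"time"

	retry "github.com/avast/retry-go/v4" // assumed import path/version
)

// joinWithRetry wraps a join attempt with retry-go's exponential backoff.
// The option values below are placeholders, not the values ultimately merged.
func joinWithRetry(ctx context.Context, join func(context.Context) error) error {
	return retry.Do(
		func() error { return join(ctx) },
		retry.Context(ctx),             // stop retrying once the context is cancelled
		retry.Delay(1*time.Second),     // base delay; the backoff grows from here
		retry.MaxDelay(64*time.Second), // cap the per-attempt delay
		retry.Attempts(20),             // give up after this many attempts
		retry.LastErrorOnly(true),      // surface only the final error
	)
}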
Can you explain why the delay was increased in the commit message, or in a comment? If I'm not mistaken this will now block for ~ 17 minutes, whereas it was blocking only around 100 secs before?
done
I did the calculation again, as I forgot to take into account the max delay. I think it's 5 minutes overall, not 15 ... 👼
I increased the number of attempts to 20, so I think it's 127 seconds plus roughly 12 more minutes.
Okay. Is it likely that joining will succeed after 15 minutes when it didn't after 5? I'm just wondering whether it makes sense to wait that long, as surrounding tooling may have its own, shorter timeouts. How long does k0sctl wait until it aborts the join process? /cc @kke
There is some discussion here related to why the timeout was increased.
Looks like k0sctl waits 2 minutes, but it doesn't hit the issue we are facing because it joins nodes sequentially.
So what is the actual timeout now with this implementation? I don't think it makes sense to have it at more than, say, 5 minutes; if joining won't work within the first 5 minutes, what would make it work after that?
@jnummelin I've decreased it to ~4 minutes.
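For reference, a small self-contained sketch of the timing arithmetic discussed in this thread, assuming an exponential backoff that doubles from the base delay and is capped at a maximum per-attempt delay; the attempt counts and the 64-second cap below are illustrative, not the PR's exact settings.

package main

import (
	"fmt"
	"time"
)

// totalBackoff sums the waits between attempts for an exponential backoff
// that starts at base, doubles each attempt, and is capped at maxDelay.
func totalBackoff(attempts int, base, maxDelay time.Duration) time.Duration {
	var total time.Duration
	d := base
	for i := 0; i < attempts-1; i++ { // no wait after the final attempt
		if d > maxDelay {
			d = maxDelay
		}
		total += d
		d *= 2
	}
	return total
}

func main() {
	// 8 attempts, 1s base, 64s cap: 1+2+4+8+16+32+64 = 127s (~2 minutes)
	fmt.Println(totalBackoff(8, time.Second, 64*time.Second))
	// 20 attempts with the same cap: 127s + 12*64s ≈ 15 minutes
	fmt.Println(totalBackoff(20, time.Second, 64*time.Second))
}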
Force-pushed from 6f772c3 to 7429cb8 (compare)
@twz123 Feedback addressed. Can you please take another look? Thanks!
This reverts commit 9706878. Signed-off-by: Ethan Mosbaugh <ethan@replicated.com>
Force-pushed from 75ffbf7 to e6d4e22 (compare)
There might still be some corner case that this doesn't cover.
Backport failed for release-1.29. Please cherry-pick the changes locally and resolve any conflicts:
git fetch origin release-1.29
git worktree add -d .worktree/backport-5151-to-release-1.29 origin/release-1.29
cd .worktree/backport-5151-to-release-1.29
git switch --create backport-5151-to-release-1.29
git cherry-pick -x 4edfec62364dabe19f285ff6c8f4100c4216303b 808d811f545244c332af05286dddab292f5365c7 9ccb6f5997b4ab8697ce74c962d0a54ccc0bb900 917d17b052aa34c12b80911ff323a3871a5f2c3a 8f267f47217974a82789e137a9e271069febaaa1 1037204f9e13a544f9b363c6925c4b0702c3a450 e6d4e2258c89363bb7cd6e8cc0d4a81a6beaefd8
Backport failed for release-1.30. Please cherry-pick the changes locally and resolve any conflicts:
git fetch origin release-1.30
git worktree add -d .worktree/backport-5151-to-release-1.30 origin/release-1.30
cd .worktree/backport-5151-to-release-1.30
git switch --create backport-5151-to-release-1.30
git cherry-pick -x 4edfec62364dabe19f285ff6c8f4100c4216303b 808d811f545244c332af05286dddab292f5365c7 9ccb6f5997b4ab8697ce74c962d0a54ccc0bb900 917d17b052aa34c12b80911ff323a3871a5f2c3a 8f267f47217974a82789e137a9e271069febaaa1 1037204f9e13a544f9b363c6925c4b0702c3a450 e6d4e2258c89363bb7cd6e8cc0d4a81a6beaefd8
Successfully created backport PR.
Signed-off-by: Alexey Makhov <amakhov@mirantis.com>
[Backport release-1.29] Backport etcd join workflow fix from #5151
[Backport release-1.30] Backport etcd join workflow fix from #5151
Description
Fixes #5149
If the etcd join fails to sync the etcd config and the k0s process exits, the PKI CA files already exist on disk, so on the next start etcd creates a new cluster rather than joining the existing one. Rather than checking the PKI dir for embedded etcd, check that the etcd data directory exists, as we do here.
I am open to suggestions if I am checking the wrong thing, as I cannot test this and am taking a guess at a solution. A minimal sketch of the kind of check described above follows.
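This sketch assumes a helper that looks at etcd's data directory instead of the PKI directory; the package name, function name, and the "member" subdirectory are illustrative assumptions, not the PR's exact code.

package etcdjoin // illustrative package name

import (
	"os"
	"path/filepath"
)

// etcdDataDirInitialized reports whether the etcd data directory already
// contains member data; unlike the presence of PKI CA files, this only
// becomes true once etcd has actually been started on this node. The
// "member" subdirectory name is an assumption about etcd's on-disk layout.
func etcdDataDirInitialized(dataDir string) (bool, error) {
	info, err := os.Stat(filepath.Join(dataDir, "member"))
	switch {
	case os.IsNotExist(err):
		return false, nil
	case err != nil:
		return false, err
	default:
		return info.IsDir(), nil
	}
}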
Type of change
How Has This Been Tested?
Checklist: