fix: join node creates new cluster when initial etcd sync config fails #5151

12 changes: 8 additions & 4 deletions cmd/controller/controller.go
@@ -670,15 +670,19 @@ func (c *command) startWorker(ctx context.Context, profile string, nodeConfig *v
 	return wc.Start(ctx)
 }

-// If we've got CA in place we assume the node has already joined previously
+// If we've got an etcd data directory in place for embedded etcd, or a ca for
+// external or other storage types, we assume the node has already joined
+// previously.
 func (c *command) needToJoin(nodeConfig *v1beta1.ClusterConfig) bool {
+	if nodeConfig.Spec.Storage.Type == v1beta1.EtcdStorageType && !nodeConfig.Spec.Storage.Etcd.IsExternalClusterUsed() {
+		// Use the main etcd data directory as the source of truth to determine if this node has already joined
+		// See https://etcd.io/docs/v3.5/learning/persistent-storage-files/#bbolt-btree-membersnapdb
+		return !file.Exists(filepath.Join(c.K0sVars.EtcdDataDir, "member", "snap", "db"))
+	}
 	if file.Exists(filepath.Join(c.K0sVars.CertRootDir, "ca.key")) &&
 		file.Exists(filepath.Join(c.K0sVars.CertRootDir, "ca.crt")) {
 		return false
 	}
-	if nodeConfig.Spec.Storage.Type == v1beta1.EtcdStorageType && !nodeConfig.Spec.Storage.Etcd.IsExternalClusterUsed() {
-		return !file.Exists(filepath.Join(c.K0sVars.EtcdDataDir, "member", "snap", "db"))
-	}
 	return true
 }

7 changes: 7 additions & 0 deletions pkg/component/controller/etcd.go
@@ -107,6 +107,11 @@ func (e *Etcd) syncEtcdConfig(ctx context.Context, etcdRequest v1beta1.EtcdReque
 			etcdResponse, err = e.JoinClient.JoinEtcd(ctx, etcdRequest)
 			return err
 		},
+		// When joining multiple nodes in parallel, etcd can lose consensus and will return 500 responses
+		// Allow for more time to recover (~ 4 minutes = 0+1+2+4+8+16+32+60+60+60)
+		retry.Attempts(10),
+		retry.Delay(1*time.Second),

Member

Can you explain why the delay was increased in the commit message, or in a comment? If I'm not mistaken this will now block for ~ 17 minutes, whereas it was blocking only around 100 secs before?

Contributor Author

done

Member

I did the calculation again, as I forgot to take into account the max delay. I think it's 5 minutes overall, not 15 ... 👼

Contributor Author

I increased the number of attempts to 20. So I think it's 127 seconds plus like 12 minutes

Member

Okay. Will it be likely that joining will succeed after 15 minutes, when it didn't after 5 minutes? I'm just thinking if it makes sense to wait for that long, as surrounding tooling may have its own, shorter timeouts. How long does k0sctl wait until it aborts the join process? /cc @kke

Contributor Author

There is some discussion here related to why the timeout was increased.

Looks like k0sctl waits 2 minutes, but it does not hit the issue we are facing because it joins nodes sequentially.

Contributor Author

Ping @kke @twz123. I'm waiting on a response to the above ^

Member

So what is the actual timeout now with this implementation? I don't think it makes sense to have it be more than, say, 5 minutes. If joining doesn't work in the first 5 minutes, what would make it work after that?

Contributor Author

@jnummelin I've decreased it to ~4 minutes.

+		retry.MaxDelay(60*time.Second),
 		retry.Context(ctx),
 		retry.LastErrorOnly(true),
 		retry.OnRetry(func(attempt uint, err error) {
@@ -191,6 +196,8 @@ func (e *Etcd) Start(ctx context.Context) error {
 		"--enable-pprof": "false",
 	}

+	// Use the main etcd data directory as the source of truth to determine if this node has already joined
+	// See https://etcd.io/docs/v3.5/learning/persistent-storage-files/#bbolt-btree-membersnapdb
 	if file.Exists(filepath.Join(e.K0sVars.EtcdDataDir, "member", "snap", "db")) {
 		logrus.Warnf("etcd db file(s) already exist, not gonna run join process")
 	} else if e.JoinClient != nil {