fixup! K8s cluster robustness features (#414)
Signed-off-by: Hannes Baum <hannes.baum@cloudandheat.com>
cah-hbaum committed Nov 20, 2023
1 parent fc3a21b commit 1c7e61a
Showing 1 changed file with 5 additions and 51 deletions.
56 changes: 5 additions & 51 deletions Standards/scs-0215-v1-robustness-features.md
@@ -77,7 +77,7 @@ different priority levels and rate limit maximums.
The concept documentation offers a more in-depth explanation of the feature:
[Flow Control](https://kubernetes.io/docs/concepts/cluster-administration/flow-control/)

-### etcd compaction/defragmentation
+### etcd maintenance

etcd is a strongly consistent, distributed key-value store that provides a reliable way to store data that needs to be
accessed by a distributed system or cluster of machines. For these reasons, etcd was chosen as the default database
@@ -89,52 +89,13 @@ gives back disk space to the underlying file system and can help bring the clust
ran out of space earlier.

This can be achieved by providing the necessary flags/parameters to etcd, either via the KubeadmControlPlane or in the
-configuration file of the etcd cluster, if it is managed independent from the Kubernetes cluster.
+configuration file of the etcd cluster, if it is managed independent of the Kubernetes cluster.
Possible flags, that can be set for this feature, are:

* auto-compaction-mode
* auto-compaction-retention

-etcd cluster defragmentation unfortunately can't be done automatically. Instead the user would need to manually call
-the defrag command on the cluster. In order to mitigate this, a systemd (or similar) job could be created, which
-periodically calls the defragmentation procedure. Unfortunately, simultaneous defragmentation of all members of an etcd
-cluster would block read and write procedures. A preferable strategy to mitigate this would be the following:
-
-* defragment the non leader etcd members first
-* change the leadership to the randomly selected and defragmentation completed etcd member
-* defragment the local (ex-leader) etcd member
-
-This example was taken from the [Maintenance and Troubleshooting page](https://github.com/SovereignCloudStack/k8s-cluster-api-provider/blob/main/doc/Maintenance_and_Troubleshooting.md#defragmentation-and-backup)
-page of the SCS documentation, which was derived in part from the [OpenShift Host Practices](https://docs.openshift.com/container-platform/4.9/scalability_and_performance/recommended-host-practices.html#automatic-defrag-etcd-data_recommended-host-practices).
-
-An example for a defragmentation job, e.g. as a systemd service, and its helpers could be the following:
-
-```bash
-[Unit]
-Description=Run etcdctl defrag
-Documentation=https://etcd.io/docs/v3.3.12/op-guide/maintenance/#defragmentation
-After=network.target
-[Service]
-Type=oneshot
-Environment="LOG_DIR=/var/log"
-Environment="ETCDCTL_API=3"
-ExecStart=/usr/local/sbin/etcdctl --cert /etc/kubernetes/pki/etcd/peer.crt --key /etc/kubernetes/pki/etcd/peer.key --cacert /etc/kubernetes/pki/etcd/ca.crt defrag
-[Install]
-WantedBy=multi-user.target
-```
-
-```bash
-[Unit]
-Description=Run etcd-defrag.service every day
-After=network.target
-[Timer]
-OnCalendar=*-*-* 02:00:0
-RandomizedDelaySec=10m
-[Install]
-WantedBy=multi-user.target
-```
-
-More information about compaction and defragmentation can be found in the respective etcd documentation
+More information about compaction can be found in the respective etcd documentation
[etcd maintenance](https://etcd.io/docs/v3.3/op-guide/maintenance/)

### etcd backup
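The removed lines describe a rolling strategy: defragment the non-leader members, move leadership to one of them, then defragment the ex-leader. That strategy could be scripted roughly as follows. This is a minimal POSIX `sh` sketch, not the SCS or OpenShift script. It assumes the etcd v3.4/v3.5 `-w simple` column layout of `etcdctl endpoint status` (endpoint, hex member ID, version, DB size, leader flag, ...). It also reuses the kubeadm certificate paths from the removed systemd unit. The function names `etcd_cmd`, `parse_status` and `rolling_defrag` are hypothetical.

```bash
#!/bin/sh
# Hypothetical sketch of the rolling defragmentation order:
#   1. defragment the non-leader members one at a time,
#   2. move leadership to a randomly chosen, already defragmented member,
#   3. defragment the ex-leader last.
# Assumes etcd v3.4/v3.5 "-w simple" output and kubeadm certificate paths.
set -eu

ETCDCTL="${ETCDCTL:-etcdctl}"   # override to use another etcdctl binary
PKI=/etc/kubernetes/pki/etcd    # kubeadm default, as in the removed unit file

# Run etcdctl with the TLS flags prepended.
etcd_cmd() {
  "$ETCDCTL" --cacert "$PKI/ca.crt" --cert "$PKI/peer.crt" \
    --key "$PKI/peer.key" "$@"
}

# Turn `endpoint status -w simple` rows into "<endpoint> <hex-id> <is-leader>".
# Assumed column order: endpoint, ID, version, db size, is leader, ...
parse_status() {
  awk -F', ' 'NF >= 5 { print $1, $2, $5 }'
}

# $1: any client URL of the cluster; --cluster expands it to all members.
rolling_defrag() {
  status=$(etcd_cmd --endpoints="$1" endpoint status --cluster -w simple |
    parse_status)
  leader=$(printf '%s\n' "$status" | awk '$3 == "true" { print $1; exit }')
  if [ -z "$leader" ]; then
    echo "no leader found" >&2
    return 1
  fi
  followers=$(printf '%s\n' "$status" | awk '$3 == "false" { print $1 }')

  # Step 1: one follower at a time, so the other members keep serving.
  for ep in $followers; do
    etcd_cmd --endpoints="$ep" defrag
  done

  # Step 2: hand leadership to a random (already defragmented) follower.
  # move-leader expects the hex member ID and must reach the current leader.
  target=$(printf '%s\n' "$status" | awk -v seed="$$" '
    $3 == "false" { ids[n++] = $2 }
    END { if (n > 0) { srand(seed); print ids[int(rand() * n)] } }')
  if [ -n "$target" ]; then
    etcd_cmd --endpoints="$leader" move-leader "$target"
  fi

  # Step 3: the ex-leader is now a follower and can be defragmented.
  etcd_cmd --endpoints="$leader" defrag
}
```

On a control plane node this might be invoked as `rolling_defrag https://127.0.0.1:2379`, for example from the `ExecStart=` of a oneshot service like the one in the removed lines. Only one member is defragmented at a time, which avoids blocking the whole cluster at once. When the function is invoked directly, `set -e` stops the run if `move-leader` fails, so the leader itself is never defragmented in that case.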
@@ -227,7 +188,7 @@ It is also RECOMMENDED to activate the Kubernetes API priority and fairness feat
which also uses the aforementioned cluster parameters to better queue, schedule and
prioritize incoming requests.
-### etcd compaction/defragmentation
+### etcd compaction
etcd needs to be cleaned up regularly, so that it functions correctly and doesn't take
up too much space, which happens because of its increase of the keyspace.
@@ -237,13 +198,6 @@ To compact the etcd keyspace, the following flags/parameters MUST be set for etc
* auto-compaction-mode = periodic
* auto-compaction-retention = 8h
-OPTIONALLY, a cluster defragmentation can be carried out regularly.
-To do this, it is RECOMMENDED to create a systemd (or similar automatic job) in order
-to execute this defragmentation regularly in a fixed timeframe.
-An example for such a systemd job can be found in the chapter [Design Considerations].
-It is important to note, that such a defragmentation could lead to service interruptions.
-Therefore, such a process should at best be carried during times of low traffic in order
-to not disrupt normal workflow.
### etcd backup
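After this commit, `auto-compaction-mode = periodic` and `auto-compaction-retention = 8h` remain MUST requirements. On a kubeadm-managed control plane node they could be spot-checked with a small helper. The sketch below is hypothetical, including the function name `has_required_compaction_flags`. It assumes the flags appear as unquoted `--flag=value` lines in the default kubeadm static Pod manifest `/etc/kubernetes/manifests/etcd.yaml`. An etcd cluster managed independently of Kubernetes would need a different path or check.

```bash
#!/bin/sh
# Hypothetical check for the compaction flags required by the standard.
# Assumes the kubeadm static Pod manifest lists them as "--flag=value" lines.
set -eu

has_required_compaction_flags() {
  manifest="${1:-/etc/kubernetes/manifests/etcd.yaml}"
  # -e keeps grep from reading the leading "--" patterns as options;
  # the end-of-line anchor rejects values such as "8h30m".
  grep -q -e '--auto-compaction-mode=periodic[[:space:]]*$' "$manifest" &&
    grep -q -e '--auto-compaction-retention=8h[[:space:]]*$' "$manifest"
}
```

For example, `has_required_compaction_flags && echo "compaction flags set"` returns success only when both settings are present with the required values.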
@@ -294,4 +248,4 @@ for this, since it is dependent on the CA.

## Conformance Tests

-Conformance Tests, OPTIONAL
+*Conformance Tests, OPTIONAL*
