Graduate "Forensic Container Checkpointing" to Beta
As defined in the existing KEP the steps to graduate from Alpha to Beta
are

   At least one container engine has to have implemented the
   corresponding CRI APIs to introduce e2e test for checkpointing.

   - [ ] Enable the feature per default
   - [ ] No major bugs reported in the previous cycle

CRI-O implemented the corresponding CRI RPC and no major bugs
have been reported since the initial release in 1.25.

Signed-off-by: Adrian Reber <areber@redhat.com>
adrianreber committed Feb 6, 2024
1 parent 12cc497 commit 7239c99
Showing 3 changed files with 189 additions and 11 deletions.
2 changes: 2 additions & 0 deletions keps/prod-readiness/sig-node/2008.yaml
@@ -1,3 +1,5 @@
kep-number: 2008
alpha:
  approver: "@ehashman"
beta:
  approver: "@deads2k"
190 changes: 183 additions & 7 deletions keps/sig-node/2008-forensic-container-checkpointing/README.md
@@ -25,8 +25,11 @@
- [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy)
- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire)
- [Feature Enablement and Rollback](#feature-enablement-and-rollback)
- [Rollout, Upgrade and Rollback Planning](#rollout-upgrade-and-rollback-planning)
- [Monitoring Requirements](#monitoring-requirements)
- [Dependencies](#dependencies)
- [Scalability](#scalability)
- [Troubleshooting](#troubleshooting)
- [Implementation History](#implementation-history)
- [Drawbacks](#drawbacks)
- [Alternatives](#alternatives)
@@ -125,6 +128,10 @@ message CheckpointContainerRequest {
string container_id = 1;
// Location of the checkpoint archive used for export/import
string location = 2;
// Timeout in seconds for the checkpoint to complete.
// Timeout of zero means to use the CRI default.
// Timeout > 0 means to use the user specified timeout.
int64 timeout = 3;
}
message CheckpointContainerResponse {}
@@ -146,6 +153,16 @@ In its first implementation the risks are low as it tries to be a CRI API
change with minimal changes to the kubelet and it is gated by the feature
gate `ContainerCheckpoint`.

One possible risk that was identified during Alpha is that the disk of
the node requesting the checkpoints could fill up if too many checkpoints
are created. One approach to solve this was some kind of garbage collection
of checkpoint archives. A pull request to implement garbage collection
was opened ([#115888](https://github.com/kubernetes/kubernetes/pull/115888))
but during review it became clear that the kubelet might not be the right
place to implement checkpoint archive garbage collection and the pull request
was closed again. Currently the most likely solution seems to be to implement
the garbage collection in an operator.
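
To make the garbage collection idea concrete, here is a minimal sketch of one possible retention policy: keep only the newest `keep` checkpoint archives and report the rest as candidates for deletion. The policy, the function name `archives_to_delete`, and the `(path, mtime)` input shape are assumptions for illustration; they do not describe how the closed pull request or any operator actually works.

```python
def archives_to_delete(archives, keep):
    """Return the paths of all archives except the `keep` newest ones.

    `archives` is an iterable of (path, mtime) tuples. This is a
    hypothetical retention policy, not the behaviour of any existing tool.
    """
    if keep < 0:
        raise ValueError("keep must be zero or positive")
    # Sort newest first so the first `keep` entries are the ones retained.
    newest_first = sorted(archives, key=lambda item: item[1], reverse=True)
    return [path for path, _mtime in newest_first[keep:]]
```

A caller would collect the `(path, mtime)` pairs from the checkpoint directory and remove the returned paths; this keeps the selection logic separate from the file system access so it can be tested on its own.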

## Design Details

The feature gate `ContainerCheckpoint` will ensure that the API
@@ -244,21 +261,41 @@ We expect no non-infra related flakes in the last month as a GA graduation crite
Once CRI implementation provide the relevant RPC calls
the e2e tests will not fail but need to be extended.

- Since the initial Alpha release, CRI-O supports the
`CheckpointContainer` CRI RPC, and the tests have been
enhanced to support CRI implementations that implement
the `CheckpointContainer` CRI RPC.

- After Kubernetes was released with the `CheckpointContainer` CRI RPC,
CRI-O was updated to support the new CRI RPC.
The tests have been enhanced to work with CRI implementations
that support the `CheckpointContainer` CRI RPC as well as
CRI implementations that do not support it. The tests also handle
both states of the corresponding feature gate, enabled and disabled:
<https://github.com/kubernetes/kubernetes/blob/master/test/e2e_node/checkpoint_container.go>

### Graduation Criteria

#### Alpha

-- [ ] Implement the new feature gate and kubelet implementation
-- [ ] Ensure proper tests are in place
-- [ ] Update documentation to make the feature visible
+- [X] Implement the new feature gate and kubelet implementation
+- [X] Ensure proper tests are in place
+- [X] Update documentation to make the feature visible
+  - <https://kubernetes.io/docs/reference/node/kubelet-checkpoint-api/>
+  - <https://kubernetes.io/blog/2022/12/05/forensic-container-checkpointing-alpha/>
+  - <https://kubernetes.io/blog/2023/03/10/forensic-container-analysis/>

#### Alpha to Beta Graduation

-At least one container engine has to have implemented the
-corresponding CRI APIs to introduce e2e test for checkpointing.
+CRI-O as well as containerd have to have implemented the corresponding CRI APIs:

- [x] CRI-O
- [ ] containerd (<https://github.com/containerd/containerd/pull/6965>)

In Kubernetes:

- [ ] Enable the feature per default
-- [ ] No major bugs reported in the previous cycle
+- [x] No major bugs reported in the previous cycle

#### Beta to GA Graduation

@@ -292,14 +329,94 @@ Checkpointing containers will be possible again.

###### Are there any tests for feature enablement/disablement?

-Currently no.
+Currently the test will automatically be skipped if the feature is not enabled.

### Rollout, Upgrade and Rollback Planning

Does not apply as the feature is an additional API endpoint with no
dependencies on other functionality. If it is not enabled via the feature
gate, the endpoint does not exist and requests return `404 page not found`.
If the feature gate is enabled but the underlying container engine does not
support checkpointing, a `500` is returned with an error message from the
container engine. No planning is necessary.

Documented at <https://kubernetes.io/docs/reference/node/kubelet-checkpoint-api/>
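
As a hedged illustration of how a client would address this endpoint, the sketch below builds the request URL for an HTTP `POST`. The path layout `/checkpoint/{namespace}/{pod}/{container}` and the `timeout` query parameter follow the kubelet checkpoint API documentation linked above; the host, the port `10250`, the function name `checkpoint_url`, and the rejection of negative timeouts are assumptions made for this example.

```python
from urllib.parse import quote, urlencode


def checkpoint_url(namespace, pod, container,
                   host="localhost", port=10250, timeout=0):
    """Build the kubelet checkpoint URL (to be sent as an HTTP POST).

    host and port are illustrative defaults, not values from this KEP.
    """
    if timeout < 0:
        # Assumption: negative timeouts are treated as a caller error.
        raise ValueError("timeout must be zero or positive")
    parts = (namespace, pod, container)
    path = "/checkpoint/" + "/".join(quote(part, safe="") for part in parts)
    url = f"https://{host}:{port}{path}"
    # A timeout of zero means "use the CRI default", so it is omitted.
    if timeout > 0:
        url += "?" + urlencode({"timeout": timeout})
    return url
```

Omitting the parameter for a zero timeout mirrors the CRI message semantics, where zero selects the CRI default.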
<!--
This section must be completed when targeting beta to a release.
-->

###### How can a rollout or rollback fail? Can it impact already running workloads?

At this point it is still a kubelet only API endpoint and has no dependencies
on other components.

###### What specific metrics should inform a rollback?

The only metric is the return code from the API endpoint.

###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?

No, this does not seem to apply for this feature.

###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?

No.

### Monitoring Requirements

Querying the state of the feature gate shows whether the API endpoint
will return `404`.

<!--
This section must be completed when targeting beta to a release.
For GA, this section is required: approvers should be able to confirm the
previous answers based on experience in the field.
-->

###### How can an operator determine if the feature is in use by workloads?

As it is not exposed in the Kubernetes API it cannot be determined. This is
only visible in the kubelet.

<!--
Ideally, this should be a metric. Operations against the Kubernetes API (e.g.,
checking if there are objects with field X set) may be a last resort. Avoid
logs or events for this purpose.
-->

###### How can someone using this feature know that it is working for their instance?

The kubelet API endpoint can return the following codes:

- 200: checkpoint archive was successfully created
- 404: feature is not enabled
- 500: underlying container engine does not support checkpointing containers

Documented at <https://kubernetes.io/docs/reference/node/kubelet-checkpoint-api/>
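
A client could translate these codes into messages along the following lines. This is a minimal sketch: the function name `describe_checkpoint_status` and the fallback message are assumptions, while the three mapped codes are the ones listed above.

```python
def describe_checkpoint_status(status_code):
    """Map a kubelet checkpoint response code to its documented meaning."""
    meanings = {
        200: "checkpoint archive was successfully created",
        404: "feature is not enabled",
        500: "underlying container engine does not support checkpointing containers",
    }
    # Codes outside the documented list get a generic, hypothetical fallback.
    return meanings.get(status_code, "status code not covered by the documented list")
```

For example, `describe_checkpoint_status(404)` tells the caller to check the `ContainerCheckpoint` feature gate before looking at the container engine.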

###### What are the reasonable SLOs (Service Level Objectives) for the enhancement?

Does not apply as the enhancement will only be called when requested. Not a service.

###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?

Does not apply as the enhancement will only be called when requested. Not a service.

###### Are there any missing metrics that would be useful to have to improve observability of this feature?

There are no metrics.

### Dependencies

CRIU needs to be installed on the node, but on most distributions it is already
a dependency of runc/crun. It does not require any specific services on the
cluster.

###### Does this feature depend on any specific services running in the cluster?

No. The container engine, however, must support the checkpoint CRI API call.

### Scalability

###### Will enabling / using this feature result in any new API calls?
@@ -334,6 +451,64 @@ Disk usage will overall increase by the used memory of the container and the cha
Checkpoint archive written to disk can optionally be compressed. The current implementation
does not compress the checkpoint archive on disk.

To avoid running out of disk space an operator has been introduced: <https://github.com/checkpoint-restore/checkpoint-restore-operator>

###### Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?

During checkpointing each memory page will be written to disk. Disk usage will increase by
the size of all memory pages in the checkpointed container. Each file in the container that
has been changed compared to the original version will also be part of the checkpoint.
Disk usage will overall increase by the used memory of the container and the changed files.
Checkpoint archive written to disk can optionally be compressed. The current implementation
does not compress the checkpoint archive on disk.

To avoid running out of disk space an operator has been introduced: <https://github.com/checkpoint-restore/checkpoint-restore-operator>
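
The disk usage described above can be approximated before requesting a checkpoint. The sketch below only restates the arithmetic from the text (used memory plus changed files, uncompressed); the function names and the idea of comparing against free disk space are assumptions for illustration, and real archives may differ in size because of metadata.

```python
def estimated_checkpoint_bytes(memory_bytes, changed_file_bytes):
    """Rough uncompressed archive size: memory pages plus changed files."""
    return memory_bytes + changed_file_bytes


def fits_on_disk(memory_bytes, changed_file_bytes, free_bytes):
    """Return True if the estimated archive would fit into free_bytes."""
    return estimated_checkpoint_bytes(memory_bytes, changed_file_bytes) <= free_bytes
```

For a container using 512 MiB of memory with 20 MiB of changed files, the estimate is 532 MiB per checkpoint, so repeated checkpoints of the same container add up quickly.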

### Troubleshooting

<!--
This section must be completed when targeting beta to a release.
For GA, this section is required: approvers should be able to confirm the
previous answers based on experience in the field.
The Troubleshooting section currently serves the `Playbook` role. We may consider
splitting it into a dedicated `Playbook` document (potentially with some monitoring
details). For now, we leave it here.
-->

###### How does this feature react if the API server and/or etcd is unavailable?

The feature does not care if the API server and/or etcd is unavailable.

###### What are other known failure modes?

- The creation of the checkpoint archive can fail.
  - Detection: See https://kubernetes.io/docs/reference/node/kubelet-checkpoint-api/
  - Mitigation: Do not checkpoint a container that cannot be checkpointed by CRIU.
  - Diagnostics: The container engine will provide the location of the log file created
    by CRIU with more details.
  - Testing: Tests currently cover whether checkpointing is enabled in the kubelet
    and whether the underlying container engine supports the
    corresponding CRI API calls. The most common checkpointing failure occurs when the
    container uses an external hardware device such as a GPU or InfiniBand, which
    usually does not exist in test systems.

<!--
For each of them, fill in the following information by copying the below template:
- [Failure mode brief description]
- Detection: How can it be detected via metrics? Stated another way:
how can an operator troubleshoot without logging into a master or worker node?
- Mitigations: What can be done to stop the bleeding, especially for already
running user workloads?
- Diagnostics: What are the useful log messages and their required logging
levels that could help debug the issue?
Not required until feature graduated to beta.
- Testing: Are there any tests for failure mode? If not, describe why.
-->

###### What steps should be taken if SLOs are not being met to determine the problem?

## Implementation History

* 2020-09-16: Initial version of this KEP
@@ -350,6 +525,7 @@ does not compress the checkpoint archive on disk.
* 2022-01-20: Reworked based on review and renamed feature gate to `ContainerCheckpoint`
* 2022-04-05: Added CRI API section and targeted 1.25
* 2022-05-17: Remove *restore* RPC from the CRI API
* 2023-10-09: Beta graduation in 1.30

## Drawbacks

8 changes: 4 additions & 4 deletions keps/sig-node/2008-forensic-container-checkpointing/kep.yaml
@@ -15,18 +15,18 @@ approvers:
- "@dchen1107"

# The target maturity stage in the current dev cycle for this KEP.
-stage: alpha
+stage: beta

# The most recent milestone for which work toward delivery of this KEP has been
# done. This can be the current (upcoming) milestone, if it is being actively
# worked on.
-latest-milestone: "v1.25"
+latest-milestone: "v1.30"

# The milestone at which this feature was, or is targeted to be, at each stage.
milestone:
  alpha: "v1.25"
-  beta: "v1.26"
-  stable: "v1.28"
+  beta: "v1.30"
+  stable: "v1.33"

# The following PRR answers are required at alpha release
# List the feature gate name and the components for which it must be enabled