diff --git a/keps/prod-readiness/sig-node/2008.yaml b/keps/prod-readiness/sig-node/2008.yaml index 247d06c16ce..f1e4256bf9d 100644 --- a/keps/prod-readiness/sig-node/2008.yaml +++ b/keps/prod-readiness/sig-node/2008.yaml @@ -1,3 +1,5 @@ kep-number: 2008 alpha: approver: "@ehashman" +beta: + approver: "@deads2k" diff --git a/keps/sig-node/2008-forensic-container-checkpointing/README.md b/keps/sig-node/2008-forensic-container-checkpointing/README.md index 723b3e5969a..3fa2cd01747 100644 --- a/keps/sig-node/2008-forensic-container-checkpointing/README.md +++ b/keps/sig-node/2008-forensic-container-checkpointing/README.md @@ -25,8 +25,11 @@ - [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy) - [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire) - [Feature Enablement and Rollback](#feature-enablement-and-rollback) + - [Rollout, Upgrade and Rollback Planning](#rollout-upgrade-and-rollback-planning) + - [Monitoring Requirements](#monitoring-requirements) - [Dependencies](#dependencies) - [Scalability](#scalability) + - [Troubleshooting](#troubleshooting) - [Implementation History](#implementation-history) - [Drawbacks](#drawbacks) - [Alternatives](#alternatives) @@ -37,14 +40,18 @@ Items marked with (R) are required *prior to targeting to a milestone / release*. 
- [ ] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR) -- [ ] (R) KEP approvers have approved the KEP status as `implementable` -- [ ] (R) Design details are appropriately documented -- [ ] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input +- [x] (R) KEP approvers have approved the KEP status as `implementable` +- [x] (R) Design details are appropriately documented +- [x] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors) + - [x] e2e Tests for all Beta API Operations (endpoints) + - [ ] (R) Ensure GA e2e tests meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md) + - [ ] (R) Minimum Two Week Window for GA e2e tests to prove flake free - [ ] (R) Graduation criteria is in place + - [ ] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md) - [ ] (R) Production readiness review completed -- [ ] Production readiness review approved -- [ ] "Implementation History" section is up-to-date for milestone -- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io] +- [ ] (R) Production readiness review approved +- [x] "Implementation History" section is up-to-date for milestone +- [x] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io] - [ ] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes [kubernetes.io]: https://kubernetes.io/ @@ -111,6 +118,52 @@ For the first implementation we do not want to support restore in the outside of Kubernetes. 
The restore is a container engine only operation in this first step. +A high level view on the implementation is that triggering the *kubelet* API +endpoint will trigger the `CheckpointContainer` CRI API endpoint to create a +checkpoint at the location defined by the *kubelet*. In the checkpoint request +the kubelet will specify the name of the checkpoint archive as +`checkpoint---.tar` and also request to +store the checkpoint archive in the `checkpoints` directory below its root +directory (as defined by `--root-dir`). This defaults to +`/var/lib/kubelet/checkpoints`. + +To trigger a checkpoint the following HTTP request has to be made against the *kubelet*: + +- `POST /checkpoint/{namespace}/{pod}/{container}` +- Parameters + - namespace (in path): string, required, Namespace + - pod (in path): string, required, Pod + - container (in path): string, required, Container + - timeout (in query): integer, Timeout in seconds to wait until the checkpoint + creation is finished. If zero or no timeout is specified the default CRI + timeout value will be used. Checkpoint creation time depends directly on the + used memory of the container. The more memory a container uses the more time + is required to create the corresponding checkpoint. +- Response + - 200: OK + - 401: Unauthorized + - 404: Not Found (if the ContainerCheckpoint feature gate is disabled) + - 404: Not Found (if the specified namespace, pod or container cannot be found) + - 500: Internal Server Error (if the CRI implementation encounters an error during checkpointing (see error message for further details)) + - 500: Internal Server Error (if the CRI implementation does not implement the checkpoint CRI API (see error message for further details)) + +The kubelet APIs are usually restricted to cluster admins and are only accessible +via `localhost`. Users will not have access to this for now. If, in the future, +the checkpoint API endpoint is moved out of the kubelet it can then have proper +RBAC.
This is something that cannot be provided as a *kubelet* API. Also see + + +To further secure the *kubelet* API endpoint there will be a kubelet auth endpoint +added to the checkpoint sub-resource. The goal is to allow administrators to +restrict the API endpoint and to ensure that users do not have access to the +endpoint via the Kubernetes API server proxy mode. + +Expected latency depends directly on the size of the used memory of the processes in +the container. The more memory is used the longer the operation will take. +The newly introduced CRI API includes a `timeout` parameter to automatically cancel +the request if it requires more time than requested. If the `timeout` parameter +is not specified the CRI default timeout from the *kubelet* is used (2 minutes). + #### CRI Updates The CRI API will be extended to introduce one new RPC: @@ -125,6 +178,10 @@ message CheckpointContainerRequest { string container_id = 1; // Location of the checkpoint archive used for export/import string location = 2; + // Timeout in seconds for the checkpoint to complete. + // Timeout of zero means to use the CRI default. + // Timeout > 0 means to use the user specified timeout. + int64 timeout = 3; } message CheckpointContainerResponse {} @@ -146,6 +203,23 @@ In its first implementation the risks are low as it tries to be a CRI API change with minimal changes to the kubelet and it is gated by the feature gate `ContainerCheckpoint`. +One possible risk that was identified during Alpha is that the disk of the node +requesting the checkpoints could fill up if too many checkpoints are created and +the node will be marked as not healthy. One approach to solve this was some kind +of garbage collection of checkpoint archives.
A pull request to implement +garbage collection was opened +([#115888](https://github.com/kubernetes/kubernetes/pull/115888)) but during +review it became clear that the kubelet might not be the right place to +implement checkpoint archive garbage collection and the pull request was closed +again. Currently the most likely solution seems to be to implement the garbage +collection in an operator. Garbage collection via an operator could +clean up the checkpoint directory and the node would stay healthy. +Currently manual cleanup might be required to ensure the node does not run +out of disk space and stays healthy, especially in situations where checkpoint +creation requests are triggered automatically and not manually. A service that +requests many checkpoints should make sure to remove the requested checkpoints as +soon as possible. + ## Design Details The feature gate `ContainerCheckpoint` will ensure that the API @@ -185,6 +259,16 @@ when drafting this test plan. existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement. +For Alpha to Beta graduation existing tests will be extended with the following tests: + +- [ ] The Alpha tests try to be clever and automatically skip the test if + disabled. The new tests should explicitly test the following situations without + automatic skipping to ensure we are not hiding potential errors. + - [ ] Test to ensure the feature does not work with the feature gate disabled. + - [ ] Test to ensure the feature does work if enabled. +- [ ] Test CRI metrics related to `CheckpointContainer` CRI RPC. +- [ ] Test kubelet metrics related to kubelet `checkpoint` API endpoint.
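The disk-usage risk described under the risks above (checkpoint archives accumulating in the `checkpoints` directory until the node runs out of space) could be mitigated by a small cleanup job while no operator is in place. The following is only a minimal sketch, assuming archives are plain `checkpoint-*.tar` files in a single directory and that keeping the N newest archives is an acceptable policy. The names `prune_checkpoints`, `keep`, and `demo` are hypothetical and are not part of the kubelet, CRI-O, or any operator.

```python
import os
import tempfile
from pathlib import Path


def prune_checkpoints(directory, keep=5):
    """Delete all but the `keep` most recently modified checkpoint archives.

    Hypothetical helper: returns the removed file names, sorted alphabetically.
    """
    if keep < 0:
        raise ValueError("keep must be >= 0")
    archives = sorted(
        Path(directory).glob("checkpoint-*.tar"),
        key=lambda p: p.stat().st_mtime,
        reverse=True,  # newest first
    )
    removed = []
    for archive in archives[keep:]:
        archive.unlink()
        removed.append(archive.name)
    return sorted(removed)


def demo():
    """Create three empty fake archives with distinct mtimes, keep the newest."""
    with tempfile.TemporaryDirectory() as tmp:
        names = ["checkpoint-a.tar", "checkpoint-b.tar", "checkpoint-c.tar"]
        for age, name in enumerate(names):
            path = Path(tmp) / name
            path.write_bytes(b"")
            # Set explicit timestamps so that a is oldest and c is newest.
            os.utime(path, (1_000 + age, 1_000 + age))
        removed = prune_checkpoints(tmp, keep=1)
        remaining = sorted(p.name for p in Path(tmp).iterdir())
        return removed, remaining
```

In a real deployment the directory would be the kubelet's `checkpoints` directory (by default `/var/lib/kubelet/checkpoints`), and a count-based policy is only one option; an age-based or size-based policy may fit better.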
+ ##### Prerequisite testing updates -- `pkg/kubelet`: 06-17-2022 - 64.5 -- `pkg/kubelet/container`: 06-17-2022 - 52.1 -- `pkg/kubelet/server`: 06-17-2022 - 64.3 -- `pkg/kubelet/cri/remote`: 06-17-2022 - 13.2 +- Test coverage before Alpha graduation + - `pkg/kubelet`: 06-17-2022 - 64.5 + - `pkg/kubelet/container`: 06-17-2022 - 52.1 + - `pkg/kubelet/server`: 06-17-2022 - 64.3 + - `pkg/kubelet/cri/remote`: 06-17-2022 - 13.2 +- Test coverage before Beta graduation + - `pkg/kubelet`: 02-08-2024 - 68.9 + - `pkg/kubelet/container`: 02-08-2024 - 55.7 + - `pkg/kubelet/server`: 02-08-2024 - 65.1 + - `pkg/kubelet/cri/remote`: 02-08-2024 - 18.9 ##### Integration tests @@ -244,25 +334,66 @@ We expect no non-infra related flakes in the last month as a GA graduation crite Once CRI implementation provide the relevant RPC calls the e2e tests will not fail but need to be extended. +- Since the initial Alpha release, CRI-O supports the + `CheckpointContainer` CRI RPC and tests have been + enhanced to support CRI implementations that implement + the `CheckpointContainer` CRI RPC + +- Once Kubernetes was released with the `CheckpointContainer` CRI RPC + CRI-O has been updated to support the new CRI RPC. + The tests have been enhanced to work with CRI implementations + that support the `CheckpointContainer` CRI RPC as well as + CRI implementations that do not support it. The tests also handle + if the corresponding feature gate is disabled or enabled: + + +- As the tests are hidden behind the feature gate `ContainerCheckpoint` during + Alpha phase and only available in combination with CRI-O, the tests + have been skipped automatically so far. With graduation to Beta the tests should appear in + CRI-O based setups. Because the current setup does not run CRI-O with all Alpha + features enabled, no results have been collected and tests have + been skipped.
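The tests above have to tell apart a disabled feature gate, a missing target, and a container engine without checkpoint support. As an illustration only, the request path and response codes listed in this KEP can be captured in a few lines of Python. This is a sketch that performs no HTTP call; the function names `checkpoint_path` and `interpret_status` are hypothetical, and omitting the `timeout` query parameter when it is zero is an assumption based on the statement that zero and "not specified" both select the CRI default.

```python
from urllib.parse import quote, urlencode

# Meanings of the response codes as listed in the KEP text.
STATUS_MEANINGS = {
    200: "checkpoint archive created",
    401: "unauthorized",
    404: "feature gate disabled, or namespace/pod/container not found",
    500: "CRI implementation failed or does not implement the checkpoint RPC",
}


def checkpoint_path(namespace, pod, container, timeout=0):
    """Build the path for POST /checkpoint/{namespace}/{pod}/{container}.

    A timeout of zero is left out of the query string, which the KEP
    describes as "use the default CRI timeout".
    """
    if timeout < 0:
        raise ValueError("timeout must not be negative")
    parts = (namespace, pod, container)
    path = "/checkpoint/" + "/".join(quote(part, safe="") for part in parts)
    if timeout > 0:
        path += "?" + urlencode({"timeout": timeout})
    return path


def interpret_status(code):
    """Map a kubelet response code to the meaning documented in the KEP."""
    return STATUS_MEANINGS.get(code, f"unexpected status {code}")
```

Because `404` covers both a disabled feature gate and an unknown namespace, pod or container, and `500` covers both a runtime error and a missing CRI implementation, a caller still has to read the error message to distinguish the cases.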
+ ### Graduation Criteria #### Alpha -- [ ] Implement the new feature gate and kubelet implementation -- [ ] Ensure proper tests are in place -- [ ] Update documentation to make the feature visible +- [X] Implement the new feature gate and kubelet implementation +- [X] Ensure proper tests are in place +- [X] Update documentation to make the feature visible + - + - + - #### Alpha to Beta Graduation -At least one container engine has to have implemented the -corresponding CRI APIs to introduce e2e test for checkpointing. +At least one container engine has implemented the corresponding CRI APIs: +- [x] CRI-O + +In Kubernetes: + +- [x] No major bugs reported in the previous cycle - [ ] Enable the feature per default -- [ ] No major bugs reported in the previous cycle +- [ ] Add separate sub-resource permission to control permissions + at +- [ ] Add necessary metrics as described in the PRR sections and update the KEP with the metrics + names once they exist + - [ ] Add CRI metrics + - [ ] Add kubelet metrics (these already exist under the name `checkpoint`) + #### Beta to GA Graduation -TBD +CRI-O as well as containerd have to have implemented the corresponding CRI APIs: + +- [x] CRI-O +- [ ] containerd () + +Ensure that e2e tests are working with + +- [x] CRI-O +- [ ] containerd () ### Upgrade / Downgrade Strategy @@ -292,7 +423,107 @@ Checkpointing containers will be possible again. ###### Are there any tests for feature enablement/disablement? -Currently no. +Currently the test will automatically be skipped if the feature is not enabled. +Tests will be extended to explicitly test if the feature is disabled as well +as if it is enabled. + +### Rollout, Upgrade and Rollback Planning + +If it is not enabled via the feature gate it will return `404 page not found`. +If it is not enabled in the underlying container engine a `500` will be returned +with an error message from the container engine. If it is enabled the API +endpoint exists; if it is disabled the endpoint does not exist.
No planning necessary. + +Documented at + + +###### How can a rollout or rollback fail? Can it impact already running workloads? + +The feature depends on the existence of the `CheckpointContainer` CRI RPC. +If the underlying container engine does not support it or if it is not implemented, +the kubelet API endpoint will fail with `500`. The error code is the same if the +container engine does not implement it explicitly or if the underlying container +engine is too old. The difference between those two failures is only visible in +the error message returned from the container engine. + +If the underlying container engine does not support the CRI RPC the kubelet API endpoint +will always return `500`. + +It cannot directly impact running workloads, but if the *kubelet* API endpoint is +called when the underlying container engine no longer supports it, the checkpoint +request will fail. + +###### What specific metrics should inform a rollback? + +CRI metrics will be added to track checkpointing failures to inform a rollback +decision. + +###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested? + +No data is stored, so re-enabling starts from a clean slate. + +###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.? + +No. + +### Monitoring Requirements + +Querying the state of the feature gate offers the possibility to detect +if the API endpoint will return `404` or not. + + + +###### How can an operator determine if the feature is in use by workloads? + +As it is not exposed in the Kubernetes API it cannot be determined. This is +only visible in the kubelet. Also, this is a feature workloads are not using +directly, but which only external entities can trigger. Access to the +*kubelet* API endpoints is needed. It might be detectable if operators can +query the state of different feature gates.
An operator could also use metrics +to determine that this feature is in use. Metrics seem to be already collected +at . + + + +###### How can someone using this feature know that it is working for their instance? + +The kubelet API endpoint can return following codes: + +- 200: checkpoint archive was successfully created +- 404: feature is not enabled +- 500: underlying container engine does not support checkpointing containers + +Documented at + +###### What are the reasonable SLOs (Service Level Objectives) for the enhancement? + +The expectation is that it should always succeed. A failed checkpoint does not +break the actual workload. A failed checkpoint only means that the checkpoint +request failed without effects on the workload. The expectation is also that +checkpointing either is always successful or never. From today's point of view this +means that the expectation is 100% availability or 0% availability. Experience +in Podman/Docker and other container engines so far indicates that. + +###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service? + +Currently the *kubelet* collects metrics in the bucket `checkpoint`. This can be +used to determine the health of the service. + +###### Are there any missing metrics that would be useful to have to improve observability of this feature? + +CRI stats will be added for this as well as kubelet metrics tracking whether an +operation failed or succeeded. ### Dependencies @@ -300,6 +531,10 @@ CRIU needs to be installed on the node, but on most distributions it is already a dependency of runc/crun. It does not require any specific services on the cluster. +###### Does this feature depend on any specific services running in the cluster? + +Yes, the container engine must support the checkpoint CRI API call. + ### Scalability ###### Will enabling / using this feature result in any new API calls? 
@@ -332,7 +567,102 @@ the size of all memory pages in the checkpointed container. Each file in the con has been changed compared to the original version will also be part of the checkpoint. Disk usage will overall increase by the used memory of the container and the changed files. Checkpoint archive written to disk can optionally be compressed. The current implementation -does not compress the checkpoint archive on disk. +does not compress the checkpoint archive on disk. The cluster administrator is +responsible for monitoring disk usage and removing excess data. + +The kubelet will request a checkpoint from the underlying CRI implementation. In +the checkpoint request the kubelet will specify the name of the checkpoint +archive as `checkpoint---.tar` and also +request to store the checkpoint archive in the `checkpoints` directory below its +root directory (as defined by `--root-dir`). This defaults to +`/var/lib/kubelet/checkpoints`. + +To avoid running out of disk space an operator has been introduced: + +###### Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)? + +During checkpointing each memory page will be written to disk. Disk usage will increase by +the size of all memory pages in the checkpointed container. Each file in the container that +has been changed compared to the original version will also be part of the checkpoint. +Disk usage will overall increase by the used memory of the container and the changed files. +Checkpoint archive written to disk can optionally be compressed. The current implementation +does not compress the checkpoint archive on disk. The cluster administrator is +responsible for monitoring disk usage and removing excess data. + +The kubelet will request a checkpoint from the underlying CRI implementation. 
In +the checkpoint request the kubelet will specify the name of the checkpoint +archive as `checkpoint---.tar` and also +request to store the checkpoint archive in the `checkpoints` directory below its +root directory (as defined by `--root-dir`). This defaults to +`/var/lib/kubelet/checkpoints`. + +To avoid running out of disk space an operator has been introduced: + +### Troubleshooting + + + +###### How does this feature react if the API server and/or etcd is unavailable? + +Like any other kubelet API endpoint this will fail if the API server is not available. + +###### What are other known failure modes? + +- The creation of the checkpoint archive can fail. + - Detection: Possible return codes are: + - 401: Unauthorized + - 404: Not Found (if the ContainerCheckpoint feature gate is disabled) + - 404: Not Found (if the specified namespace, pod or container cannot be found) + - 500: Internal Server Error (if the CRI implementation encounters an error + during checkpointing (see error message for further details)) + - 500: Internal Server Error (if the CRI implementation does not implement + the checkpoint CRI API (see error message for further details)) + - Also see: https://kubernetes.io/docs/reference/node/kubelet-checkpoint-api/ + - Mitigation: Do not checkpoint a container that cannot be checkpointed by CRIU. + - Diagnostics: The container engine will provide the location of the log file created + by CRIU with more details. + - Testing: Tests are currently covering if checkpointing is enabled in the kubelet + or not as well as covering if the underlying container engine supports the + corresponding CRI API calls. The most common checkpointing failure is if the + container is using an external hardware device like a GPU or InfiniBand which + usually do not exist in test systems. + +Checkpointing anything with access to an external hardware device like a GPU or +InfiniBand can fail. For each device a specific plugin needs to be added to CRIU.
+For AMD GPUs this exists already today, but other GPUs will fail to be checkpointed. +If an unsupported device is used in the container the *kubelet* API endpoint will +return `500` with additional information in the error message. + + + +###### What steps should be taken if SLOs are not being met to determine the problem? + +As checkpointing is an optional feature outside of the pod lifecycle SLOs probably should +not be impacted. If SLOs are impacted then administrators should no longer call the +checkpoint *kubelet* API endpoint. During Alpha and Beta phase the feature gate can +also be used to turn the feature off. At this point in time it is unclear, but for a +possible GA phase this may be a feature that needs to be opt-in or opt-out, something +that can be turned off during startup or runtime configuration. ## Implementation History @@ -350,6 +680,7 @@ does not compress the checkpoint archive on disk. * 2022-01-20: Reworked based on review and renamed feature gate to `ContainerCheckpoint` * 2022-04-05: Added CRI API section and targeted 1.25 * 2022-05-17: Remove *restore* RPC from the CRI API +* 2024-02-08: Graduation to Beta. ## Drawbacks diff --git a/keps/sig-node/2008-forensic-container-checkpointing/kep.yaml b/keps/sig-node/2008-forensic-container-checkpointing/kep.yaml index b75942e82c8..1f60a5b960d 100644 --- a/keps/sig-node/2008-forensic-container-checkpointing/kep.yaml +++ b/keps/sig-node/2008-forensic-container-checkpointing/kep.yaml @@ -7,7 +7,7 @@ participating-sigs: - TBD status: implementable creation-date: 2020-09-16 -last-updated: 2022-05-17 +last-updated: 2024-02-08 reviewers: - "@mrunalp" - "@elfinhe" @@ -15,18 +15,18 @@ approvers: - "@dchen1107" # The target maturity stage in the current dev cycle for this KEP. -stage: alpha +stage: beta # The most recent milestone for which work toward delivery of this KEP has been # done. This can be the current (upcoming) milestone, if it is being actively # worked on.
-latest-milestone: "v1.25" +latest-milestone: "v1.30" # The milestone at which this feature was, or is targeted to be, at each stage. milestone: alpha: "v1.25" - beta: "v1.26" - stable: "v1.28" + beta: "v1.30" + stable: "v1.33" # The following PRR answers are required at alpha release # List the feature gate name and the components for which it must be enabled