
[Feature Request] Implement tests for scs-0210-v2, scs-0214-v1 #481

Closed
mbuechse opened this issue Feb 23, 2024 · 12 comments
Labels: action required (Issues or pull requests that urgently require action), Container (Issues or pull requests relevant for Team 2: Container Infra and Tooling), enhancement (New feature or request), SCS is standardized, SCS-VP10 (Related to tender lot SCS-VP10)

Comments

@mbuechse
Contributor

This is a high-priority crash operation: find a way to implement reasonable tests by 2024-02-28.

  • scs-0210-v1 does have tests; what needs to be done for v2?
  • scs-0214-v1 might be hard to test; what's the "MVP" that can be achieved within this timeframe?

Request assistance from whatever parties necessary!

@cah-hbaum
Contributor

We integrated Sonobuoy yesterday, but for scs-0210-v2 we should probably still use and update the old Python test from v1. The problem is that updating the test isn't really useful, since v2 adds a section about the support period for a K8s version. Test-wise, this contradicts what is tested right now: v1 checks a given cluster and whether its version is up to date with the release timeline, whereas v2 demands that old versions are still supported and available from the provider, which isn't really something you can test (or at least I don't know how).

scs-0214-v1 is, as @mbuechse said, a bit more tricky. If it is doable, a test could be written in Go with Sonobuoy in mind. If there isn't time for that, a Python test would be fine and we can convert it later.
The main thing to test here is the requirement that the control plane be distributed over multiple nodes. The standard even calls for physical machines, but that could be quite hard to verify.
The best approach would be to check the IPs of the control plane (and maybe worker) nodes and look for differences there. This would at least guarantee that multiple VMs are used. Checking whether they are separated over different machines or even failure zones is harder and probably depends on the provider's infrastructure. If there were access to OpenStack as well (or another IaaS solution), we could check that with something like failure zones, but this is maybe too much for now (?). It would depend on the time available overall.
Whether these failure zones, or zones in general, are strictly separated over e.g. multiple racks or buildings is something we can't really check, since only the provider would know and the metadata necessary for that is probably not publicly available or can't be accessed for one reason or another.
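
A minimal sketch of the IP-based check described above, assuming the official Python kubernetes client and that control plane nodes carry one of the well-known role labels (both assumptions, not something the standard guarantees), could look like this:

```python
# Hypothetical sketch: verify that the control plane nodes report distinct
# internal IPs, i.e. that they at least run on separate VMs.
from kubernetes import client, config

config.load_kube_config()  # assumes a kubeconfig in the default location

control_plane_ips = set()
for node in client.CoreV1Api().list_node().items:
    labels = node.metadata.labels or {}
    # assumption: control plane nodes carry one of these well-known role labels
    if not ("node-role.kubernetes.io/control-plane" in labels
            or "node-role.kubernetes.io/master" in labels):
        continue
    for addr in node.status.addresses or []:
        if addr.type == "InternalIP":
            control_plane_ips.add(addr.address)

print(f"distinct control plane internal IPs: {len(control_plane_ips)}")
```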

@martinmo
Member

martinmo commented Feb 23, 2024

For conformance testing of scs-0210-v2 ("K8S Version Policy") I have the following proposal.

The check consists of two stages:

  1. Getting the list of currently supported K8s versions for a given KaaS provider (the new and maybe tricky part). It can optionally be saved into a YAML file.
  2. Deciding if this list, which can optionally be read from a YAML file, conforms to scs-0210-v2. For this, a lot of the existing v1 test implementation can be reused, because the basic predicate "Is k8s version X.Y.Z recent enough?" of course still needs to be decided.

Stage 1 has multiple strategies:

  • Strategy A: only works for providers that use Gardener to manage the customer's K8s clusters, which are called "shoots" in that context. According to the shoot admin docs, CloudProfiles should give us exactly the information that we need. (However, I currently do not have access to a Gardener setup to verify that idea. Any hints/help would be appreciated!)
  • Strategy B: works for any provider and requires them to give us access to three K8s clusters of different "profiles" plus the corresponding kubeconfigs with the suggested names stable, oldstable, oldoldstable (greetings to Debian). We then gather the K8s versions by connecting to these clusters (see the sketch below).
  • Strategy C (fallback): means manual collection and creation of said YAML file.

A downside of strategy B is the cost of providing these otherwise unused clusters and the manual labour required whenever the provider switches the stable profile, e.g., from 1.27 to 1.28.

The stages can be implemented independently of each other. Furthermore, the design allows us to add other strategies later on, if needed.
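
A minimal sketch of stage 1 with strategy B, assuming the official Python kubernetes client and one kubeconfig file per profile (the file names are just an illustration), could look like this:

```python
# Hypothetical sketch of stage 1, strategy B: collect the server version of
# each profile cluster and dump the result to a YAML file for stage 2.
import yaml
from kubernetes import client, config

PROFILES = ["stable", "oldstable", "oldoldstable"]  # suggested profile names

versions = {}
for profile in PROFILES:
    # assumption: one kubeconfig file per profile, named after the profile
    api_client = config.new_client_from_config(config_file=f"{profile}.kubeconfig")
    info = client.VersionApi(api_client).get_code()
    # info.minor may carry a "+" suffix on some distributions, hence the strip
    versions[profile] = f"{info.major}.{info.minor.rstrip('+')}"

with open("supported-k8s-versions.yaml", "w") as f:
    yaml.safe_dump(versions, f)
```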

@martinmo
Member

martinmo commented Feb 23, 2024

I've asked the members of the Team Container Matrix chat for feedback. Independently of that, I'll continue to implement the second stage of the described design.

@artificial-intelligence
Contributor

So for the first part I think every testing strategy must be as close to distribution-agnostic as it can be, so that different k8s deployment/distribution variants can be tested, as long as they are CNCF-certified compliant k8s variants. There is still a lot of movement in k8s distributions/installers development so it seems wise to stay agile and not be dependent on a single blessed cluster installer.

So imho "Strategy A" is already out.

another comment I have: notice that Kubernetes doesn't necessarily support "only" the last three releases, even if its release schedule says so; e.g., currently there are in fact four supported releases listed at https://kubernetes.io/releases/, even though version 1.26 will be EOL on 28.02.2024 (so next week).

I know this is pretty weird, but well, it is what it is.

In April, according to the schedule, k8s 1.30 will be released, bumping the number of supported releases to four again.

So keep that in mind while writing the tests 😉

So I'm strongly in favor of option "B", because we really can only do meaningful testing when we exercise the actual code, that is, connect to a running cluster and see which version it is. There's imho not much value gained from just checking a manually provided yaml file or something.

However let me also add some more notes.

Please bear with me, but I think the standard mandates as a MUST:

Kubernetes versions MUST be supported as long as the official sources support them. The current support period can therefore be found in [Kubernetes Support Period].

afaik we currently only test whether all the versions provided by a CSP are still supported; we do not test, on the day before a release goes EOL, whether it is still listed/possible to create at the CSP.

This also makes me realize that it is not really defined what "supported" means here, because arguably you don't want customers to create a new cluster on a version that will be EOL tomorrow.

A more reasonable definition of "supported" might include providing security updates for cluster versions until they are EOL and being able to reinstall the "same cluster" until it is EOL, but maybe restricting new installations of a soon-to-be-EOL version to e.g. one month prior to EOL?

I realize we will probably not be able to extend the tests even further, but I think it is still worth mentioning so that we can maybe fix this later by either changing the standard or extending the tests.

PS: I notice the standard also has a missing link for [Kubernetes Support Period].
I'll provide a PR to fix that.

@martinmo
Member

Thanks @artificial-intelligence for your feedback and perspective.

There is still a lot of movement in k8s distributions/installers development so it seems wise to stay agile and not be dependent on a single blessed cluster installer.

That's a fair point, I agree.

So I'm strongly in favor of option "B", because we really can only do meaningful testing when we exercise the actual code, that is, connect to a running cluster and see which version it is.

I'll focus on this (to be honest it is also my favorite approach).

There's imho not much value gained from just checking a manually provided yaml file or something.

Still, it's at least useful for me as a test implementer, for testing the tests ;-)

[…] notice that Kubernetes doesn't necessarily support "only" the last three releases, even if its release schedule says so […]

Ack. Because testing only three "profiles" is not enough, I will extend the set of profiles with another one, oldoldoldstable (a prettier name will come soon). It will be tested only if it makes sense, i.e., if there are four non-EOL minor k8s versions at the time the check is performed.

afaik we currently only test whether all the versions provided by a CSP are still supported; we do not test, on the day before a release goes EOL, whether it is still listed/possible to create at the CSP.
This also makes me realize that it is not really defined what "supported" means here, because arguably you don't want customers to create a new cluster on a version that will be EOL tomorrow.

I agree with you. At the moment, I think "supported" in the standard roughly means "a minor version of k8s which is not EOL". Using this definition might not be the best choice for a customer, for the reasons you pointed out. However, I also think this is a change to the standard itself that needs to be discussed in a separate issue.

@garloff
Member

garloff commented Feb 23, 2024

The main requirement here is to get providers to deliver patches (and minor releases) fast enough. So that would be my primary concern for testing. A provider disabling the provisioning of an old version a bit too early is not nice, but not nearly as likely and also not nearly as problematic for the user.

So testing strategy would be from my perspective:

  1. At the test date, determine which versions should be available. This will cover 3 to 4 minor k8s versions with the latest patch level (and sometimes the patch level one before may still be allowable).
  2. Try to create clusters with these k8s versions. (Sometimes, if this fails and the previous patch level is still allowable, retry with it).
  3. All tests must result in working clusters (sometimes after retrying, see above).

There is one tiny challenge: Step 2 is implementation dependent. We can implement a test for cluster stacks (and we could also cover KaaS v1 if we wanted to).
So I guess we would do it for our own solution and require 3rd-party KaaS providers to contribute a test case for their specific cluster management before we certify them...
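
For step 1 of the strategy above, a minimal sketch with purely illustrative EOL dates (a real test would derive them from the official Kubernetes release data) might look like this:

```python
# Hypothetical sketch of step 1: determine which minor releases a provider
# should still offer at the test date. The EOL dates below are illustrative only.
import datetime

EOL_DATES = {
    "1.26": datetime.date(2024, 2, 28),
    "1.27": datetime.date(2024, 6, 28),
    "1.28": datetime.date(2024, 10, 28),
    "1.29": datetime.date(2025, 2, 28),
}

def expected_minors(test_date: datetime.date) -> list[str]:
    """Return the minor versions that are not yet EOL at the given date."""
    return sorted(minor for minor, eol in EOL_DATES.items() if eol >= test_date)

print(expected_minors(datetime.date(2024, 2, 27)))
# -> ['1.26', '1.27', '1.28', '1.29']
```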

@jschoone jschoone added the Container Issues or pull requests relevant for Team 2: Container Infra and Tooling label Feb 24, 2024
@martinmo
Member

Thanks @garloff for the feedback! I agree with the primary goals you described.

However, steps 2 and 3 of that test method would deviate a lot from the current implementation for scs-0210-v1, which merely checks the version of an existing cluster. My proposal for the v2 tests is an extension of this to multiple clusters.

To summarize and categorize the current proposed approaches for scs-0210-v2 testing:

  1. Check supported versions by gathering metadata.
    a. Get metadata from a KaaS management interface such as Gardener (impl specific).
    b. Get metadata by connecting to multiple existing clusters.
  2. Check supported versions by trying to create clusters with a KaaS management interface such as cluster stacks and smoke testing them (impl specific).

Because we're in a hurry, I think we need to settle for one of the simpler approaches, but leave the implementation open. @mbuechse what do you think?

@martinmo
Member

After meeting with @jschoone we concluded that it makes most sense if I stick to option 1b, i.e., "Get metadata by connecting to multiple existing clusters", because it is the simplest one to implement. It also mimics the way we connect to OpenStack for the IaaS conformance tests (we use a central clouds.yaml with multiple clouds plus a secrets.yaml created from GH secrets in the GH Actions workflow).

As I've stated before, other options/strategies can be added later on. However, in this context I also want to note that creating K8s clusters during the conformance tests will add delays and an additional source of (transient) errors.

For the other standard that needs to be tested (scs-0214-v1), my proposal involves simply querying the following well-known labels for the master and worker nodes:

  • kubernetes.io/hostname
  • topology.kubernetes.io/zone
  • topology.kubernetes.io/region

Unfortunately, we would need to revamp the standard to mandate that these labels need to be set. To my knowledge, this is not the default case.
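
A minimal sketch of such a label query, assuming the official Python kubernetes client and that control plane nodes carry the well-known node-role.kubernetes.io/control-plane label (itself an assumption), could look like this:

```python
# Hypothetical sketch: collect the well-known topology labels of all control
# plane nodes and report how many distinct values each label spans.
from collections import defaultdict
from kubernetes import client, config

TOPOLOGY_LABELS = [
    "kubernetes.io/hostname",
    "topology.kubernetes.io/zone",
    "topology.kubernetes.io/region",
]

config.load_kube_config()  # assumes a kubeconfig in the default location
distribution = defaultdict(set)
for node in client.CoreV1Api().list_node().items:
    labels = node.metadata.labels or {}
    # assumption: control plane nodes carry this well-known role label
    if "node-role.kubernetes.io/control-plane" not in labels:
        continue
    for key in TOPOLOGY_LABELS:
        if key in labels:
            distribution[key].add(labels[key])

for key in TOPOLOGY_LABELS:
    print(f"{key}: {len(distribution[key])} distinct value(s) across the control plane")
```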

@cah-hbaum
Contributor

For the other standard that needs to be tested (scs-0214-v1), my proposal involves simply querying the following well-known labels for the master and worker nodes:

kubernetes.io/hostname
topology.kubernetes.io/zone
topology.kubernetes.io/region

Unfortunately, we would need to revamp the standard to mandate that these labels need to be set. To my knowledge, this is not the default case.

That sounds like a good idea. I will try to implement this tomorrow (no time today). Sadly, like you said, this also more or less mandates the creation of a v2 of the standard (since it is already stabilized and therefore not really changeable). So we would probably create this test initially, but after the certification scope has gone through, a revamp of this test and the standard will probably be necessary.

@cah-hbaum
Contributor

Had a short exchange with @mbuechse yesterday; we probably won't do a v2 of the standard yet. Instead, I will write an update for the standard v1 and put that into a separate commit. We will then see how that is accepted, and if it's fine, we will just update the standard.

cah-hbaum added a commit that referenced this issue Feb 27, 2024
This provides a basic node distribution test only operating based on the labels of the kubernetes cluster provided. A test will come back negative, if no distribution according to the labels can be detected.

Signed-off-by: Hannes Baum <hannes.baum@cloudandheat.com>
cah-hbaum added a commit that referenced this issue Feb 27, 2024
The test section of the standard also needs to be updated, even if the standard is already stabilized.

Signed-off-by: Hannes Baum <hannes.baum@cloudandheat.com>
@cah-hbaum
Contributor

PR created: #486
It can already be reviewed; if errors are found, I can try to fix them around 3-4 PM today or later this evening.

cah-hbaum added commits that referenced this issue Feb 27, 2024
Signed-off-by: Hannes Baum <hannes.baum@cloudandheat.com>
@mbuechse mbuechse self-assigned this Feb 28, 2024
cah-hbaum added a commit that referenced this issue Feb 28, 2024
This provides a basic node distribution test only operating based on the labels of the kubernetes cluster provided. A test will come back negative, if no distribution according to the labels can be detected.

Signed-off-by: Hannes Baum <hannes.baum@cloudandheat.com>
cah-hbaum added a commit that referenced this issue Feb 28, 2024
The test section of the standard also needs to be updated, even if the standard is already stabilized.

Signed-off-by: Hannes Baum <hannes.baum@cloudandheat.com>
cah-hbaum added a commit that referenced this issue Feb 28, 2024
* Provide node distribution test (#481)

This provides a basic node distribution test only operating based on the labels of the kubernetes cluster provided. A test will come back negative, if no distribution according to the labels can be detected.

Signed-off-by: Hannes Baum <hannes.baum@cloudandheat.com>

* Update the test section in the standard (#481)

The test section of the standard also needs to be updated, even if the standard is already stabilized.

Signed-off-by: Hannes Baum <hannes.baum@cloudandheat.com>

---------

Signed-off-by: Hannes Baum <hannes.baum@cloudandheat.com>
mbuechse added a commit that referenced this issue Feb 28, 2024
resolves #481

Signed-off-by: Matthias Büchse <matthias.buechse@cloudandheat.com>
mbuechse added a commit that referenced this issue Feb 28, 2024
resolves #481

Signed-off-by: Matthias Büchse <matthias.buechse@cloudandheat.com>
@mbuechse mbuechse moved this from Backlog to Doing in Sovereign Cloud Stack Feb 29, 2024
martinmo added a commit that referenced this issue Mar 5, 2024
This adds the missing conformance tests for the version policy in scs-0210-v2, reusing the CVE collection and parsing code from the scs-0210-v1 Python script written by @cah-hbaum.

Refs: #481
Closes: SovereignCloudStack/issues#505

Co-authored-by: Hannes Baum <hannes.baum@cloudandheat.com>
Co-authored-by: Matthias Büchse <matthias.buechse@cloudandheat.com>
Signed-off-by: Martin Morgenstern <martin.morgenstern@cloudandheat.com>
@martinmo
Member

martinmo commented Mar 5, 2024

Closing this because working tests have been merged with #486 and #488. Limitations of the scs-0214-v1 aka node distribution tests will be worked on in #494.

@martinmo martinmo closed this as completed Mar 5, 2024
@github-project-automation github-project-automation bot moved this from Doing to Done in Sovereign Cloud Stack Mar 5, 2024
@martinmo martinmo mentioned this issue Jun 10, 2024