From 1aefd1d34afb3f4cb91457c093c08658bbc1973d Mon Sep 17 00:00:00 2001 From: Oliver Kautz Date: Thu, 30 Nov 2023 16:23:44 +0100 Subject: [PATCH 1/8] Add decisions on the contents of MVP-0 Signed-off-by: Oliver Kautz --- ...cs-0403-v1-csp-kaas-observability-stack.md | 118 ++++++++++++++++++ 1 file changed, 118 insertions(+) create mode 100644 Standards/scs-0403-v1-csp-kaas-observability-stack.md diff --git a/Standards/scs-0403-v1-csp-kaas-observability-stack.md b/Standards/scs-0403-v1-csp-kaas-observability-stack.md new file mode 100644 index 000000000..4fd0f4089 --- /dev/null +++ b/Standards/scs-0403-v1-csp-kaas-observability-stack.md @@ -0,0 +1,118 @@ +--- +title: Architecture for the Cloud Service provider Observability System for the KaaS Layer +type: Decision Record +status: Draft +track: Ops +--- + +# Introduction + +Cloud Service Providers offer a variaty of products to a customer. Those can include compute resources like virtual machines, networking and identity and access management. As customers of those services build their applications upon those offered services the service provider need to ensure a certain quality level of their offerings. This is done by observing the infrastructure. Observability systems are leverage different type of telemetry data which include: + +1. Metrics: Usually time series data about different parameters of a system which can include e.g. CPU usage, number of active requests, health status, etc. +2. Logs: Messages of software events during runtime +3. Traces: More developer oriented form of logging to provide insights to an application or to analyze request flows in distributed systems. + +Based on those data, an alerting system can be used to to send out notifications to an Operations Team if a system behaves abnormally. Base on the telemetry data the Operations Team can find the issue, work on it and mitigate future incidents. 
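As an illustration of the alerting mechanism described above, a Prometheus-style alerting rule could look roughly like the following. This is only a sketch: the group name, alert name, threshold, and labels are invented for illustration and are not part of this decision record.

```yaml
# Hypothetical Prometheus alerting rule: fires when a scrape target has been
# unreachable for 5 minutes, so Alertmanager can notify the Operations Team.
groups:
  - name: node-health            # example group name
    rules:
      - alert: InstanceDown      # example alert name
        expr: up == 0            # standard Prometheus "target up" metric
        for: 5m                  # example tolerance before firing
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} is down"
          description: "{{ $labels.instance }} has been unreachable for more than 5 minutes."
```

Alertmanager would then route such alerts to the appropriate receiver (e-mail, chat, paging) according to its own routing configuration.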
+ +## Motivation + +Currently, only the IaaS Layer of the SCS Reference Implementation has a an Observability Stack consisting of tools like Prometheus, Grafana, and Alertmanager as well as several Exporters toextract monitoring data from the several OpenStack components and additional software that is involved in the Reference Implementation. As the Kubernetes as a Service Layer becomes more and more important and the work on the Cluster API approach to create customer clusters progresses further, an observability solution for this layer is also needed. CSP should be able to watch over customer clusters and intervene if cluster get in a malfunctioning state. For this, a toolset and architecture is need which is proposed in this ADR. + +## Requirements + +A survey was conducted to gather the needs and requirements of a CSP when providing Kubernetes as a Service. The results of the Survey (Questions with answers) were the following: + +1. What is your understanding of a managed Kubernetes Offering: + - Hassle-Free Installation and Maintainance (customer viewpoint); Providing Controlplane and worker nodes and responsibility for correct function but agnostic to workload + - Day0, 1 and 2 (~planning, provisioning, operations) full lifecyle management or let customer manages some parts of that, depending on customer contract + +2. What Type and Depth of observability is needed + - CPU, RAM, HDD and Network usage, Health and Function of Cluster Nodes, Controlplane and if desired Customer Workload + +3. Do you have an observabiltiy infrastructure, if yes, how it is built + - Grafana/Thanos/Prometheus/Loki/Promtail/Alertmanger Stack, i.e. [Example Infrastructure](https://raw.githubusercontent.com/dNationCloud/kubernetes-monitoring-stack/main/thanos-deployment-architecture.svg) + +4. Data Must haves + - CPU, RAM, Disk, Network + - HTTP Connectivity Metrics + - Control Plane and Pod metrics (States, Ready, etc.) 
+ - Workload specific metrics + - Node Stats + - K8s resources (exporters, kubestate metrics, cadvisor, parts of the kubelet) + - Ingress controller exporter (http error rate, cert metrics like expiration date) + - K8s certs metrics + - Metrics of underlying node + - Logs of control plane, kubelet and containerd + +5. Must Not haves + - Secrets, otherwise as much as possible for anomaly detection over long time data + +6. Must have Alerts + - Dependent on SLAs and SLA Types, highly individual + - Use of [kubernetes-mixin alerts](https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/alerts) and [dNation Alerts Ruleset](https://github.com/dNationCloud/kubernetes-monitoring/tree/main/jsonnet/rules) + +7. Must NOT Alert on + - Should not wake people, nothing that does not lead to Action items + +8. Observability from Within Or Outside KaaS. How does the architecture look like? + - Monitoring Infra on CSP Side + - Data from Customer Clusters and Mon Infra on CSP and KaaS, get both data. KaaS Monitoring can also be used by customer + +9. Special Constraints + - HA Setup in different Clusters on Different Sites + +## Design Considerations + +As the software components involved for the Observability solution were clear as Prometheus, Thanos, Loki, Promtail, Grafana, and Alertmanager are the industry standard tools to implement a observability solution in a Cloud-Native fashion on Kubernetes. + +The important question is how those tools are utilize and combined to an architecture to provide the needs stated in the requirements above. As dNation has a comprehensive observability system using the aforementioned tools as well as a set of dynamic dashboards. We decided on in the SIG Monitoring and Team Ops meetings that we want to leverage their toolbox as we don't wanted to start on a green field and the tooling is already available as Open Source Software. 
Their observability Stack is mainly used to observe several customer clusters and customer applications which includes all observability data that is needed. Typically those clusters are set up for the customer beforehand and the observability tools are installed for customer manually. + +For use of a CSP that provides Kubernetes as a Service the provisioning of the observability tools and the onboarding of a customer cluster need to be fully automated. For a customer, all the tools on their Kubernetes cluster needs to be installed at creation time and the observability data of that cluster needs to present in the Observer Cluster immediately. + +### Options considered + +#### Short Term Query Architecture + +In this setup, each customer cluster have Thanos and Prometheus installed in addition to Thanos and Prometheus on the Obvserver Cluster. The customer clusters Thanos installation is used for short term queries, as for long term queries the data of all Thanos instances are stored in an external Object Store of the CSP. + +#### Hybrid Approach (query for short term metrics & remote write of metrics for KaaS) + +Here, Thanos and Prometheus are only used on the CSP side to store and manage all observability data. For the customer clusters only the Prometheus Agent will be used. This introduces less complexity and resource consumption on the customer workload clusters. + +## Decisions + +1. The Hybrid approach was chosen +2. The Observability stack will be created based on the dNation observability stack +3. 
The MVP-0 will consist of the following features: + - Observability data from KaaS Clusters is scraped + - K8s cluster that hosts observer deployment is deployed + - S3 compatible bucket as a storage for long term metrics is configured + - thanos query-frontend is deployed and configured + - thanos query is deployed and configured + - thanos reciever is deployed and configured (simple deployment, non HA, without router) + - thanos ruler is deployed and configured + - thanos compactor is deployed and configured + - thanos bucket-web is deployed and configured + - thanos storegateway is deployed and configured + - prometheus server is deployed and configured + - prometheus alertmanager is deployed and configured + - prometheus black-box exporter is deployed and configured + - kaas-metric-importer is deployed and configured (service aims to differentiate between intentional deletion of KaaS instances and failures in the KaaS monitoring agent) + - Alerts are defined on the KaaS Clusters metrics + - all prometheus alerts are working as expected + - There exist Dashboards for KaaS Cluster Health + - KaaS L0 dashboard counters are working correctly + - Dedicated L0 dashboards are deployed for KaaS and for IaaS monitoring layers + - There exist Dashboards for SCS services endpoinds health (BlackBox exporter) + - There exist Dashboards for IaaS layer health + - Automatic Setup of Exporters for Observability of managed K8s clusters + - KaaS service is mocked + - VM that will host a mock of KaaS service is deployed + - a script that deploys a multiple KinD clusters and register them in observer is created + - Automatic Setup of Thanos sidecar for Observability of IaaS layer (testbed) + - IaaS service is mocked + - OSISM testbed is deployed + - implement an option to deploy thanos sidecar with some simple config in OSISM testbed + - There exist Dashboards for Harbor Registry Health + - Alerts are defined on the Harbor Registry metrics From 
669ccd9bd87e324a7ef50424b2603c1277bef34a Mon Sep 17 00:00:00 2001 From: Oliver Kautz Date: Thu, 30 Nov 2023 17:02:08 +0100 Subject: [PATCH 2/8] Add decision about use for IaaS layer Signed-off-by: Oliver Kautz --- .../scs-0403-v1-csp-kaas-observability-stack.md | 16 +++++++++++++--- 1 file changed, 13 insertions(+), 3 deletions(-) diff --git a/Standards/scs-0403-v1-csp-kaas-observability-stack.md b/Standards/scs-0403-v1-csp-kaas-observability-stack.md index 4fd0f4089..c7cd5bb0f 100644 --- a/Standards/scs-0403-v1-csp-kaas-observability-stack.md +++ b/Standards/scs-0403-v1-csp-kaas-observability-stack.md @@ -74,17 +74,27 @@ For use of a CSP that provides Kubernetes as a Service the provisioning of the o #### Short Term Query Architecture -In this setup, each customer cluster have Thanos and Prometheus installed in addition to Thanos and Prometheus on the Obvserver Cluster. The customer clusters Thanos installation is used for short term queries, as for long term queries the data of all Thanos instances are stored in an external Object Store of the CSP. +In this setup, each customer cluster have Thanos and Prometheus installed in addition to Thanos and Prometheus on the Observer Cluster. The customer clusters Thanos installation is used for short term queries, as for long term queries the data of all Thanos instances are stored in an external Object Store of the CSP. #### Hybrid Approach (query for short term metrics & remote write of metrics for KaaS) Here, Thanos and Prometheus are only used on the CSP side to store and manage all observability data. For the customer clusters only the Prometheus Agent will be used. This introduces less complexity and resource consumption on the customer workload clusters. +#### Scope of the Observability Architecture + +The Observability Cluster and Archtiecture should be defined such that it can be used to not only observe the Kubernetes Layer of an SCS Stack, but also the IaaS and other Layers. 
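One possible realization of such a multi-layer scope is a single Thanos Query instance on the Observer Cluster that fans out to the stores of all layers. The following Deployment fragment is only a sketch: the image tag and the service names (`thanos-receiver`, `iaas-thanos-sidecar`, `thanos-storegateway`) are assumptions for illustration, not the mandated setup.

```yaml
# Sketch: one Thanos Query fanning out to KaaS and IaaS metric stores.
# All endpoint names below are hypothetical examples.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: thanos-query
spec:
  replicas: 2
  selector:
    matchLabels: {app: thanos-query}
  template:
    metadata:
      labels: {app: thanos-query}
    spec:
      containers:
        - name: thanos-query
          image: quay.io/thanos/thanos:v0.32.5
          args:
            - query
            - --http-address=0.0.0.0:10902
            - --grpc-address=0.0.0.0:10901
            # KaaS metrics pushed via remote write land in the receiver
            - --endpoint=thanos-receiver:10901
            # IaaS Prometheus exposed through a Thanos sidecar
            - --endpoint=iaas-thanos-sidecar:10901
            # long term data served by the store gateway
            - --endpoint=thanos-storegateway:10901
```

Because Thanos Query only needs a gRPC Store API endpoint per source, additional layers can be attached later by appending further `--endpoint` flags.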
+ +#### Observing the Observability Infrastructure + +For a productive usage, it needs to be possible to observe the Observability Cluster itself. + ## Decisions -1. The Hybrid approach was chosen +1. The Hybrid approach was chosen over Short Term Query Architecture 2. The Observability stack will be created based on the dNation observability stack -3. The MVP-0 will consist of the following features: +3. The observability stack can be used as a standalone component to use with the Kubernetes Layer. It should be possible to observe other parts of an SCS Stack like the status of the OpenStack components, but this will not be mandatory. +4. The observability Stack should be designed that it is possible to provision to observer clusters side by side, observing each other. To do this is only a recommendation for productive usage. +5. The MVP-0 will consist of the following features: - Observability data from KaaS Clusters is scraped - K8s cluster that hosts observer deployment is deployed - S3 compatible bucket as a storage for long term metrics is configured From 2f51f483474e1c070df2492c85b40c4dc04f62da Mon Sep 17 00:00:00 2001 From: Oliver Kautz Date: Thu, 30 Nov 2023 17:04:04 +0100 Subject: [PATCH 3/8] fix typos Signed-off-by: Oliver Kautz --- Standards/scs-0403-v1-csp-kaas-observability-stack.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/Standards/scs-0403-v1-csp-kaas-observability-stack.md b/Standards/scs-0403-v1-csp-kaas-observability-stack.md index c7cd5bb0f..f58ba06eb 100644 --- a/Standards/scs-0403-v1-csp-kaas-observability-stack.md +++ b/Standards/scs-0403-v1-csp-kaas-observability-stack.md @@ -90,10 +90,10 @@ For a productive usage, it needs to be possible to observe the Observability Clu ## Decisions -1. The Hybrid approach was chosen over Short Term Query Architecture -2. The Observability stack will be created based on the dNation observability stack -3. 
The observability stack can be used as a standalone component to use with the Kubernetes Layer. It should be possible to observe other parts of an SCS Stack like the status of the OpenStack components, but this will not be mandatory. -4. The observability Stack should be designed that it is possible to provision to observer clusters side by side, observing each other. To do this is only a recommendation for productive usage. +1. The **Hybrid Approach** was chosen over Short Term Query Architecture +2. The Observability Stack will be created based on the dNation observability stack +3. The Observability Stack can be used as a standalone component to use with the Kubernetes Layer. It should be possible to observe other parts of an SCS Stack like the status of the OpenStack components, but this will not be mandatory. +4. The Observability Stack should be designed that it is possible to provision to observer clusters side by side, observing each other. To do this is only a recommendation for productive usage. 5. The MVP-0 will consist of the following features: - Observability data from KaaS Clusters is scraped - K8s cluster that hosts observer deployment is deployed From 1d17d89adaf09b9a00b69bda2b7abd2ecdd1e73f Mon Sep 17 00:00:00 2001 From: Oliver Kautz Date: Thu, 30 Nov 2023 17:12:26 +0100 Subject: [PATCH 4/8] Formatting Headings Signed-off-by: Oliver Kautz --- ...scs-0403-v1-csp-kaas-observability-stack.md | 18 +++++++++--------- 1 file changed, 9 insertions(+), 9 deletions(-) diff --git a/Standards/scs-0403-v1-csp-kaas-observability-stack.md b/Standards/scs-0403-v1-csp-kaas-observability-stack.md index f58ba06eb..d6549d89b 100644 --- a/Standards/scs-0403-v1-csp-kaas-observability-stack.md +++ b/Standards/scs-0403-v1-csp-kaas-observability-stack.md @@ -15,11 +15,11 @@ Cloud Service Providers offer a variaty of products to a customer. 
Those can inc Based on those data, an alerting system can be used to to send out notifications to an Operations Team if a system behaves abnormally. Base on the telemetry data the Operations Team can find the issue, work on it and mitigate future incidents. -## Motivation +# Motivation Currently, only the IaaS Layer of the SCS Reference Implementation has a an Observability Stack consisting of tools like Prometheus, Grafana, and Alertmanager as well as several Exporters toextract monitoring data from the several OpenStack components and additional software that is involved in the Reference Implementation. As the Kubernetes as a Service Layer becomes more and more important and the work on the Cluster API approach to create customer clusters progresses further, an observability solution for this layer is also needed. CSP should be able to watch over customer clusters and intervene if cluster get in a malfunctioning state. For this, a toolset and architecture is need which is proposed in this ADR. -## Requirements +# Requirements A survey was conducted to gather the needs and requirements of a CSP when providing Kubernetes as a Service. The results of the Survey (Questions with answers) were the following: @@ -62,7 +62,7 @@ A survey was conducted to gather the needs and requirements of a CSP when provid 9. Special Constraints - HA Setup in different Clusters on Different Sites -## Design Considerations +# Design Considerations As the software components involved for the Observability solution were clear as Prometheus, Thanos, Loki, Promtail, Grafana, and Alertmanager are the industry standard tools to implement a observability solution in a Cloud-Native fashion on Kubernetes. @@ -70,25 +70,25 @@ The important question is how those tools are utilize and combined to an archite For use of a CSP that provides Kubernetes as a Service the provisioning of the observability tools and the onboarding of a customer cluster need to be fully automated. 
For a customer, all the tools on their Kubernetes cluster needs to be installed at creation time and the observability data of that cluster needs to present in the Observer Cluster immediately. -### Options considered +## Options considered -#### Short Term Query Architecture +### Short Term Query Architecture In this setup, each customer cluster have Thanos and Prometheus installed in addition to Thanos and Prometheus on the Observer Cluster. The customer clusters Thanos installation is used for short term queries, as for long term queries the data of all Thanos instances are stored in an external Object Store of the CSP. -#### Hybrid Approach (query for short term metrics & remote write of metrics for KaaS) +### Hybrid Approach (query for short term metrics & remote write of metrics for KaaS) Here, Thanos and Prometheus are only used on the CSP side to store and manage all observability data. For the customer clusters only the Prometheus Agent will be used. This introduces less complexity and resource consumption on the customer workload clusters. -#### Scope of the Observability Architecture +### Scope of the Observability Architecture The Observability Cluster and Archtiecture should be defined such that it can be used to not only observe the Kubernetes Layer of an SCS Stack, but also the IaaS and other Layers. -#### Observing the Observability Infrastructure +### Observing the Observability Infrastructure For a productive usage, it needs to be possible to observe the Observability Cluster itself. -## Decisions +# Decisions 1. The **Hybrid Approach** was chosen over Short Term Query Architecture 2. 
The Observability Stack will be created based on the dNation observability stack From 6f10e9fb58e32f7188073e3a8f48bf6195c1cba2 Mon Sep 17 00:00:00 2001 From: Oliver Kautz Date: Thu, 30 Nov 2023 17:18:19 +0100 Subject: [PATCH 5/8] Fix headings Signed-off-by: Oliver Kautz --- ...cs-0403-v1-csp-kaas-observability-stack.md | 20 +++++++++---------- 1 file changed, 10 insertions(+), 10 deletions(-) diff --git a/Standards/scs-0403-v1-csp-kaas-observability-stack.md b/Standards/scs-0403-v1-csp-kaas-observability-stack.md index d6549d89b..ce8eebe34 100644 --- a/Standards/scs-0403-v1-csp-kaas-observability-stack.md +++ b/Standards/scs-0403-v1-csp-kaas-observability-stack.md @@ -5,7 +5,7 @@ status: Draft track: Ops --- -# Introduction +## Introduction Cloud Service Providers offer a variaty of products to a customer. Those can include compute resources like virtual machines, networking and identity and access management. As customers of those services build their applications upon those offered services the service provider need to ensure a certain quality level of their offerings. This is done by observing the infrastructure. Observability systems are leverage different type of telemetry data which include: @@ -15,11 +15,11 @@ Cloud Service Providers offer a variaty of products to a customer. Those can inc Based on those data, an alerting system can be used to to send out notifications to an Operations Team if a system behaves abnormally. Base on the telemetry data the Operations Team can find the issue, work on it and mitigate future incidents. -# Motivation +## Motivation Currently, only the IaaS Layer of the SCS Reference Implementation has a an Observability Stack consisting of tools like Prometheus, Grafana, and Alertmanager as well as several Exporters toextract monitoring data from the several OpenStack components and additional software that is involved in the Reference Implementation. 
As the Kubernetes as a Service Layer becomes more and more important and the work on the Cluster API approach to create customer clusters progresses further, an observability solution for this layer is also needed. CSP should be able to watch over customer clusters and intervene if cluster get in a malfunctioning state. For this, a toolset and architecture is need which is proposed in this ADR. -# Requirements +## Requirements A survey was conducted to gather the needs and requirements of a CSP when providing Kubernetes as a Service. The results of the Survey (Questions with answers) were the following: @@ -62,7 +62,7 @@ A survey was conducted to gather the needs and requirements of a CSP when provid 9. Special Constraints - HA Setup in different Clusters on Different Sites -# Design Considerations +## Design Considerations As the software components involved for the Observability solution were clear as Prometheus, Thanos, Loki, Promtail, Grafana, and Alertmanager are the industry standard tools to implement a observability solution in a Cloud-Native fashion on Kubernetes. @@ -70,25 +70,25 @@ The important question is how those tools are utilize and combined to an archite For use of a CSP that provides Kubernetes as a Service the provisioning of the observability tools and the onboarding of a customer cluster need to be fully automated. For a customer, all the tools on their Kubernetes cluster needs to be installed at creation time and the observability data of that cluster needs to present in the Observer Cluster immediately. -## Options considered +### Options considered -### Short Term Query Architecture +#### Short Term Query Architecture In this setup, each customer cluster have Thanos and Prometheus installed in addition to Thanos and Prometheus on the Observer Cluster. The customer clusters Thanos installation is used for short term queries, as for long term queries the data of all Thanos instances are stored in an external Object Store of the CSP. 
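In such a setup, each Thanos instance would ship its TSDB blocks to the CSP's external object store via an object storage configuration roughly like the following. Bucket name, endpoint, and credential names are placeholders, not prescribed values.

```yaml
# Hypothetical Thanos object storage configuration (objstore.yml),
# passed to the sidecar / store gateway via --objstore.config-file.
type: S3
config:
  bucket: csp-kaas-long-term-metrics   # placeholder bucket name
  endpoint: s3.csp.example.com         # placeholder S3-compatible endpoint
  access_key: OBSERVABILITY_ACCESS_KEY # in practice injected from a secret
  secret_key: OBSERVABILITY_SECRET_KEY # in practice injected from a secret
```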
-### Hybrid Approach (query for short term metrics & remote write of metrics for KaaS) +#### Hybrid Approach (query for short term metrics & remote write of metrics for KaaS) Here, Thanos and Prometheus are only used on the CSP side to store and manage all observability data. For the customer clusters only the Prometheus Agent will be used. This introduces less complexity and resource consumption on the customer workload clusters. -### Scope of the Observability Architecture +#### Scope of the Observability Architecture The Observability Cluster and Archtiecture should be defined such that it can be used to not only observe the Kubernetes Layer of an SCS Stack, but also the IaaS and other Layers. -### Observing the Observability Infrastructure +#### Observing the Observability Infrastructure For a productive usage, it needs to be possible to observe the Observability Cluster itself. -# Decisions +## Decisions 1. The **Hybrid Approach** was chosen over Short Term Query Architecture 2. The Observability Stack will be created based on the dNation observability stack From b6d223e419f88088cdeee3e0151fb60d028fa09d Mon Sep 17 00:00:00 2001 From: Oliver Kautz <69149308+o-otte@users.noreply.github.com> Date: Tue, 19 Dec 2023 09:53:59 +0100 Subject: [PATCH 6/8] Apply suggestions from code review Fix Typos Co-authored-by: Matej Feder Co-authored-by: Sven Signed-off-by: Oliver Kautz <69149308+o-otte@users.noreply.github.com> --- ...scs-0403-v1-csp-kaas-observability-stack.md | 18 +++++++++--------- 1 file changed, 9 insertions(+), 9 deletions(-) diff --git a/Standards/scs-0403-v1-csp-kaas-observability-stack.md b/Standards/scs-0403-v1-csp-kaas-observability-stack.md index ce8eebe34..802971ca5 100644 --- a/Standards/scs-0403-v1-csp-kaas-observability-stack.md +++ b/Standards/scs-0403-v1-csp-kaas-observability-stack.md @@ -7,17 +7,17 @@ track: Ops ## Introduction -Cloud Service Providers offer a variaty of products to a customer. 
Those can include compute resources like virtual machines, networking and identity and access management. As customers of those services build their applications upon those offered services the service provider need to ensure a certain quality level of their offerings. This is done by observing the infrastructure. Observability systems are leverage different type of telemetry data which include: +Cloud Service Providers offer a variety of products to a customer. Those can include compute resources like virtual machines, networking, and identity and access management. As customers of those services build their applications upon those offered services the service provider needs to ensure a certain quality level of their offerings. This is done by observing the infrastructure. Observability systems leverage different types of telemetry data which include: 1. Metrics: Usually time series data about different parameters of a system which can include e.g. CPU usage, number of active requests, health status, etc. 2. Logs: Messages of software events during runtime -3. Traces: More developer oriented form of logging to provide insights to an application or to analyze request flows in distributed systems. +3. Traces: A more developer-oriented form of logging to provide insights into an application or to analyze request flows in distributed systems. -Based on those data, an alerting system can be used to to send out notifications to an Operations Team if a system behaves abnormally. Base on the telemetry data the Operations Team can find the issue, work on it and mitigate future incidents. +Based on those data, an alerting system can be used to send out notifications to an Operations Team if a system behaves abnormally. Based on the telemetry data the Operations Team can find the issue, work on it, and mitigate future incidents. 
## Motivation -Currently, only the IaaS Layer of the SCS Reference Implementation has a an Observability Stack consisting of tools like Prometheus, Grafana, and Alertmanager as well as several Exporters toextract monitoring data from the several OpenStack components and additional software that is involved in the Reference Implementation. As the Kubernetes as a Service Layer becomes more and more important and the work on the Cluster API approach to create customer clusters progresses further, an observability solution for this layer is also needed. CSP should be able to watch over customer clusters and intervene if cluster get in a malfunctioning state. For this, a toolset and architecture is need which is proposed in this ADR. +Currently, only the IaaS Layer of the SCS Reference Implementation has an Observability Stack consisting of tools like Prometheus, Grafana, and Alertmanager as well as several Exporters to extract monitoring data from the several OpenStack components and additional software that is involved in the Reference Implementation. As the Kubernetes as a Service Layer becomes more and more important and the work on the Cluster API approach to create customer clusters progresses further, an observability solution for this layer is also needed. CSP should be able to watch over customer clusters and intervene if a cluster gets in a malfunctioning state. For this, a toolset and architecture are needed which is proposed in this ADR. ## Requirements @@ -66,15 +66,15 @@ A survey was conducted to gather the needs and requirements of a CSP when provid As the software components involved for the Observability solution were clear as Prometheus, Thanos, Loki, Promtail, Grafana, and Alertmanager are the industry standard tools to implement a observability solution in a Cloud-Native fashion on Kubernetes. -The important question is how those tools are utilize and combined to an architecture to provide the needs stated in the requirements above. 
dNation has a comprehensive observability system using the aforementioned tools as well as a set of dynamic dashboards. We decided in the SIG Monitoring and Team Ops meetings that we want to leverage their toolbox, as we did not want to start on a green field and the tooling is already available as Open Source Software. Their observability Stack is mainly used to observe several customer clusters and customer applications, which includes all observability data that is needed. Typically those clusters are set up for the customer beforehand and the observability tools are installed for customers manually.

-For use of a CSP that provides Kubernetes as a Service the provisioning of the observability tools and the onboarding of a customer cluster need to be fully automated.
For a customer, all the tools on their Kubernetes cluster need to be installed at creation time and the observability data of that cluster needs to be present in the Observer Cluster immediately.

### Options considered

#### Short Term Query Architecture

In this setup, each customer cluster has Thanos and Prometheus installed in addition to Thanos and Prometheus on the Observer Cluster. The customer cluster's Thanos installation is used for short term queries, while for long term queries the data of all Thanos instances is stored in an external Object Store of the CSP.

#### Hybrid Approach (query for short term metrics & remote write of metrics for KaaS)

Here, Thanos and Prometheus are only used on the CSP side to store and manage all observability data. For the customer clusters only the Prometheus Agent will be used. This introduces less complexity and resource consumption on the customer workload clusters.

#### Scope of the Observability Architecture

The Observability Cluster and Architecture should be defined such that it can be used to not only observe the Kubernetes Layer of an SCS Stack, but also the IaaS and other Layers.

#### Observing the Observability Infrastructure

-For a productive usage, it needs to be possible to observe the Observability Cluster itself.
+For usage in production, it needs to be possible to observe the Observability Cluster itself.

## Decisions

1. The **Hybrid Approach** was chosen over the Short Term Query Architecture
2. The Observability Stack will be created based on the dNation observability stack
3. The Observability Stack can be used as a standalone component with the Kubernetes Layer. It should be possible to observe other parts of an SCS Stack like the status of the OpenStack components, but this will not be mandatory.
4.
The Observability Stack should be designed such that it is possible to provision two observer clusters side by side, observing each other. Doing this is only a recommendation for production usage.
5. The MVP-0 will consist of the following features:
 - Observability data from KaaS Clusters is scraped
 - K8s cluster that hosts observer deployment is deployed

From 222bb132cdac10084dc30ea2fe89b2261eb21d30 Mon Sep 17 00:00:00 2001
From: Oliver Kautz
Date: Tue, 19 Dec 2023 18:10:13 +0100
Subject: [PATCH 7/8] Improve sections.

- Move Survey results to new references section
- Refactored short-term and hybrid approaches to pull and push based approaches
- More precise Requirements section
- fix of some typos

Signed-off-by: Oliver Kautz
---
 ...cs-0403-v1-csp-kaas-observability-stack.md | 128 +++++++++++-------
 1 file changed, 78 insertions(+), 50 deletions(-)

diff --git a/Standards/scs-0403-v1-csp-kaas-observability-stack.md b/Standards/scs-0403-v1-csp-kaas-observability-stack.md
index 802971ca5..dc29b2589 100644
--- a/Standards/scs-0403-v1-csp-kaas-observability-stack.md
+++ b/Standards/scs-0403-v1-csp-kaas-observability-stack.md
@@ -21,80 +21,63 @@ Currently, only the IaaS Layer of the SCS Reference Implementation has an Observ

## Requirements

-A survey was conducted to gather the needs and requirements of a CSP when providing Kubernetes as a Service. The results of the Survey (Questions with answers) were the following:
-
-1. What is your understanding of a managed Kubernetes Offering:
- - Hassle-Free Installation and Maintainance (customer viewpoint); Providing Controlplane and worker nodes and responsibility for correct function but agnostic to workload
- - Day0, 1 and 2 (~planning, provisioning, operations) full lifecyle management or let customer manages some parts of that, depending on customer contract
-
-2.
What Type and Depth of observability is needed
- - CPU, RAM, HDD and Network usage, Health and Function of Cluster Nodes, Controlplane and if desired Customer Workload
-
-3. Do you have an observabiltiy infrastructure, if yes, how it is built
- - Grafana/Thanos/Prometheus/Loki/Promtail/Alertmanger Stack, i.e. [Example Infrastructure](https://raw.githubusercontent.com/dNationCloud/kubernetes-monitoring-stack/main/thanos-deployment-architecture.svg)
+A survey was conducted to gather the needs and requirements of a CSP when providing Kubernetes as a Service. The feedback of the survey led to the following requirements for a Kubernetes as a Service Observability System:
-4. Data Must haves
+- Telemetry Data that MUST be fetched:
  - CPU, RAM, Disk, Network
  - HTTP Connectivity Metrics
  - Control Plane and Pod metrics (States, Ready, etc.)
- - Workload specific metrics
- - Node Stats
- - K8s resources (exporters, kubestate metrics, cadvisor, parts of the kubelet)
- - Ingress controller exporter (http error rate, cert metrics like expiration date)
  - K8s certs metrics
  - Metrics of underlying node
  - Logs of control plane, kubelet and containerd
+- Telemetry Data that MAY be fetched:
+ - K8s resources (exporters, kubestate metrics, cadvisor, parts of the kubelet)
+ - Ingress controller exporter (http error rate, cert metrics like expiration date)
+- Telemetry Data that SHOULD NOT be fetched:
+ - Any metrics or logs a CSP does not need in order to provide support with respect to their SLA with a Customer.
+- Telemetry Data that MUST NOT be fetched:
+ - Secrets
+ - Customer Specific Workload Metrics
+- The Alerting Mechanism MUST include a default ruleset
+- The Observability Stack MUST run on the CSP Infrastructure
+- The Observability Stack MUST be Highly Available
+- The Observability Stack MUST be able to observe itself
+- The software used on observed Clusters to provide telemetry data for the Observability Stack SHOULD have a low resource impact
-5.
Must Not haves
- - Secrets, otherwise as much as possible for anomaly detection over long time data
-
-6. Must have Alerts
- - Dependent on SLAs and SLA Types, highly individual
- - Use of [kubernetes-mixin alerts](https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/alerts) and [dNation Alerts Ruleset](https://github.com/dNationCloud/kubernetes-monitoring/tree/main/jsonnet/rules)
-
-7. Must NOT Alert on
- - Should not wake people, nothing that does not lead to Action items
-
-8. Observability from Within Or Outside KaaS. How does the architecture look like?
- - Monitoring Infra on CSP Side
- - Data from Customer Clusters and Mon Infra on CSP and KaaS, get both data. KaaS Monitoring can also be used by customer
-
-9. Special Constraints
- - HA Setup in different Clusters on Different Sites
-
-## Design Considerations
-
-As the software components involved for the Observability solution were clear as Prometheus, Thanos, Loki, Promtail, Grafana, and Alertmanager are the industry standard tools to implement a observability solution in a Cloud-Native fashion on Kubernetes.
-
-The important question is how those tools are utilized and combined into an architecture that provides the needs stated in the requirements above. As dNation has a comprehensive observability system using the aforementioned tools as well as a set of dynamic dashboards, we decided in the SIG Monitoring and Team Ops meetings to leverage their toolbox, as we did not want to start from a green field and the tooling is already available as Open Source Software. Their observability stack is mainly used to observe several customer clusters and customer applications, which covers all observability data that is needed. Typically, those clusters are set up for the customer beforehand and the observability tools are installed for customers manually.
+### Options considered

-For use of a CSP that provides Kubernetes as a Service, the provisioning of the observability tools and the onboarding of a customer cluster need to be fully automated. For a customer, all the tools on their Kubernetes cluster need to be installed at creation time and the observability data of that cluster needs to be present in the Observer Cluster immediately.

+#### Use of the dNation Observability Stack as a base

-### Options considered

+The [dNation monitoring stack](https://github.com/dNationCloud/kubernetes-monitoring) offers a lot of the basic capabilities needed for an observability stack for Kubernetes, like Prometheus Operator, Grafana, Alertmanager, Loki, Promtail and Thanos.

-#### Short Term Query Architecture

+#### Pull-based Architecture

-In this setup, each customer cluster has Thanos and Prometheus installed in addition to Thanos and Prometheus on the Observer Cluster. The customer cluster's Thanos installation is used for short-term queries; for long-term queries, the data of all Thanos instances is stored in an external Object Store of the CSP.
+Each customer cluster has Thanos and Prometheus installed in addition to Thanos and Prometheus on the Observer Cluster. Metrics of a customer cluster are pulled from Thanos (Customer Cluster) for short-term queries; for long-term queries, the data of all Thanos instances is stored in an external Object Store of the CSP.

-#### Hybrid Approach (query for short term metrics & remote write of metrics for KaaS)

+#### Push-based Architecture

-Here, Thanos and Prometheus are only used on the CSP side to store and manage all observability data. For the customer clusters only the Prometheus Agent will be used. This introduces less complexity and resource consumption on the customer workload clusters.
+Here, Thanos and Prometheus are only used on the CSP side to store and manage all observability data. For the customer clusters, only the Prometheus Agent will be used.
The Prometheus Agent will push all metrics of a Customer Cluster to the central Thanos instance, where they are preserved in an external Object Store. This introduces less complexity and resource consumption on the customer workload clusters.

#### Scope of the Observability Architecture

-The Observability Cluster and Archtiecture should be defined such that it can be used to not only observe the Kubernetes Layer of an SCS Stack, but also the IaaS and other Layers.
+The Observability Cluster and Architecture SHOULD be defined in a modular way so that it can be used to not only observe the Kubernetes Layer of an SCS Stack, but every aspect of an SCS Stack.

#### Observing the Observability Infrastructure

For usage in production, it needs to be possible to observe the Observability Cluster itself.

+#### Alerting Rulesets
+
+Use a mix of [kubernetes-mixin alerts](https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/alerts) and [dNation Alerts Ruleset](https://github.com/dNationCloud/kubernetes-monitoring/tree/main/jsonnet/rules), as they offer an extensive and well-reviewed set of default Alerts covering the important parts of a Kubernetes Deployment (Nodes, Controlplane, K8s Resources, etc.).
+
## Decisions

-1. The **Hybrid Approach** was chosen over Short Term Query Architecture
-2. The Observability Stack will be created based on the dNation observability stack
-3. The Observability Stack can be used as a standalone component for use with the Kubernetes Layer. It should be possible to observe other parts of an SCS Stack like the status of the OpenStack components, but this will not be mandatory.
-4. The Observability Stack should be designed such that it is possible to provision two observer clusters side by side, observing each other. Doing this is only a recommendation for production usage.
-5. The MVP-0 will consist of the following features:
+1. Base the MVP-0 Implementation on the dNation Kubernetes Monitoring Stack.
+2.
The **Push-based** Architecture was chosen over the Pull-based Approach.
+3. The Observability Stack will be created based on the dNation observability stack
+4. The Observability Stack can be used as a standalone component for use with the Kubernetes Layer. It should be possible to observe other parts of an SCS Stack like the status of the OpenStack components, but this will not be mandatory.
+5. The Observability Stack should be designed such that it is possible to provision two observer clusters side by side, observing each other. Doing this is only a recommendation for production usage.
+6. The MVP-0 will consist of the following features:
 - Observability data from KaaS Clusters is scraped
 - K8s cluster that hosts observer deployment is deployed
 - S3 compatible bucket as a storage for long term metrics is configured
@@ -126,3 +109,48 @@ For usage in production, it needs to be possible to observe the Observability Cl
 - implement an option to deploy thanos sidecar with some simple config in OSISM testbed
 - There exist Dashboards for Harbor Registry Health
 - Alerts are defined on the Harbor Registry metrics
+
+## Reference
+
+### Outcome of the CSP Survey about Requirements for KaaS Observability
+
+A survey was conducted to gather the needs and requirements of a CSP when providing Kubernetes as a Service. The results of the Survey (Questions with answers) were the following:
+
+1. What is your understanding of a managed Kubernetes Offering:
+ - Hassle-Free Installation and Maintenance (customer viewpoint); Providing Controlplane and worker nodes and responsibility for correct function but agnostic to workload
+ - Day0, 1 and 2 (~planning, provisioning, operations) full lifecycle management or let the customer manage some parts of that, depending on customer contract
+
+2. What Type and Depth of observability is needed
+ - CPU, RAM, HDD and Network usage, Health and Function of Cluster Nodes, Controlplane and if desired Customer Workload
+
+3.
Do you have an observability infrastructure, if yes, how is it built
+ - Grafana/Thanos/Prometheus/Loki/Promtail/Alertmanager Stack, i.e. [Example Infrastructure](https://raw.githubusercontent.com/dNationCloud/kubernetes-monitoring-stack/main/thanos-deployment-architecture.svg)
+
+4. Data Must haves
+ - CPU, RAM, Disk, Network
+ - HTTP Connectivity Metrics
+ - Control Plane and Pod metrics (States, Ready, etc.)
+ - Workload specific metrics
+ - Node Stats
+ - K8s resources (exporters, kubestate metrics, cadvisor, parts of the kubelet)
+ - Ingress controller exporter (http error rate, cert metrics like expiration date)
+ - K8s certs metrics
+ - Metrics of underlying node
+ - Logs of control plane, kubelet and containerd
+
+5. Must Not haves
+ - Secrets; otherwise, as much as possible for anomaly detection over long-term data
+
+6. Must have Alerts
+ - Dependent on SLAs and SLA Types, highly individual
+ - Use of [kubernetes-mixin alerts](https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/alerts) and [dNation Alerts Ruleset](https://github.com/dNationCloud/kubernetes-monitoring/tree/main/jsonnet/rules)
+
+7. Must NOT Alert on
+ - Should not wake people; nothing that does not lead to Action items
+
+8. Observability from Within Or Outside KaaS. What does the architecture look like?
+ - Monitoring Infra on CSP Side
+ - Data from Customer Clusters and Mon Infra on CSP and KaaS, get both data. KaaS Monitoring can also be used by customer
+
+9.
Special Constraints
+ - HA Setup in different Clusters on Different Sites

From d4ca5784802e5068e6bf8e9b6bf0a3bbbae98881 Mon Sep 17 00:00:00 2001
From: Oliver Kautz
Date: Tue, 19 Dec 2023 18:15:37 +0100
Subject: [PATCH 8/8] Fix formatting

Signed-off-by: Oliver Kautz
---
 ...cs-0403-v1-csp-kaas-observability-stack.md | 24 +++++++++----------
 1 file changed, 12 insertions(+), 12 deletions(-)

diff --git a/Standards/scs-0403-v1-csp-kaas-observability-stack.md b/Standards/scs-0403-v1-csp-kaas-observability-stack.md
index dc29b2589..f8d0d3523 100644
--- a/Standards/scs-0403-v1-csp-kaas-observability-stack.md
+++ b/Standards/scs-0403-v1-csp-kaas-observability-stack.md
@@ -24,21 +24,21 @@ Currently, only the IaaS Layer of the SCS Reference Implementation has an Observ

A survey was conducted to gather the needs and requirements of a CSP when providing Kubernetes as a Service. The feedback of the survey led to the following requirements for a Kubernetes as a Service Observability System:

- Telemetry Data that MUST be fetched:
- - CPU, RAM, Disk, Network
- - HTTP Connectivity Metrics
- - Control Plane and Pod metrics (States, Ready, etc.)
- - K8s certs metrics
- - Metrics of underlying node
- - Logs of control plane, kubelet and containerd
+ - CPU, RAM, Disk, Network
+ - HTTP Connectivity Metrics
+ - Control Plane and Pod metrics (States, Ready, etc.)
+ - K8s certs metrics
+ - Metrics of underlying node
+ - Logs of control plane, kubelet and containerd
- Telemetry Data that MAY be fetched:
- - K8s resources (exporters, kubestate metrics, cadvisor, parts of the kubelet)
- - Ingress controller exporter (http error rate, cert metrics like expiration date)
+ - K8s resources (exporters, kubestate metrics, cadvisor, parts of the kubelet)
+ - Ingress controller exporter (http error rate, cert metrics like expiration date)
- Telemetry Data that SHOULD NOT be fetched:
- - Any metrics or logs a CSP does not need in order to provide support with respect to their SLA with a Customer.
+ - Any metrics or logs a CSP does not need in order to provide support with respect to their SLA with a Customer.
- Telemetry Data that MUST NOT be fetched:
- - Secrets
- - Customer Specific Workload Metrics
-- The Alerting Mechanism MUST include a default ruleset
+ - Secrets
+ - Customer Specific Workload Metrics
+- The Alerting Mechanism MUST include a default ruleset
- The Observability Stack MUST run on the CSP Infrastructure
- The Observability Stack MUST be Highly Available
- The Observability Stack MUST be able to observe itself
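The push-based architecture adopted in the patches above boils down to a Prometheus Agent on each customer cluster forwarding its samples to a central Thanos Receive endpoint on the CSP side. A minimal sketch of such an agent configuration follows; the endpoint URL, cluster label, and scrape target are hypothetical placeholders, not values prescribed by this decision record:

```yaml
# prometheus.yml for a customer cluster, run in agent mode:
#   prometheus --enable-feature=agent --config.file=prometheus.yml
global:
  external_labels:
    cluster: customer-cluster-01   # placeholder: identifies the source cluster in Thanos

scrape_configs:
  - job_name: node-exporter        # node CPU, RAM, disk and network metrics
    static_configs:
      - targets: ["localhost:9100"]

remote_write:
  # Push every scraped series to the CSP's central Thanos Receive instance.
  - url: https://thanos-receive.observer.example.com/api/v1/receive
```

In agent mode, Prometheus keeps no local queryable TSDB, which is what keeps the resource footprint on the customer cluster low, as required above.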
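The default alerting ruleset required above could be expressed in the standard Prometheus rule format that both the kubernetes-mixin and dNation rulesets use. The rule below is an illustrative sketch only; the rule name, expression, and threshold are assumptions, not taken from the referenced rulesets:

```yaml
groups:
  - name: kaas-default.rules
    rules:
      - alert: KubeNodeNotReady    # example rule; actual rules come from the chosen rulesets
        expr: kube_node_status_condition{condition="Ready",status="true"} == 0
        for: 10m                   # avoid paging on short blips ("should not wake people")
        labels:
          severity: critical
        annotations:
          summary: "Node {{ $labels.node }} has been NotReady for more than 10 minutes."
```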
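Long-term metric storage in an S3-compatible bucket, as listed in the MVP-0 features, is handled in Thanos through an object storage configuration. A sketch follows; the bucket name, endpoint, and credential placeholders are assumptions to be replaced with the CSP's real values:

```yaml
# objstore.yml, passed to Thanos components via --objstore.config-file=objstore.yml
type: S3
config:
  bucket: kaas-long-term-metrics   # placeholder bucket name
  endpoint: s3.csp.example.com     # placeholder S3-compatible endpoint
  access_key: REPLACE_ME           # supply real credentials via your secret management
  secret_key: REPLACE_ME
```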