Implemented Prometheus Rule for automated alerts (#193)
feat(cluster): Prometheus Rule for automated alerts + runbooks for a basic set of alerts

* Renamed: `cluster.monitoring.enablePodMonitor` to `cluster.monitoring.podMonitor.enabled`
* New configuration option: `cluster.monitoring.prometheusRule.enabled` defaults to `true`
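
As a rough illustration of the two options named above, a values file for the `cluster` chart might look like the sketch below. The key names come from this commit; setting `monitoring.enabled` to `true` is an assumption made for the example (the chart's README lists `false` as its default).

```yaml
# Sketch of a values file for the cluster chart, using the key names from this commit.
cluster:
  monitoring:
    enabled: true        # assumption: switched on for this example; README default is false
    podMonitor:
      enabled: true      # replaces the former cluster.monitoring.enablePodMonitor
    prometheusRule:
      enabled: true      # new option, defaults to true
```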

Signed-off-by: Itay Grudev <itay.grudev@essentim.com>
Signed-off-by: Gabriele Bartolini <gabriele.bartolini@enterprisedb.com>
Co-authored-by: Gabriele Bartolini <gabriele.bartolini@enterprisedb.com>
itay-grudev and gbartolini authored Mar 1, 2024
1 parent 001d787 commit b2088c4
Showing 19 changed files with 908 additions and 33 deletions.
17 changes: 7 additions & 10 deletions Makefile
@@ -12,15 +12,12 @@ docs: ## Generate charts' docs using helm-docs
(echo "Please, install https://github.com/norwoodj/helm-docs first" && exit 1)

.PHONY: schema
-schema: ## Generate charts' schema usign helm schema-gen plugin
-	@helm schema-gen charts/cloudnative-pg/values.yaml > charts/cloudnative-pg/values.schema.json || \
-	(echo "Please, run: helm plugin install https://github.com/karuppiah7890/helm-schema-gen.git" && exit 1)
+schema: cloudnative-pg-schema cluster-schema ## Generate charts' schema using helm-schema-gen

-.PHONY: pgbench-deploy
-pgbench-deploy: ## Installs pgbench chart
-	helm dependency update charts/pgbench
-	helm upgrade --install pgbench --atomic charts/pgbench
+cloudnative-pg-schema:
+	@helm schema-gen charts/cloudnative-pg/values.yaml | cat > charts/cloudnative-pg/values.schema.json || \
+	(echo "Please, run: helm plugin install https://github.com/karuppiah7890/helm-schema-gen.git" && exit 1)

-.PHONY: pgbench-uninstall
-pgbench-uninstall: ## Uninstalls cnpg-pgbench chart if present
-	@helm uninstall pgbench
+cluster-schema:
+	@helm schema-gen charts/cluster/values.yaml | cat > charts/cluster/values.schema.json || \
+	(echo "Please, run: helm plugin install https://github.com/karuppiah7890/helm-schema-gen.git" && exit 1)
12 changes: 7 additions & 5 deletions charts/cluster/README.md
@@ -88,9 +88,9 @@ Additionally you can specify the following parameters:
```yaml
backups:
  scheduledBackups:
    - name: daily-backup
      schedule: "0 0 0 * * *" # Daily at midnight
      backupOwnerReference: self
```
Each backup adapter takes its own set of parameters, listed in the [Configuration options](#Configuration-options) section
@@ -149,8 +149,10 @@ refer to the [CloudNativePG Documentation](https://cloudnative-pg.io/documentat
| cluster.instances | int | `3` | Number of instances |
| cluster.logLevel | string | `"info"` | The instances' log level, one of the following values: error, warning, info (default), debug, trace |
| cluster.monitoring.customQueries | list | `[]` | |
-| cluster.monitoring.enablePodMonitor | bool | `false` | |
-| cluster.postgresql | string | `nil` | Configuration of the PostgreSQL server See: https://cloudnative-pg.io/documentation/current/cloudnative-pg.v1/#postgresql-cnpg-io-v1-PostgresConfiguration |
+| cluster.monitoring.enabled | bool | `false` | |
+| cluster.monitoring.podMonitor.enabled | bool | `true` | |
+| cluster.monitoring.prometheusRule.enabled | bool | `true` | |
+| cluster.postgresql | object | `{}` | Configuration of the PostgreSQL server See: https://cloudnative-pg.io/documentation/current/cloudnative-pg.v1/#postgresql-cnpg-io-v1-PostgresConfiguration |
| cluster.primaryUpdateMethod | string | `"switchover"` | Method to follow to upgrade the primary server during a rolling update procedure, after all replicas have been successfully updated. It can be switchover (default) or in-place (restart). |
| cluster.primaryUpdateStrategy | string | `"unsupervised"` | Strategy to follow to upgrade the primary server during a rolling update procedure, after all replicas have been successfully updated: it can be automated (unsupervised - default) or manual (supervised) |
| cluster.priorityClassName | string | `""` | |
49 changes: 49 additions & 0 deletions charts/cluster/docs/runbooks/CNPGClusterHACritical.md
@@ -0,0 +1,49 @@
CNPGClusterHACritical
=====================

Meaning
-------

The `CNPGClusterHACritical` alert is triggered when the CloudNativePG cluster has no ready standby replicas.

This can happen during either a normal failover or automated minor version upgrades in a cluster with two or fewer
instances. The replaced instance may need some time to catch up with the cluster primary instance.

This alert will always be triggered if your cluster is configured to run with only 1 instance. In this case you
may want to silence it.
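
One way to silence it, assuming alerts are routed through Alertmanager, is a dedicated route that sends this alert to a receiver with no notification integrations. The receiver names and the routing tree below are assumptions for this sketch; adapt them to your own Alertmanager configuration.

```yaml
# Fragment of an Alertmanager configuration (hypothetical routing tree).
route:
  receiver: default
  routes:
    - receiver: "null"                              # receiver without notification integrations
      matchers:
        - 'alertname = "CNPGClusterHACritical"'
receivers:
  - name: default                                   # placeholder for your real receiver
  - name: "null"
```

As written, the route drops the alert for every cluster. Narrowing it to a single cluster would need an extra matcher on whichever labels your alerts carry.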

Impact
------

Having no available replicas puts your cluster at severe risk if the primary instance fails. The primary instance is
still online and able to serve queries, although connections to the `-ro` endpoint will fail.

Diagnosis
---------

Use the [CloudNativePG Grafana Dashboard](https://grafana.com/grafana/dashboards/20417-cloudnativepg/).

Get the status of the CloudNativePG cluster instances:

```bash
kubectl get pods -A -l "cnpg.io/podRole=instance" -o wide
```

Check the logs of the affected CloudNativePG instances:

```bash
kubectl logs --namespace <namespace> pod/<instance-pod-name>
```

Check the CloudNativePG operator logs:

```bash
kubectl logs --namespace cnpg-system -l "app.kubernetes.io/name=cloudnative-pg"
```

Mitigation
----------

Refer to the [CloudNativePG Failure Modes](https://cloudnative-pg.io/documentation/current/failure_modes/)
and [CloudNativePG Troubleshooting](https://cloudnative-pg.io/documentation/current/troubleshooting/) documentation for
more information on how to troubleshoot and mitigate this issue.
51 changes: 51 additions & 0 deletions charts/cluster/docs/runbooks/CNPGClusterHAWarning.md
@@ -0,0 +1,51 @@
CNPGClusterHAWarning
====================

Meaning
-------

The `CNPGClusterHAWarning` alert is triggered when the CloudNativePG cluster has fewer than `2` ready standby replicas.

This alert will always be triggered if your cluster is configured to run with fewer than `3` instances. In this case you
may want to silence it.
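
If the alert should stay meaningful rather than be silenced, the cluster needs at least three instances (one primary and two standbys). A minimal values sketch, using the `cluster.instances` option from the chart's configuration table, where `3` is already the default:

```yaml
# Sketch of a values file for the cluster chart.
cluster:
  instances: 3   # one primary plus two standby replicas
```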

Impact
------

Having fewer than two available replicas puts your cluster at risk if another instance fails. The cluster is still able
to operate normally, although the `-ro` and `-r` endpoints operate at reduced capacity.

This can happen during a normal failover or automated minor version upgrades. The replaced instance may need some time
to catch up with the cluster primary instance, which will trigger the alert if the operation takes more than 5 minutes.

At `0` available ready replicas, a `CNPGClusterHACritical` alert will be triggered.

Diagnosis
---------

Use the [CloudNativePG Grafana Dashboard](https://grafana.com/grafana/dashboards/20417-cloudnativepg/).

Get the status of the CloudNativePG cluster instances:

```bash
kubectl get pods -A -l "cnpg.io/podRole=instance" -o wide
```

Check the logs of the affected CloudNativePG instances:

```bash
kubectl logs --namespace <namespace> pod/<instance-pod-name>
```

Check the CloudNativePG operator logs:

```bash
kubectl logs --namespace cnpg-system -l "app.kubernetes.io/name=cloudnative-pg"
```

Mitigation
----------

Refer to the [CloudNativePG Failure Modes](https://cloudnative-pg.io/documentation/current/failure_modes/)
and [CloudNativePG Troubleshooting](https://cloudnative-pg.io/documentation/current/troubleshooting/) documentation for
more information on how to troubleshoot and mitigate this issue.
24 changes: 24 additions & 0 deletions charts/cluster/docs/runbooks/CNPGClusterHighConnectionsCritical.md
@@ -0,0 +1,24 @@
CNPGClusterHighConnectionsCritical
==================================

Meaning
-------

This alert is triggered when the number of connections to the CloudNativePG cluster instance exceeds 95% of its capacity.

Impact
------

At 100% capacity, the CloudNativePG cluster instance will not be able to accept new connections. This will result in a service
disruption.

Diagnosis
---------

Use the [CloudNativePG Grafana Dashboard](https://grafana.com/grafana/dashboards/20417-cloudnativepg/).

Mitigation
----------

* Increase the maximum number of connections by increasing the `max_connections` PostgreSQL parameter.
* Use connection pooling by enabling PgBouncer to reduce the number of connections to the database.
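
As a sketch of the first option, `max_connections` is set among the PostgreSQL parameters of the CloudNativePG `Cluster` resource. The cluster name and the value `200` are illustrative assumptions, and the fragment omits the other fields a complete resource needs. The chart exposes PostgreSQL configuration through its `cluster.postgresql` value.

```yaml
# Fragment of a CloudNativePG Cluster resource; name and value are illustrative.
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: cluster-example        # hypothetical cluster name
spec:
  postgresql:
    parameters:
      max_connections: "200"   # arbitrary example value; size it to the available memory
```

PostgreSQL only reads `max_connections` at server start, so expect the instances to be restarted when this value changes.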
24 changes: 24 additions & 0 deletions charts/cluster/docs/runbooks/CNPGClusterHighConnectionsWarning.md
@@ -0,0 +1,24 @@
CNPGClusterHighConnectionsWarning
=================================

Meaning
-------

This alert is triggered when the number of connections to the CloudNativePG cluster instance exceeds 85% of its capacity.

Impact
------

At 100% capacity, the CloudNativePG cluster instance will not be able to accept new connections. This will result in a service
disruption.

Diagnosis
---------

Use the [CloudNativePG Grafana Dashboard](https://grafana.com/grafana/dashboards/20417-cloudnativepg/).

Mitigation
----------

* Increase the maximum number of connections by increasing the `max_connections` PostgreSQL parameter.
* Use connection pooling by enabling PgBouncer to reduce the number of connections to the database.
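
As a sketch of the pooling option, CloudNativePG provides a `Pooler` resource that deploys PgBouncer in front of a cluster. All names and numbers below are placeholders chosen for illustration, not recommendations.

```yaml
# Illustrative Pooler resource; names and sizes are assumptions.
apiVersion: postgresql.cnpg.io/v1
kind: Pooler
metadata:
  name: pooler-example-rw      # hypothetical pooler name
spec:
  cluster:
    name: cluster-example      # hypothetical; must match the name of your Cluster
  instances: 2                 # number of PgBouncer pods
  type: rw                     # pool connections to the primary
  pgbouncer:
    poolMode: session
    parameters:
      max_client_conn: "1000"  # client connections PgBouncer accepts
      default_pool_size: "10"  # server connections per user/database pair
```

Applications then connect to the pooler's service instead of the cluster's `-rw` service, so many client connections share a small number of PostgreSQL connections.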
31 changes: 31 additions & 0 deletions charts/cluster/docs/runbooks/CNPGClusterHighReplicationLag.md
@@ -0,0 +1,31 @@
CNPGClusterHighReplicationLag
=============================

Meaning
-------

This alert is triggered when the replication lag of the CloudNativePG cluster exceeds `1s`.

Impact
------

High replication lag can cause the cluster replicas to fall out of sync. Queries to the `-r` and `-ro` endpoints may return stale data.
In the event of a failover, data written during the lag period may be lost.

Diagnosis
---------

Use the [CloudNativePG Grafana Dashboard](https://grafana.com/grafana/dashboards/20417-cloudnativepg/).

High replication lag can be caused by a number of factors, including:

* Network issues
* High load on the primary or replicas
* Long-running queries
* Suboptimal PostgreSQL configuration, in particular a low `max_wal_senders` value.

Inspect the replication status reported by the primary:

```bash
kubectl exec --namespace <namespace> --stdin --tty services/<cluster_name>-rw -- psql -c "SELECT * from pg_stat_replication;"
```

Mitigation
----------
28 changes: 28 additions & 0 deletions charts/cluster/docs/runbooks/CNPGClusterInstancesOnSameNode.md
@@ -0,0 +1,28 @@
CNPGClusterInstancesOnSameNode
==============================

Meaning
-------

The `CNPGClusterInstancesOnSameNode` alert is raised when two or more database pods are scheduled on the same node.

Impact
------

A failure or scheduled downtime of a single node will lead to a potential service disruption and/or data loss.

Diagnosis
---------

Use the [CloudNativePG Grafana Dashboard](https://grafana.com/grafana/dashboards/20417-cloudnativepg/).

```bash
kubectl get pods -A -l "cnpg.io/podRole=instance" -o wide
```

Mitigation
----------

1. Verify that more than one node is free of taints that would prevent pods from being scheduled there.
2. Verify your [affinity](https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/) configuration.
3. For more information, please refer to the ["Scheduling"](https://cloudnative-pg.io/documentation/current/scheduling/) section in the documentation.
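
For step 2, the anti-affinity settings live in the `affinity` section of the CloudNativePG `Cluster` resource. The fragment below is a sketch with a hypothetical cluster name, and it omits the other fields a complete resource needs.

```yaml
# Fragment of a CloudNativePG Cluster resource; the name is illustrative.
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: cluster-example                   # hypothetical cluster name
spec:
  affinity:
    enablePodAntiAffinity: true
    topologyKey: kubernetes.io/hostname   # spread instances across nodes
    podAntiAffinityType: required         # "preferred" is the softer alternative
```

With `required`, an instance stays pending when no suitable node is free, so this setting only helps if the cluster has at least as many schedulable nodes as instances.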
31 changes: 31 additions & 0 deletions charts/cluster/docs/runbooks/CNPGClusterLowDiskSpaceCritical.md
@@ -0,0 +1,31 @@
CNPGClusterLowDiskSpaceCritical
===============================

Meaning
-------

This alert is triggered when disk space usage on the CloudNativePG cluster exceeds 90%. It can be triggered by any of the following:

* the PVC hosting the `PGDATA` (`storage` section)
* the PVC hosting WAL files (`walStorage` section), where applicable
* any PVC hosting a tablespace (`tablespaces` section)
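
For orientation, the three sections named above sit in the `Cluster` resource roughly as follows. The sizes and the tablespace name are placeholders, and the fragment omits the other fields a complete resource needs.

```yaml
# Fragment of a CloudNativePG Cluster spec; all sizes and names are illustrative.
spec:
  storage:
    size: 20Gi            # PVC hosting PGDATA
  walStorage:
    size: 5Gi             # optional dedicated PVC for WAL files
  tablespaces:
    - name: example_tbs   # hypothetical tablespace name
      storage:
        size: 10Gi        # PVC hosting this tablespace
```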

Impact
------

Excessive disk space usage can lead to fragmentation, negatively impacting performance. Reaching 100% disk usage will result
in downtime and data loss.

Diagnosis
---------

Use the [CloudNativePG Grafana Dashboard](https://grafana.com/grafana/dashboards/20417-cloudnativepg/).

Mitigation
----------

If you experience issues with the WAL (Write-Ahead Logging) volume and have
set up continuous archiving, ensure that WAL archiving is functioning
correctly. This is crucial to avoid a buildup of WAL files in the `pg_wal`
folder. Monitor the `cnpg_collector_pg_wal_archive_status` metric, specifically
ensuring that the number of `ready` files does not increase linearly.
31 changes: 31 additions & 0 deletions charts/cluster/docs/runbooks/CNPGClusterLowDiskSpaceWarning.md
@@ -0,0 +1,31 @@
CNPGClusterLowDiskSpaceWarning
==============================

Meaning
-------

This alert is triggered when disk space usage on the CloudNativePG cluster exceeds 90%. It can be triggered by any of the following:

* the PVC hosting the `PGDATA` (`storage` section)
* the PVC hosting WAL files (`walStorage` section), where applicable
* any PVC hosting a tablespace (`tablespaces` section)

Impact
------

Excessive disk space usage can lead to fragmentation, negatively impacting performance. Reaching 100% disk usage will result
in downtime and data loss.

Diagnosis
---------

Use the [CloudNativePG Grafana Dashboard](https://grafana.com/grafana/dashboards/20417-cloudnativepg/).

Mitigation
----------

If you experience issues with the WAL (Write-Ahead Logging) volume and have
set up continuous archiving, ensure that WAL archiving is functioning
correctly. This is crucial to avoid a buildup of WAL files in the `pg_wal`
folder. Monitor the `cnpg_collector_pg_wal_archive_status` metric, specifically
ensuring that the number of `ready` files does not increase linearly.
43 changes: 43 additions & 0 deletions charts/cluster/docs/runbooks/CNPGClusterOffline.md
@@ -0,0 +1,43 @@
CNPGClusterOffline
==================

Meaning
-------

The `CNPGClusterOffline` alert is triggered when there are no ready CloudNativePG instances.

Impact
------

Having an offline cluster means your applications will not be able to access the database, leading to potential service
disruption.

Diagnosis
---------

Use the [CloudNativePG Grafana Dashboard](https://grafana.com/grafana/dashboards/20417-cloudnativepg/).

Get the status of the CloudNativePG cluster instances:

```bash
kubectl get pods -A -l "cnpg.io/podRole=instance" -o wide
```

Check the logs of the affected CloudNativePG instances:

```bash
kubectl logs --namespace <namespace> pod/<instance-pod-name>
```

Check the CloudNativePG operator logs:

```bash
kubectl logs --namespace cnpg-system -l "app.kubernetes.io/name=cloudnative-pg"
```

Mitigation
----------

Refer to the [CloudNativePG Failure Modes](https://cloudnative-pg.io/documentation/current/failure_modes/)
and [CloudNativePG Troubleshooting](https://cloudnative-pg.io/documentation/current/troubleshooting/) documentation for
more information on how to troubleshoot and mitigate this issue.