Implemented Prometheus Rule for automated alerts (#193)
feat(cluster): Prometheus Rule for automated alerts + runbooks for a basic set of alerts

* Renamed: `cluster.monitoring.enablePodMonitor` to `cluster.monitoring.podMonitor.enabled`
* New configuration option: `cluster.monitoring.prometheusRule.enabled`, defaults to `true`

Signed-off-by: Itay Grudev <itay.grudev@essentim.com>
Signed-off-by: Gabriele Bartolini <gabriele.bartolini@enterprisedb.com>
Co-authored-by: Gabriele Bartolini <gabriele.bartolini@enterprisedb.com>
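For reference, a sketch of setting both options via Helm; the release and repository names here are hypothetical, while the two value keys are the ones introduced by this commit:

```bash
# "my-cluster" and "cnpg/cluster" are placeholder release/chart names.
helm upgrade my-cluster cnpg/cluster \
  --set cluster.monitoring.podMonitor.enabled=true \
  --set cluster.monitoring.prometheusRule.enabled=true
```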
1 parent 001d787 · commit b2088c4 · 19 changed files with 908 additions and 33 deletions.
CNPGClusterHACritical
=====================

Meaning
-------

The `CNPGClusterHACritical` alert is triggered when the CloudNativePG cluster has no ready standby replicas.

This can happen during a normal failover or during automated minor version upgrades in a cluster with 2 or fewer
instances. The replaced instance may need some time to catch up with the cluster primary instance.

This alert will always fire if your cluster is configured to run with only 1 instance. In that case you
may want to silence it, for example as sketched below.
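One option is an Alertmanager silence; a sketch, assuming `amtool` is installed and pointed at your Alertmanager:

```bash
# Assumes the rule labels alerts with `alertname`; adjust matchers to your setup.
amtool silence add alertname=CNPGClusterHACritical \
  --comment "Single-instance cluster; no standby expected" \
  --duration 720h
```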
Impact
------

Having no available replicas puts your cluster at severe risk if the primary instance fails. The primary instance is
still online and able to serve queries, although connections to the `-ro` endpoint will fail.

Diagnosis
---------

Use the [CloudNativePG Grafana Dashboard](https://grafana.com/grafana/dashboards/20417-cloudnativepg/).

Get the status of the CloudNativePG cluster instances:

```bash
kubectl get pods -A -l "cnpg.io/podRole=instance" -o wide
```

Check the logs of the affected CloudNativePG instances:

```bash
kubectl logs --namespace <namespace> pod/<instance-pod-name>
```

Check the CloudNativePG operator logs:

```bash
kubectl logs --namespace cnpg-system -l "app.kubernetes.io/name=cloudnative-pg"
```

Mitigation
----------

Refer to the [CloudNativePG Failure Modes](https://cloudnative-pg.io/documentation/current/failure_modes/)
and [CloudNativePG Troubleshooting](https://cloudnative-pg.io/documentation/current/troubleshooting/) documentation for
more information on how to troubleshoot and mitigate this issue.
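If you have the `cnpg` kubectl plugin installed (an assumption of this sketch), its status view gives a quick summary of primary and replica state:

```bash
# <cluster_name> and <namespace> are placeholders.
kubectl cnpg status <cluster_name> -n <namespace>
```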
CNPGClusterHAWarning
====================

Meaning
-------

The `CNPGClusterHAWarning` alert is triggered when the CloudNativePG cluster has fewer than `2` ready standby replicas.

This alert will always fire if your cluster is configured to run with fewer than `3` instances. In that case you
may want to silence it.

Impact
------

Having fewer than two available replicas puts your cluster at risk if another instance fails. The cluster is still able
to operate normally, although the `-ro` and `-r` endpoints operate at reduced capacity.

This can happen during a normal failover or automated minor version upgrades. The replaced instance may need some time
to catch up with the cluster primary instance, which will trigger the alert if the operation takes more than 5 minutes.
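To watch a rejoining replica catch up, you can query `pg_stat_replication` on the primary; a sketch with placeholder names:

```bash
# replay_lag requires PostgreSQL 10 or later.
kubectl exec --namespace <namespace> --stdin --tty services/<cluster_name>-rw -- \
  psql -c "SELECT application_name, state, replay_lag FROM pg_stat_replication;"
```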
At `0` available ready replicas, a `CNPGClusterHACritical` alert will be triggered.

Diagnosis
---------

Use the [CloudNativePG Grafana Dashboard](https://grafana.com/grafana/dashboards/20417-cloudnativepg/).

Get the status of the CloudNativePG cluster instances:

```bash
kubectl get pods -A -l "cnpg.io/podRole=instance" -o wide
```

Check the logs of the affected CloudNativePG instances:

```bash
kubectl logs --namespace <namespace> pod/<instance-pod-name>
```

Check the CloudNativePG operator logs:

```bash
kubectl logs --namespace cnpg-system -l "app.kubernetes.io/name=cloudnative-pg"
```

Mitigation
----------

Refer to the [CloudNativePG Failure Modes](https://cloudnative-pg.io/documentation/current/failure_modes/)
and [CloudNativePG Troubleshooting](https://cloudnative-pg.io/documentation/current/troubleshooting/) documentation for
more information on how to troubleshoot and mitigate this issue.
charts/cluster/docs/runbooks/CNPGClusterHighConnectionsCritical.md — 24 additions, 0 deletions
CNPGClusterHighConnectionsCritical
==================================

Meaning
-------

This alert is triggered when the number of connections to the CloudNativePG cluster instance exceeds 95% of its capacity.

Impact
------

At 100% capacity, the CloudNativePG cluster instance will not be able to accept new connections. This will result in a
service disruption.

Diagnosis
---------

Use the [CloudNativePG Grafana Dashboard](https://grafana.com/grafana/dashboards/20417-cloudnativepg/).
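To compare the current connection count against the configured limit directly, a sketch with placeholder names:

```bash
# Counts all backends, including superuser and replication connections.
kubectl exec --namespace <namespace> --stdin --tty services/<cluster_name>-rw -- \
  psql -c "SELECT count(*) FROM pg_stat_activity;" -c "SHOW max_connections;"
```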
Mitigation
----------

* Increase the maximum number of connections by raising the `max_connections` PostgreSQL parameter.
* Use connection pooling by enabling PgBouncer to reduce the number of connections to the database.
charts/cluster/docs/runbooks/CNPGClusterHighConnectionsWarning.md — 24 additions, 0 deletions
CNPGClusterHighConnectionsWarning
=================================

Meaning
-------

This alert is triggered when the number of connections to the CloudNativePG cluster instance exceeds 85% of its capacity.

Impact
------

At 100% capacity, the CloudNativePG cluster instance will not be able to accept new connections. This will result in a
service disruption.

Diagnosis
---------

Use the [CloudNativePG Grafana Dashboard](https://grafana.com/grafana/dashboards/20417-cloudnativepg/).

Mitigation
----------

* Increase the maximum number of connections by raising the `max_connections` PostgreSQL parameter (see the sketch below).
* Use connection pooling by enabling PgBouncer to reduce the number of connections to the database.
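A sketch of the first option via a patch to the `Cluster` resource; the names and the target value are placeholders, and CloudNativePG applies changes made under `spec.postgresql.parameters`, restarting instances where the parameter requires it:

```bash
kubectl patch clusters.postgresql.cnpg.io <cluster_name> --namespace <namespace> \
  --type merge -p '{"spec":{"postgresql":{"parameters":{"max_connections":"200"}}}}'
```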
charts/cluster/docs/runbooks/CNPGClusterHighReplicationLag.md — 31 additions, 0 deletions
CNPGClusterHighReplicationLag
=============================

Meaning
-------

This alert is triggered when the replication lag of the CloudNativePG cluster exceeds `1s`.

Impact
------

High replication lag can cause the cluster replicas to fall out of sync. Queries to the `-r` and `-ro` endpoints may return stale data.
In the event of a failover, there may be data loss for the time period of the lag.
Diagnosis
---------

Use the [CloudNativePG Grafana Dashboard](https://grafana.com/grafana/dashboards/20417-cloudnativepg/).

High replication lag can be caused by a number of factors, including:

* Network issues
* High load on the primary or replicas
* Long-running queries
* Suboptimal PostgreSQL configuration, in particular a low `max_wal_senders` setting

Check the replication status on the primary:

```bash
kubectl exec --namespace <namespace> --stdin --tty services/<cluster_name>-rw -- psql -c "SELECT * FROM pg_stat_replication;"
```
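You can also measure apparent lag from a replica's point of view; a sketch with placeholder names (`pg_last_xact_replay_timestamp()` returns NULL on the primary):

```bash
kubectl exec --namespace <namespace> --stdin --tty services/<cluster_name>-ro -- \
  psql -c "SELECT now() - pg_last_xact_replay_timestamp() AS replay_delay;"
```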
Mitigation
----------
charts/cluster/docs/runbooks/CNPGClusterInstancesOnSameNode.md — 28 additions, 0 deletions
CNPGClusterInstancesOnSameNode
==============================

Meaning
-------

The `CNPGClusterInstancesOnSameNode` alert is raised when two or more database pods are scheduled on the same node.

Impact
------

A failure or scheduled downtime of a single node will lead to a potential service disruption and/or data loss.

Diagnosis
---------

Use the [CloudNativePG Grafana Dashboard](https://grafana.com/grafana/dashboards/20417-cloudnativepg/).

List the instance pods and the nodes they are scheduled on:

```bash
kubectl get pods -A -l "cnpg.io/podRole=instance" -o wide
```
Mitigation
----------

1. Verify that you have more than one node without taints that would prevent pods from being scheduled there (see the sketch below).
2. Verify your [affinity](https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/) configuration.
3. For more information, please refer to the ["Scheduling"](https://cloudnative-pg.io/documentation/current/scheduling/) section in the documentation.
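For step 1, a quick sketch for listing node taints:

```bash
# Nodes with an empty TAINTS column are schedulable by default.
kubectl get nodes -o custom-columns='NAME:.metadata.name,TAINTS:.spec.taints[*].key'
```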
charts/cluster/docs/runbooks/CNPGClusterLowDiskSpaceCritical.md — 31 additions, 0 deletions
CNPGClusterLowDiskSpaceCritical
===============================

Meaning
-------

This alert is triggered when disk space usage on the CloudNativePG cluster exceeds 90%. It can be triggered by any of the following:

* the PVC hosting the `PGDATA` (`storage` section)
* the PVC hosting WAL files (`walStorage` section), where applicable
* any PVC hosting a tablespace (`tablespaces` section)
Impact
------

Excessive disk space usage can lead to fragmentation, negatively impacting performance. Reaching 100% disk usage will
result in downtime and data loss.

Diagnosis
---------

Use the [CloudNativePG Grafana Dashboard](https://grafana.com/grafana/dashboards/20417-cloudnativepg/).
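To check usage from inside an instance pod, a sketch (the mount point assumes the default `PGDATA` volume layout; names are placeholders):

```bash
kubectl exec --namespace <namespace> <instance-pod-name> -- df -h /var/lib/postgresql/data
```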
Mitigation
----------

If you experience issues with the WAL (Write-Ahead Logging) volume and have
set up continuous archiving, ensure that WAL archiving is functioning
correctly. This is crucial to avoid a buildup of WAL files in the `pg_wal`
folder. Monitor the `cnpg_collector_pg_wal_archive_status` metric, specifically
ensuring that the number of `ready` files does not increase linearly.
charts/cluster/docs/runbooks/CNPGClusterLowDiskSpaceWarning.md — 31 additions, 0 deletions
CNPGClusterLowDiskSpaceWarning
==============================

Meaning
-------

This alert is triggered when disk space usage on the CloudNativePG cluster exceeds 90%. It can be triggered by any of the following:

* the PVC hosting the `PGDATA` (`storage` section)
* the PVC hosting WAL files (`walStorage` section), where applicable
* any PVC hosting a tablespace (`tablespaces` section)
Impact
------

Excessive disk space usage can lead to fragmentation, negatively impacting performance. Reaching 100% disk usage will
result in downtime and data loss.

Diagnosis
---------

Use the [CloudNativePG Grafana Dashboard](https://grafana.com/grafana/dashboards/20417-cloudnativepg/).
Mitigation
----------

If you experience issues with the WAL (Write-Ahead Logging) volume and have
set up continuous archiving, ensure that WAL archiving is functioning
correctly. This is crucial to avoid a buildup of WAL files in the `pg_wal`
folder. Monitor the `cnpg_collector_pg_wal_archive_status` metric, specifically
ensuring that the number of `ready` files does not increase linearly.
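If your storage class supports volume expansion, one mitigation sketch is to grow the affected PVC; the name and target size are placeholders:

```bash
kubectl patch pvc <pvc_name> --namespace <namespace> --type merge \
  -p '{"spec":{"resources":{"requests":{"storage":"20Gi"}}}}'
```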
CNPGClusterOffline
==================

Meaning
-------

The `CNPGClusterOffline` alert is triggered when there are no ready CloudNativePG instances.

Impact
------

Having an offline cluster means your applications will not be able to access the database, leading to potential service
disruption.

Diagnosis
---------

Use the [CloudNativePG Grafana Dashboard](https://grafana.com/grafana/dashboards/20417-cloudnativepg/).
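Check the reported phase of your `Cluster` resources (using the fully qualified resource name to avoid clashes with other `cluster` CRDs):

```bash
kubectl get clusters.postgresql.cnpg.io -A
```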
Get the status of the CloudNativePG cluster instances:

```bash
kubectl get pods -A -l "cnpg.io/podRole=instance" -o wide
```

Check the logs of the affected CloudNativePG instances:

```bash
kubectl logs --namespace <namespace> pod/<instance-pod-name>
```

Check the CloudNativePG operator logs:

```bash
kubectl logs --namespace cnpg-system -l "app.kubernetes.io/name=cloudnative-pg"
```
Mitigation
----------

Refer to the [CloudNativePG Failure Modes](https://cloudnative-pg.io/documentation/current/failure_modes/)
and [CloudNativePG Troubleshooting](https://cloudnative-pg.io/documentation/current/troubleshooting/) documentation for
more information on how to troubleshoot and mitigate this issue.