Add dead mans switch endpoint (#20)
Add dead mans switch endpoint
slok authored Dec 16, 2019
2 parents 3ff2cff + 0559cd3 commit 7b35e3f
Showing 12 changed files with 570 additions and 35 deletions.
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -4,6 +4,7 @@

### Added

- Dead man's switch option with Alertmanager.
- Alertmanager API accepts a query string param with a custom chat ID.
- Telegram notifier can send to customized chats.

51 changes: 46 additions & 5 deletions README.md
@@ -9,6 +9,7 @@ Alertgram is the easiest way to forward alerts to [Telegram] (Supports [Promethe
## Table of contents

- [Introduction](#introduction)
- [Features](#features)
- [Input alerts](#input-alerts)
- [Options](#options)
- [Run](#run)
@@ -17,14 +18,27 @@ Alertgram is the easiest way to forward alerts to [Telegram] (Supports [Promethe
- [Metrics](#metrics)
- [Development and debugging](#development-and-debugging)
- [FAQ](#faq)
- [Only alertmanager alerts are supported?](#only-alertmanager-alerts-are-supported-)
- [Where does alertgram listen to alertmanager alerts?](#where-does-alertgram-listen-to-alertmanager-alerts-)
- [Can I notify to different chats?](#can-i-notify-to-different-chats-)
- [Can I use custom templates?](#can-i-use-custom-templates-)
- [Only alertmanager alerts are supported?](#only-alertmanager-alerts-are-supported)
- [Where does alertgram listen to alertmanager alerts?](#where-does-alertgram-listen-to-alertmanager-alerts)
- [Can I notify to different chats?](#can-i-notify-to-different-chats)
- [Can I use custom templates?](#can-i-use-custom-templates)
- [Dead man's switch?](#dead-mans-switch)

## Introduction

Everything started as a way of forwarding [Prometheus alertmanager] alerts to [Telegram] because the solutions that I found were too complex, I just wanted to forward alerts to channels without trouble. And Alertgram is just that, a simple app that forwards alerts to Telegram groups and channels.
Everything started as a way of forwarding [Prometheus alertmanager] alerts to [Telegram] because the solutions that I found were too complex and I just wanted to forward alerts to channels without trouble. Alertgram is just that: a simple app that forwards alerts to Telegram groups and channels, plus some small helper features like metrics and a dead man's switch.

## Features

- Alertmanager alerts webhook receiver compatibility.
- Telegram notifications.
- Metrics in Prometheus format.
- Optional dead man's switch endpoint.
- Optional customizable templates.
- Configurable notification chat ID targets (with fallback to default chat ID).
- Easy setup and flexible.
- Lightweight.
- Perfect for any environment, from a company cluster to cheap home clusters (e.g. [K3S]).

## Input alerts

@@ -117,6 +131,31 @@ To send an alert easily and check the template rendering without an alertmanager
curl -i http://127.0.0.1:8080/alerts -d @./testdata/alerts/base.json
```

### Dead man's switch?

A [dead man's switch][dms] (from now on, DMS) is a technique or process in which a signal must be received at regular intervals
to keep the DMS disabled; if that signal is not received, the DMS is activated.

In monitoring this means: if an alert is not received at regular intervals, the switch is activated and notifies us that we
are not receiving alerts. This is mostly used to check that our alerting system is working.

For example, we would set up Prometheus to fire an alert continuously, Alertmanager to send this specific alert
every `7m` to the DMS endpoint in Alertgram, and Alertgram with a `10m` DMS interval.

With this setup, if Prometheus fails to create the alert, Alertmanager fails to send it to Alertgram, or Alertgram does not receive
it (e.g. network problems), Alertgram will send an alert to Telegram to notify us that our monitoring system is broken.

You could use the same Alertgram instance or another one, usually on another machine or cluster, so that if the cluster/machine fails, your
DMS is isolated and can still notify you.

To enable Alertgram's DMS use `--dead-mans-switch.enable`. By default it listens on `/alerts/dms` with a
`5m` interval and uses Telegram's default notifier and chat ID. To customize these settings use:

- `--dead-mans-switch.interval`: To configure the interval.
- `--dead-mans-switch.chat-id`: To configure the chat that receives the DMS notification. It is independent of the notifier type
  (at this moment only Telegram); if not set, the notifier's default chat target is used.
- `--alertmanager.dead-mans-switch-path`: To configure the path where Alertmanager sends the DMS alerts.

[github-actions-image]: https://github.com/slok/alertgram/workflows/CI/badge.svg
[github-actions-url]: https://github.com/slok/alertgram/actions
[goreport-image]: https://goreportcard.com/badge/github.com/slok/alertgram
@@ -131,3 +170,5 @@ curl -i http://127.0.0.1:8080/alerts -d @./testdata/alerts/base.json
[html go templates]: https://golang.org/pkg/html/template/
[sprig]: http://masterminds.github.io/sprig
[query string]: https://en.wikipedia.org/wiki/Query_string
[k3s]: https://k3s.io/
[dms]: https://en.wikipedia.org/wiki/Dead_man%27s_switch
17 changes: 16 additions & 1 deletion cmd/alertgram/config.go
@@ -2,6 +2,7 @@ package main

import (
"os"
"time"

"gopkg.in/alecthomas/kingpin.v2"
)
@@ -15,12 +16,16 @@ var (
const (
descAMListenAddr = "The listen address where the server will be listening to alertmanager's webhook request."
descAMWebhookPath = "The path where the server will be handling the alertmanager webhook alert requests."
descAMChatIDQS = "The optional query string key used to customize the chat id of the notification."
descAMChatIDQS = "The optional query string key used to customize the chat id of the notification. Does not depend on the notifier type."
descAMDMSPath = "The path for the dead man's switch alerts from the Alertmanager."
descTelegramAPIToken = "The token that will be used to use the telegram API to send the alerts."
descTelegramDefChatID = "The default ID of the chat (group/channel) in telegram where the alerts will be sent."
descMetricsListenAddr = "The listen address where the metrics will be being served."
descMetricsPath = "The path where the metrics will be being served."
descMetricsHCPath = "The path where the healthcheck will be being served, it uses the same port as the metrics."
descDMSEnable = "Enables the dead man's switch, which will send an alert if no alert is received at regular intervals."
descDMSInterval = "The interval in which the dead man's switch needs to receive an alert to not send a notification alert (in Go time duration)."
descDMSChatID = "The chat ID (group/channel/room) where the dead man's switch will send the alerts. Does not depend on the notifier type; if not set, the notifier default chat ID will be used."
descDebug = "Run the application in debug mode."
descNotifyDryRun = "Dry run the notification and show in the terminal instead of sending."
descNotifyTemplatePath = "The path to set a custom template for the notification messages."
@@ -30,21 +35,27 @@ const (
defAMListenAddr = ":8080"
defAMWebhookPath = "/alerts"
defAMChatIDQS = "chat-id"
defAMDMSPath = "/alerts/dms"
defMetricsListenAddr = ":8081"
defMetricsPath = "/metrics"
defMetricsHCPath = "/status"
defDMSInterval = "5m"
)

// Config has the configuration of the application.
type Config struct {
AlertmanagerListenAddr string
AlertmanagerWebhookPath string
AlertmanagerChatIDQQueryString string
AlertmanagerDMSPath string
TeletramAPIToken string
TelegramChatID int64
MetricsListenAddr string
MetricsPath string
MetricsHCPath string
DMSInterval time.Duration
DMSEnable bool
DMSChatID string
NotifyTemplate *os.File
DebugMode bool
NotifyDryRun bool
@@ -75,11 +86,15 @@ func (c *Config) registerFlags() {
c.app.Flag("alertmanager.listen-address", descAMListenAddr).Default(defAMListenAddr).StringVar(&c.AlertmanagerListenAddr)
c.app.Flag("alertmanager.webhook-path", descAMWebhookPath).Default(defAMWebhookPath).StringVar(&c.AlertmanagerWebhookPath)
c.app.Flag("alertmanager.chat-id-query-string", descAMChatIDQS).Default(defAMChatIDQS).StringVar(&c.AlertmanagerChatIDQQueryString)
c.app.Flag("alertmanager.dead-mans-switch-path", descAMDMSPath).Default(defAMDMSPath).StringVar(&c.AlertmanagerDMSPath)
c.app.Flag("telegram.api-token", descTelegramAPIToken).Required().StringVar(&c.TeletramAPIToken)
c.app.Flag("telegram.chat-id", descTelegramDefChatID).Required().Int64Var(&c.TelegramChatID)
c.app.Flag("metrics.listen-address", descMetricsListenAddr).Default(defMetricsListenAddr).StringVar(&c.MetricsListenAddr)
c.app.Flag("metrics.path", descMetricsPath).Default(defMetricsPath).StringVar(&c.MetricsPath)
c.app.Flag("metrics.health-path", descMetricsHCPath).Default(defMetricsHCPath).StringVar(&c.MetricsHCPath)
c.app.Flag("dead-mans-switch.enable", descDMSEnable).BoolVar(&c.DMSEnable)
c.app.Flag("dead-mans-switch.interval", descDMSInterval).Default(defDMSInterval).DurationVar(&c.DMSInterval)
c.app.Flag("dead-mans-switch.chat-id", descDMSChatID).StringVar(&c.DMSChatID)
c.app.Flag("notify.dry-run", descNotifyDryRun).BoolVar(&c.NotifyDryRun)
c.app.Flag("notify.template-path", descNotifyTemplatePath).FileVar(&c.NotifyTemplate)
c.app.Flag("debug", descDebug).BoolVar(&c.DebugMode)
42 changes: 34 additions & 8 deletions cmd/alertgram/main.go
@@ -1,6 +1,7 @@
package main

import (
"context"
"fmt"
"io/ioutil"
"net/http"
@@ -14,6 +15,7 @@ import (
"github.com/prometheus/client_golang/prometheus/promhttp"
metricsmiddleware "github.com/slok/go-http-metrics/middleware"

"github.com/slok/alertgram/internal/deadmansswitch"
"github.com/slok/alertgram/internal/forward"
internalhttp "github.com/slok/alertgram/internal/http"
"github.com/slok/alertgram/internal/http/alertmanager"
@@ -82,22 +84,44 @@ func (m *Main) Run() error {
}
notifier = forward.NewMeasureNotifier(metricsRecorder, notifier)

// Domain services.
forwardSvc := forward.NewService([]forward.Notifier{notifier}, m.logger)
forwardSvc = forward.NewMeasureService(metricsRecorder, forwardSvc)
var g run.Group

// Alertmanager webhook server.
{

// Alert forward.
forwardSvc := forward.NewService([]forward.Notifier{notifier}, m.logger)
forwardSvc = forward.NewMeasureService(metricsRecorder, forwardSvc)

// Dead man's switch.
ctx, ctxCancel := context.WithCancel(context.Background())
var deadMansSwitchSvc deadmansswitch.Service = deadmansswitch.DisabledService // By default disabled.
if m.cfg.DMSEnable {
deadMansSwitchSvc, err = deadmansswitch.NewService(ctx, deadmansswitch.Config{
CustomChatID: m.cfg.DMSChatID,
Notifiers: []forward.Notifier{notifier},
Interval: m.cfg.DMSInterval,
Logger: m.logger,
})
if err != nil {
ctxCancel()
return err
}
}

// API server.
logger := m.logger.WithValues(log.KV{"server": "alertmanager-handler"})
h, err := alertmanager.NewHandler(alertmanager.Config{
Debug: m.cfg.DebugMode,
MetricsRecorder: metricsRecorder,
WebhookPath: m.cfg.AlertmanagerWebhookPath,
Forwarder: forwardSvc,
Logger: logger,
Debug: m.cfg.DebugMode,
MetricsRecorder: metricsRecorder,
WebhookPath: m.cfg.AlertmanagerWebhookPath,
DeadMansSwitchService: deadMansSwitchSvc,
DeadMansSwitchPath: m.cfg.AlertmanagerDMSPath,
ForwardService: forwardSvc,
Logger: logger,
})
if err != nil {
ctxCancel()
return err
}
server, err := internalhttp.NewServer(internalhttp.Config{
@@ -106,6 +130,7 @@ func (m *Main) Run() error {
Logger: logger,
})
if err != nil {
ctxCancel()
return err
}

@@ -114,6 +139,7 @@ func (m *Main) Run() error {
return server.ListenAndServe()
},
func(_ error) {
ctxCancel()
if err := server.DrainAndShutdown(); err != nil {
logger.Errorf("error while draining connections")
}
139 changes: 139 additions & 0 deletions internal/deadmansswitch/deadmansswitch.go
@@ -0,0 +1,139 @@
package deadmansswitch

import (
"context"
"fmt"
"time"

"github.com/slok/alertgram/internal/forward"
"github.com/slok/alertgram/internal/log"
"github.com/slok/alertgram/internal/model"
)

// Service is a Dead man's switch
//
// A dead man's switch is a process where at regular intervals if some kind of signal is
// not received it will be activated. This usually is used to check that some kind
// of system is working, in this case if we don't receive an alert we assume that something
// is not working and we should notify.
type Service interface {
// PushSwitch will disable the dead man's switch when it's pushed and reset
// the interval for activation.
PushSwitch(ctx context.Context, alertGroup *model.AlertGroup) error
}

// Config is the Service configuration.
type Config struct {
CustomChatID string
Interval time.Duration
Notifiers []forward.Notifier
Logger log.Logger
}

func (c *Config) defaults() error {
if c.Logger == nil {
c.Logger = log.Dummy
}
return nil
}

type service struct {
cfg Config
dmsSwitch chan *model.AlertGroup
notifiers []forward.Notifier
logger log.Logger
}

// NewService returns a dead man's switch service.
// Creating a new instance starts the dead man's switch interval loop;
// it can only be stopped once, by cancelling the received context.
func NewService(ctx context.Context, cfg Config) (Service, error) {
err := cfg.defaults()
if err != nil {
return nil, fmt.Errorf("invalid dead man's switch service configuration: %w", err)
}
s := &service{
cfg: cfg,
dmsSwitch: make(chan *model.AlertGroup, 1),
notifiers: cfg.Notifiers,
logger: cfg.Logger.WithValues(log.KV{"service": "deadMansSwitch"}),
}
go s.startDMS(ctx)

return s, nil
}

func (s *service) PushSwitch(_ context.Context, alertGroup *model.AlertGroup) error {
if alertGroup != nil {
s.dmsSwitch <- alertGroup
}
return nil
}

func (s *service) activate(ctx context.Context) error {
dmsNotification := forward.Notification{
ChatID: s.cfg.CustomChatID,
AlertGroup: model.AlertGroup{
ID: "DeadMansSwitchActive",
Alerts: []model.Alert{
model.Alert{
ID: "DeadMansSwitchActive",
Name: "DeadMansSwitchActive",
StartsAt: time.Now(),
Status: model.AlertStatusFiring,
Labels: map[string]string{
"alertname": "DeadMansSwitchActive",
"severity": "critical",
"origin": "alertgram",
},
Annotations: map[string]string{
"message": "The Dead man's switch has been activated! This usually means that your monitoring/alerting system is not working",
},
},
},
},
}

// TODO(slok): Add concurrency using workers.
for _, not := range s.notifiers {
err := not.Notify(ctx, dmsNotification)
if err != nil {
s.logger.WithValues(log.KV{"notifier": not.Type(), "alertGroupID": dmsNotification.AlertGroup.ID}).
Errorf("could not notify alert group: %s", err)
}
}
return nil
}

// startDMS starts the dead man's switch loop.
// It listens for pushes that signal we are alive. If no push is
// received within the interval, the dead man's switch assumes we
// are dead and activates, which means notifying through the
// configured notifiers.
func (s *service) startDMS(ctx context.Context) {
logger := s.logger.WithValues(log.KV{"interval": s.cfg.Interval})
logger.Infof("dead man's switch started with an interval of %s", s.cfg.Interval)

for {
select {
case <-ctx.Done():
logger.Infof("context done, stopping dead man's switch")
return
case <-time.After(s.cfg.Interval):
logger.Infof("no switch pushed during interval wait, dead mans switch activated!")
err := s.activate(ctx)
if err != nil {
logger.Errorf("something happened when activating the dead man's switch: %s", err)
}
case <-s.dmsSwitch:
logger.Debugf("dead mans switch pushed, deactivated")
}
}
}

// DisabledService is a Dead man switch service that doesn't do anything.
const DisabledService = dummyService(0)

type dummyService int

func (dummyService) PushSwitch(ctx context.Context, alertGroup *model.AlertGroup) error { return nil }
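
The `DisabledService` declaration above relies on a small Go trick: a type based on `int` can have methods and can also be declared as a constant, which gives an immutable no-op implementation of the interface. The following standalone sketch shows the same pattern with hypothetical names (`Switch`, `noopSwitch`, `countingSwitch`, `newSwitch`); it is not part of the commit.

```go
package main

import "fmt"

// Switch is a stand-in for the Service interface shown above.
type Switch interface {
	Push(signal string) error
}

// noopSwitch is based on int so that a value of it can be a constant.
type noopSwitch int

// Push ignores the signal and never fails.
func (noopSwitch) Push(string) error { return nil }

// Disabled is an immutable Switch that ignores every push.
const Disabled = noopSwitch(0)

// countingSwitch is a trivial "real" implementation used for contrast.
type countingSwitch struct{ pushes int }

// Push records that the switch was pushed.
func (c *countingSwitch) Push(string) error {
	c.pushes++
	return nil
}

// newSwitch returns the no-op implementation unless enabled is true, so
// callers never need a nil check before pushing.
func newSwitch(enabled bool) Switch {
	if enabled {
		return &countingSwitch{}
	}
	return Disabled
}

func main() {
	for _, enabled := range []bool{false, true} {
		sw := newSwitch(enabled)
		err := sw.Push("heartbeat")
		fmt.Printf("enabled=%v type=%T err=%v\n", enabled, sw, err)
	}
}
```

This is why `main.go` can always hand a `deadmansswitch.Service` to the HTTP handler: when the feature is off the handler still calls `PushSwitch`, it just does nothing.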