Defined Type: monitoring::check_prometheus

Defined in:
modules/monitoring/manifests/check_prometheus.pp

Overview

Define monitoring::check_prometheus

Set up an alert based on a Prometheus query. The result of the query must be a scalar so that it can be compared against a threshold.

Usage

Check the load average on a single host. The result must be a scalar to be compared against a threshold, hence the use of scalar().

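monitoring::check_prometheus { 'loadavg-1min':
    description     => 'High one minute load average',
    query           => "scalar(node_load1{instance=\"${::hostname}:9100\"})",
    prometheus_url  => "http://prometheus.svc.${::site}.wmnet/ops",
    warning         => 5,
    critical        => 10,
    # dashboard_links is required; the URL below is only a placeholder
    dashboard_links => ['https://grafana.wikimedia.org/d/example'],
}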

Another example: compare the 1min load average with the number of CPUs and alert when their ratio reaches the thresholds (warning at 0.7, critical at 1, i.e. when the load average reaches the number of CPUs). Note that the expression could be simplified, e.g. by recording the number of CPUs as a separate metric: prometheus.io/docs/practices/rules/

monitoring::check_prometheus { 'loadavg-1min-vs-cpus':
    description     => 'load average exceeds the number of CPUs',
    query           => "scalar(node_load1{instance=\"${::hostname}:9100\"}) / scalar(count(node_cpu_seconds_total{instance=\"${::hostname}:9100\",mode=\"idle\"}) by (instance))",
    prometheus_url  => "http://prometheus.svc.${::site}.wmnet/ops",
    warning         => 0.7,
    critical        => 1,
    # dashboard_links is required; the URL below is only a placeholder
    dashboard_links => ['https://grafana.wikimedia.org/d/example'],
}
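
A further sketch, not taken from the original manifest: alert when a value falls below a threshold by setting method to 'lt', and treat NaN as an OK result with nan_ok. The metric names are standard node_exporter ones, but the query, thresholds and dashboard URL are illustrative assumptions, as is the reading that 'lt' alerts when the result is lower than the threshold.

```puppet
# Sketch only: query, thresholds and dashboard URL are illustrative assumptions.
monitoring::check_prometheus { 'rootfs-free-ratio':
    description     => 'Low free space ratio on the root filesystem',
    query           => "scalar(node_filesystem_avail_bytes{instance=\"${::hostname}:9100\",mountpoint=\"/\"}) / scalar(node_filesystem_size_bytes{instance=\"${::hostname}:9100\",mountpoint=\"/\"})",
    prometheus_url  => "http://prometheus.svc.${::site}.wmnet/ops",
    method          => 'lt',   # assumed: alert when the result is below the threshold
    warning         => 0.2,
    critical        => 0.1,
    nan_ok          => true,   # scalar() yields NaN when no single series matches
    dashboard_links => ['https://grafana.wikimedia.org/d/example'],  # placeholder URL
}
```

Because scalar() returns NaN whenever its input does not contain exactly one series, nan_ok decides whether a missing metric counts as OK or as a failure.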

Parameters

description

Icinga description

query

The Prometheus query to run. The result must be a scalar; see also prometheus.io/docs/querying/basics/#expression-language-data-types

prometheus_url

The URL of a Prometheus server instance.

warning

Warning threshold

critical

Critical threshold

method

Threshold comparison method. One of gt, ge, lt, le, eq, ne

nan_ok

Is NaN considered an OK result?

retries

How many times to retry before considering this check to be in a HARD state. With the default retry_interval of 1, this is also the number of minutes.

group

Icinga service group.

ensure

Puppet ensure, absent/present

nagios_critical

Notify via paging if this check fails

contact_group

What contact groups to use for notifications

dashboard_links

Links to the Grafana dashboard for this alarm. URLs must not be URL-encoded as they will be encoded by Icinga.

notes_link

Additional link to add to the Icinga notes_url. URLs must not be URL-encoded, as they will be encoded by Icinga.
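
To illustrate the retry, notification and link parameters described above, here is a minimal sketch of a check that pages on failure. The contact group name, runbook URL and dashboard URL are hypothetical and only show where each value goes.

```puppet
# Sketch only: 'example-team' and both URLs are hypothetical placeholders.
monitoring::check_prometheus { 'loadavg-1min-paging':
    description     => 'High one minute load average',
    query           => "scalar(node_load1{instance=\"${::hostname}:9100\"})",
    prometheus_url  => "http://prometheus.svc.${::site}.wmnet/ops",
    warning         => 5,
    critical        => 10,
    retries         => 10,     # ten attempts before the check enters a HARD state
    nagios_critical => true,   # notify via paging when the check fails
    contact_group   => 'example-team',
    dashboard_links => ['https://grafana.wikimedia.org/d/example'],
    notes_link      => 'https://wikitech.wikimedia.org/wiki/Example_runbook',
}
```

Both links are passed as plain, unencoded URLs, since Icinga encodes them itself.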

Parameters:

  • description (String)
  • query (String)
  • prometheus_url (Stdlib::HTTPUrl)
  • warning (Numeric)
  • critical (Numeric)
  • dashboard_links (Array[Pattern[/^https:\/\/(grafana|logstash)\.wikimedia\.org/], 1])
  • method (Enum['gt', 'ge', 'lt', 'le', 'eq', 'ne']) (defaults to: 'ge')
  • nan_ok (Boolean) (defaults to: false)
  • check_interval (Integer) (defaults to: 1)
  • retry_interval (Integer) (defaults to: 1)
  • retries (Integer) (defaults to: 5)
  • group (Optional[String]) (defaults to: undef)
  • ensure (Wmflib::Ensure) (defaults to: present)
  • nagios_critical (Boolean) (defaults to: false)
  • contact_group (String) (defaults to: 'admins')
  • notes_link (Stdlib::HTTPUrl) (defaults to: 'https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link')
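
The dashboard_links type listed above accepts only a non-empty array whose entries start with https://grafana.wikimedia.org or https://logstash.wikimedia.org. As a rough sketch (all values are hypothetical), the same data type can be tried out with Puppet's =~ operator:

```puppet
# Hypothetical values checked against the dashboard_links data type.
$links_type = Array[Pattern[/^https:\/\/(grafana|logstash)\.wikimedia\.org/], 1]

$good  = ['https://grafana.wikimedia.org/d/example']
$empty = []
$plain = ['http://grafana.wikimedia.org/d/example']

notice($good =~ $links_type)   # matches: https URL on an allowed host
notice($empty =~ $links_type)  # no match: at least one link is required
notice($plain =~ $links_type)  # no match: the pattern requires https://
```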


# File 'modules/monitoring/manifests/check_prometheus.pp', line 80

define monitoring::check_prometheus(
    String $description,
    String $query,
    Stdlib::HTTPUrl $prometheus_url,
    Numeric $warning,
    Numeric $critical,
    Array[Pattern[/^https:\/\/(grafana|logstash)\.wikimedia\.org/], 1] $dashboard_links,
    Enum['gt', 'ge', 'lt', 'le', 'eq', 'ne'] $method          = 'ge',
    Boolean $nan_ok          = false,
    Integer $check_interval  = 1,
    Integer $retry_interval  = 1,
    Integer $retries         = 5,
    Optional[String] $group           = undef,
    Wmflib::Ensure $ensure          = present,
    Boolean $nagios_critical = false,
    String $contact_group   = 'admins',
    Stdlib::HTTPUrl $notes_link = 'https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link',
) {
    # `!` separates the arguments in check_command below, so reject any unescaped `!`
    # and any single quote in the query; see https://stackoverflow.com/a/28738919/3075306
    if $query =~ /(?<!\\)(?:(\\\\)*)[!]/ {
        fail('All exclamation marks in the query parameter must be escaped e.g. \!')
    }
    if $query =~ /\'/ {
        fail('Query cannot contain single quotes')
    }
    $notes_urls = monitoring::build_notes_url($notes_link, $dashboard_links)

    $command = $nan_ok ? {
        true    => 'check_prometheus_nan_ok',
        default => 'check_prometheus',
    }

    monitoring::service { $title:
        ensure         => $ensure,
        description    => $description,
        check_command  => "${command}!${prometheus_url}!${query}!${warning}!${critical}!${title}!${method}",
        retries        => $retries,
        check_interval => $check_interval,
        retry_interval => $retry_interval,
        group          => $group,
        critical       => $nagios_critical,
        contact_group  => $contact_group,
        notes_url      => $notes_urls,
    }
}
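
The validation at the top of the define rejects unescaped exclamation marks, since `!` separates the arguments in check_command. The following sketch shows one way a query using the PromQL `!=` matcher could be written: in a double-quoted Puppet string `\\` yields a single backslash, so the resulting query contains `\!=`. The query, thresholds and dashboard URL are illustrative assumptions.

```puppet
# Sketch only: query, thresholds and dashboard URL are illustrative assumptions.
monitoring::check_prometheus { 'cpu-busy-cores':
    description     => 'High number of busy CPU cores',
    # "mode\\!=" in the manifest becomes "mode\!=" in the resulting string
    query           => "scalar(sum(rate(node_cpu_seconds_total{instance=\"${::hostname}:9100\",mode\\!=\"idle\"}[5m])))",
    prometheus_url  => "http://prometheus.svc.${::site}.wmnet/ops",
    warning         => 6,
    critical        => 8,
    dashboard_links => ['https://grafana.wikimedia.org/d/example'],  # placeholder URL
}
```

The query also avoids single quotes, which the second validation would reject.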