An alerting policy is represented in the Cloud Monitoring API
by an
AlertPolicy
object,
which describes a set of conditions indicating a potentially
unhealthy status in your system.
This document describes the following:
- How the Monitoring API represents alerting policies.
- The types of conditions the Monitoring API provides for
alerting policies.
- How to create an alerting policy by using the Google Cloud CLI or
client libraries.
Structure of an alerting policy
The
AlertPolicy
structure defines the components of an
alerting policy. When you create a policy, you specify values for the
following
AlertPolicy
fields:
You can also specify the
severity
field when you use the Cloud Monitoring API
and the Google Cloud console. This field lets you define the severity level of
incidents. If you don't specify a severity,
then Cloud Monitoring sets the alerting policy severity to
No Severity
.
There are other fields you might use, depending on the conditions you create.
When an alerting policy contains one condition, a notification is sent when
that condition is met. For information about notifications when alerting
policies contain multiple conditions, see
Policies with multiple conditions
and
Number of notifications per policy
.
When you create or modify the alerting policy, Monitoring sets
other fields as well, including the
name
field. The value of the
name
field is the resource name for the alerting policy, which identifies the
policy. The resource name has the following form:
projects/PROJECT_ID/alertPolicies/POLICY_ID
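As a minimal sketch, the resource-name form above can be built and parsed with a few lines of Python. The helper names are hypothetical and aren't part of any client library; they only illustrate the structure of the name.

```python
import re
from typing import Tuple

# Pattern for the documented form: projects/PROJECT_ID/alertPolicies/POLICY_ID
_POLICY_NAME = re.compile(
    r"^projects/(?P<project>[^/]+)/alertPolicies/(?P<policy>[^/]+)$"
)


def policy_resource_name(project_id: str, policy_id: str) -> str:
    """Build the resource name of an alerting policy (hypothetical helper)."""
    return f"projects/{project_id}/alertPolicies/{policy_id}"


def parse_policy_resource_name(name: str) -> Tuple[str, str]:
    """Split a resource name into (project ID, policy ID)."""
    match = _POLICY_NAME.match(name)
    if match is None:
        raise ValueError(f"not an alerting policy resource name: {name!r}")
    return match["project"], match["policy"]
```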
Types of conditions in the API
The Cloud Monitoring API supports a variety of condition types in the
Condition
structure. There are multiple condition
types for metric-based alerting policies, and one for log-based alerting
policies. The following sections describe the available condition types.
Conditions for metric-based alerting policies
To create an alerting policy that monitors metric data, including log-based
metrics, you can use the following condition types:
Filter-based metric conditions
The
MetricAbsence
and
MetricThreshold
conditions use
Monitoring filters
to select the time-series data
to monitor. Other fields in the condition structure specify how to filter,
group, and aggregate the data. For more information on these concepts, see
Filtering and aggregation: manipulating time series
.
If you use the
MetricAbsence
condition type, then you can create a condition
that is met only when all of the time series are absent. This condition uses
the
aggregations
parameter to aggregate multiple time series into a single
time series. For more information, see
the
MetricAbsence
reference in the API documentation.
A metric-absence alerting policy requires that some data has been written
previously; for more information, see
Create metric-absence alerting policies
.
If you want to get notified based on a forecasted value, then configure
your alerting policy to use the
MetricThreshold
condition type and to set the
forecastOptions
field. When
this field isn't set, then the measured data is compared to a threshold.
However, when this field is set, then predicted data is compared to a
threshold. For more information, see
Create forecasted metric-value alerting policies
.
MQL-based metric conditions
The
MonitoringQueryLanguageCondition
condition uses Monitoring Query Language (MQL) to
select and manipulate the time-series data to monitor. You can create alerting
policies that compare values against a threshold or test for the absence
of values with this condition type.
If you use a
MonitoringQueryLanguageCondition
condition, it must be the only
condition in your alerting policy. For more information, see
Alerting policies with MQL
.
PromQL-based metric conditions
The
PrometheusQueryLanguageCondition
condition uses Prometheus Query Language (PromQL)
queries to select and manipulate time-series data to monitor.
Your condition can compute a ratio of metrics,
evaluate metric comparisons, and more.
If you use a
PrometheusQueryLanguageCondition
condition, it must be the only
condition in your alerting policy. For more information, see
Alerting policies with PromQL
.
Conditions for alerting on ratios
You can create metric-threshold alerting policies to monitor the
ratio of two metrics. You can create these policies by using either
the
MetricThreshold
or
MonitoringQueryLanguageCondition
condition type.
You can also use MQL directly in the Google Cloud console. You can't create
or manage ratio-based conditions by using the graphical interface for creating
threshold conditions.
We recommend using MQL to create ratio-based alerting policies.
MQL lets you build more powerful and flexible queries than you can
build by using the
MetricThreshold
condition type and
Monitoring filters.
For example, with a
MonitoringQueryLanguageCondition
condition, you can
compute the ratio of a gauge metric to a delta metric. For examples, see
MQL alerting-policy examples
.
If you use the
MetricThreshold
condition, the numerator and denominator
of the ratio must have the same
MetricKind
.
For a list of metrics and their properties, see
Metric lists
.
In general, it is best to compute ratios based on time series collected for
a single metric type, by using label values. A ratio computed over two
different metric types is subject to anomalies due to different sampling
periods and alignment windows.
For example, suppose that you have two different metric types, an RPC total
count and an RPC error count, and you want to compute the ratio of error-count
RPCs over total RPCs. The unsuccessful RPCs are counted in the time series of
both metric types. Therefore, there is a chance that, when you align the time
series, an unsuccessful RPC doesn't appear in the same alignment interval for
both time series. This difference
can happen for several reasons, including the following:
- Because there are two different time series recording the same event, there
are two underlying counter values implementing the collection, and they
aren't updated atomically.
- The sampling rates might differ. When the time series are aligned to a common
period, the counts for a single event might appear in adjacent alignment
intervals in the time series for the different metrics.
The difference in the number of values in corresponding alignment intervals can
lead to nonsensical
error/total
ratio values like 1/0 or 2/1.
Ratios of larger numbers are less likely to result in nonsensical values.
You can get larger numbers by aggregation, either by using an alignment window
that is
longer than the sampling period, or by grouping data for certain
labels. These techniques minimize the effect of small differences in the
number of points in a given interval. That is, a two-point disparity is more
significant when the expected number of points in an interval is 3 than when
the expected number is 300.
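The effect of a longer alignment window can be sketched with a small, self-contained Python example. The per-interval counts are made-up illustrative values, not real monitoring data: they model errors that land one interval earlier in the error series than in the total series.

```python
from typing import List, Optional


def ratios(errors: List[int], totals: List[int]) -> List[Optional[float]]:
    """Per-interval error/total ratio; None marks an undefined ratio (x/0)."""
    return [e / t if t else None for e, t in zip(errors, totals)]


def widen(values: List[int], factor: int) -> List[int]:
    """Sum consecutive intervals, which models a longer alignment window."""
    return [sum(values[i:i + factor]) for i in range(0, len(values), factor)]


# Illustrative counts per short alignment interval (assumed values).
errors = [1, 0, 2, 0]
totals = [0, 1, 1, 2]

# Short windows produce nonsensical values such as 1/0 and 2/1.
narrow = ratios(errors, totals)

# Doubling the window yields ratios that stay between 0 and 1.
wide = ratios(widen(errors, 2), widen(totals, 2))
```

In this sketch, the short windows produce an undefined ratio and a ratio of 2.0, while the doubled window keeps every ratio at or below 1.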
If you are using built-in metric types, then you might have no choice but to
compute ratios across metric types to get the value you need.
If you are designing custom metrics that might count the same thing, like
RPCs returning error status, in two different metrics, consider instead using
a single metric that includes each count only once. For example, suppose
that you are counting RPCs and you want to track the ratio of unsuccessful
RPCs to all RPCs. In that case,
create a single metric type to count RPCs, and use a label to record the
status of the invocation, including the "OK" status. Then each status value,
error or "OK", is recorded by updating a single counter for that case.
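The single-metric design can be sketched in plain Python. This is an in-memory model only, assuming a hypothetical `record_rpc` helper and the status value "UNAVAILABLE" as an example error; it doesn't write custom metrics to Cloud Monitoring.

```python
from collections import Counter

# One logical metric ("RPC count"); the status label distinguishes the cases.
rpc_count: Counter = Counter()


def record_rpc(status: str) -> None:
    """Record one RPC by updating exactly one counter, keyed by status label."""
    rpc_count[status] += 1


def error_ratio(counts: Counter) -> float:
    """Ratio of non-OK RPCs to all RPCs, derived from a single metric."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return (total - counts["OK"]) / total


for status in ["OK", "OK", "UNAVAILABLE", "OK"]:
    record_rpc(status)

ratio = error_ratio(rpc_count)
```

Because every RPC updates only one counter, the numerator and denominator of the ratio come from the same set of recorded events.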
Condition for log-based alerting policies
To create a log-based alerting policy, which notifies you when a message
matching your filter appears in your log entries, use the
LogMatch
condition type. If you use a
LogMatch
condition, it must be the only condition in your alerting policy.
Don't try to use the
LogMatch
condition type in conjunction with log-based
metrics. Alerting policies that monitor log-based metrics are metric-based
policies. For more information about choosing between alerting policies that
monitor log-based metrics or log entries, see
Monitoring your logs
.
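The "must be the only condition" rule for MQL, PromQL, and log-match conditions can be checked before a policy file is submitted. The following sketch assumes that the policy is a parsed JSON object and that the condition keys match the REST representation of the Condition structure (`conditionMonitoringQueryLanguage`, `conditionPrometheusQueryLanguage`, `conditionMatchedLog`); verify those names against the API reference before relying on them. The function name is hypothetical.

```python
# Condition keys that must not be combined with any other condition
# (assumed JSON field names; check the Condition reference).
EXCLUSIVE_CONDITION_KEYS = {
    "conditionMonitoringQueryLanguage",
    "conditionPrometheusQueryLanguage",
    "conditionMatchedLog",
}


def check_exclusive_conditions(policy: dict) -> None:
    """Raise ValueError if an exclusive condition type shares a policy."""
    conditions = policy.get("conditions", [])
    for condition in conditions:
        exclusive = EXCLUSIVE_CONDITION_KEYS.intersection(condition)
        if exclusive and len(conditions) > 1:
            raise ValueError(
                f"{sorted(exclusive)[0]} must be the only condition in the "
                f"policy, but the policy has {len(conditions)} conditions"
            )
```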
The alerting policies used in the examples in the
Manage alerting policies by API
document are metric-based
alerting
policies, although the principles are the same for log-based alerting policies.
For information specific to log-based alerting policies, see
Create a log-based alerting policy by using the Monitoring API
in the Cloud Logging documentation.
Before you begin
Before writing code against the API, you should:
- Be familiar with the general concepts and terminology used with alerting
policies; see
Alerting overview
for more
information.
- Ensure that the Cloud Monitoring API is enabled for use; see
Enabling the API
for more information.
- If you plan to use client libraries, then install the libraries for the
languages that you want to use; see
Client Libraries
for details.
Currently, API support
for alerting is available only for C#, Go, Java, Node.js, and Python.
- If you plan to use the Google Cloud CLI, then install it.
However, if you use
Cloud Shell
, then Google Cloud CLI is
already installed.
Examples using the
gcloud
interface are also provided here.
Note that the
gcloud
examples all assume that the current project has
already been set as the target (
gcloud config set project [PROJECT_ID]
)
so invocations omit the explicit
--project
flag. The ID
of the current project in the examples is
a-gcp-project
.
- To get the permissions that you need to create and modify alerting policies by using the Cloud Monitoring API,
ask your administrator to grant you the
Monitoring AlertPolicy Editor
(
roles/monitoring.alertPolicyEditor
) IAM role on your project.
For more information about granting roles, see
Manage access
.
You might also be able to get
the required permissions through
custom
roles
or other
predefined
roles
.
For detailed information about IAM roles for
Monitoring, see
Control access with Identity and Access Management
.
Design your application to single-thread Cloud Monitoring API calls that
modify the state of an alerting policy in a
Google Cloud project. For example, single-thread API calls that create, update,
or delete an alerting policy.
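One way to single-thread the modifying calls inside a multithreaded application is to route them through a shared lock. The following is a minimal sketch under that assumption; `serialized_policy_write` is a hypothetical helper, and a lock only serializes calls within one process, not across separate processes or machines.

```python
import threading
from typing import Callable, TypeVar

T = TypeVar("T")

# A single process-wide lock guarding every call that creates, updates,
# or deletes an alerting policy.
_policy_write_lock = threading.Lock()


def serialized_policy_write(call: Callable[[], T]) -> T:
    """Run a state-modifying API call while holding the shared lock.

    `call` is any zero-argument callable that wraps the actual API request.
    """
    with _policy_write_lock:
        return call()
```

Read-only calls, such as listing or getting policies, don't modify policy state and wouldn't need to go through this helper.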
Create an alerting policy
To create an alerting policy in a project, use the
alertPolicies.create
method. For information about how to invoke this
method, its parameters, and the response data, see the reference page
alertPolicies.create
.
You can create policies from JSON or YAML files.
The Google Cloud CLI accepts these files as arguments, and
you can programmatically read JSON files, convert them to
AlertPolicy
objects, and create policies from them
by using the
alertPolicies.create
method. If you
have a Prometheus JSON or YAML configuration file with an alerting rule, then
the gcloud CLI can migrate it to a Cloud Monitoring alerting
policy with a PromQL condition. For more information, see
Migrate alerting rules and receivers from Prometheus
.
Each alerting policy belongs to a scoping project of a metrics scope. Each
project can contain up to 500 policies.
For API calls, you must provide a “project ID”; use the
ID of the scoping project of a metrics scope as the value. In these examples,
the ID of the scoping project of a metrics scope is
a-gcp-project
.
The following samples illustrate the creation of alerting policies, but they
don't describe how to create a JSON or YAML file that describes
an alerting policy. Instead, the samples assume that a JSON-formatted file
exists and they illustrate how to issue the API call. For example JSON files,
see
Sample policies
.
For general information about monitoring ratios of metrics, see
Ratios of metrics
.
gcloud
To create an alerting policy in a project, use the
gcloud alpha monitoring policies create
command. The following example creates an alerting policy in
a-gcp-project
from the
rising-cpu-usage.json
file:
gcloud alpha monitoring policies create --policy-from-file="rising-cpu-usage.json"
If successful, this command returns the name of the new policy, for example:
Created alert policy [projects/a-gcp-project/alertPolicies/12669073143329903307].
The
rising-cpu-usage.json
file contains the JSON for a policy with
the display name “High CPU rate of change”. For details about this policy, see
Rate-of-change policy
.
See the
gcloud alpha monitoring policies create
reference for more information.
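For comparison, the same creation step can be sketched with the Python client library. This sketch assumes that the `google-cloud-monitoring` package is installed and that application default credentials are configured; the function names `parse_policy_json` and `create_policy_from_file` are hypothetical, and error handling is omitted.

```python
import json


def parse_policy_json(text: str) -> str:
    """Validate that the text is a JSON object and return it re-serialized."""
    policy = json.loads(text)  # raises ValueError on malformed JSON
    if not isinstance(policy, dict):
        raise ValueError("an alerting policy file must contain a JSON object")
    return json.dumps(policy)


def create_policy_from_file(project_id: str, path: str):
    """Create an alerting policy in the project from a JSON file."""
    # Imported here so that parse_policy_json works without the library.
    from google.cloud import monitoring_v3

    with open(path, encoding="utf-8") as policy_file:
        policy_json = parse_policy_json(policy_file.read())

    client = monitoring_v3.AlertPolicyServiceClient()
    policy = monitoring_v3.AlertPolicy.from_json(policy_json)
    return client.create_alert_policy(
        name=f"projects/{project_id}", alert_policy=policy
    )
```

A call such as `create_policy_from_file("a-gcp-project", "rising-cpu-usage.json")` would correspond to the gcloud example above.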
The created
AlertPolicy
object will have additional fields.
The policy itself will have
name
,
creationRecord
, and
mutationRecord
fields. Additionally, each condition in the policy is also given a
name
.
These fields cannot be modified externally, so there is no need to set them
when creating a policy. None of the JSON examples used for creating
policies include them, but if policies created from them are retrieved after
creation, the fields will be present.
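If you reuse a retrieved policy as the starting point for a new JSON file, one option is to drop the fields that Monitoring sets. The following sketch assumes that the policy is a parsed JSON object; the function name is hypothetical, and reusing retrieved policies this way is an illustrative scenario rather than a documented workflow.

```python
import copy

# Policy-level fields that Monitoring sets, as described above.
OUTPUT_ONLY_POLICY_FIELDS = ("name", "creationRecord", "mutationRecord")


def strip_output_only_fields(policy: dict) -> dict:
    """Return a copy of the policy without the fields that Monitoring sets."""
    cleaned = copy.deepcopy(policy)  # leave the caller's object untouched
    for field in OUTPUT_ONLY_POLICY_FIELDS:
        cleaned.pop(field, None)
    # Each condition is also given a name by Monitoring.
    for condition in cleaned.get("conditions", []):
        condition.pop("name", None)
    return cleaned
```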
Configure repeated notifications for metric-based alerting policies
By default, a metric-based alerting policy sends one notification to each
notification
channel when an incident is opened. However, you can change the default
behavior and configure an alerting policy to resend notifications to all or
some of the notification channels for your alerting policy.
These repeated notifications are sent for incidents
with a status of Open or Acknowledged. The interval between these notifications
must be at least 30 minutes
and no more than 24 hours, expressed in seconds.
To configure repeated notifications, add to the alerting policy's configuration
an
AlertStrategy
object that contains at least one
NotificationChannelStrategy
object.
A
NotificationChannelStrategy
object has two fields:
renotifyInterval
: The interval, in seconds, between repeated
notifications.
If you change the value of the
renotifyInterval
field while
an incident for the alerting policy is open, then the following happens:
- The alerting policy sends out another
notification for the incident.
- The alerting policy restarts the interval period.
notificationChannelNames
: An array of notification channel resource names,
which are strings in the format of
projects/PROJECT_ID/notificationChannels/CHANNEL_ID
, where
CHANNEL_ID
is a numeric value.
For information about how to retrieve the channel ID, see
List notification channels in a project
.
For example, the following JSON sample shows an alert strategy configured
to send repeated notifications every 1800 seconds (30 minutes)
to one notification channel:
"alertStrategy": {
"notificationChannelStrategy": [
{
"notificationChannelNames": [
"projects/PROJECT_ID/notificationChannels/CHANNEL_ID"
],
"renotifyInterval": "1800s"
}
]
}
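The interval limits described earlier (at least 30 minutes, no more than 24 hours) can be checked when the alert strategy is generated. The following sketch builds the same JSON structure as the sample; the `alert_strategy` function is a hypothetical helper, not part of a client library.

```python
from typing import Dict, List

MIN_RENOTIFY_SECONDS = 30 * 60        # 30 minutes
MAX_RENOTIFY_SECONDS = 24 * 60 * 60   # 24 hours


def alert_strategy(channel_names: List[str], renotify_seconds: int) -> Dict:
    """Build an alertStrategy fragment that enables repeated notifications."""
    if not MIN_RENOTIFY_SECONDS <= renotify_seconds <= MAX_RENOTIFY_SECONDS:
        raise ValueError(
            "renotifyInterval must be between 1800 and 86400 seconds, "
            f"got {renotify_seconds}"
        )
    if not channel_names:
        raise ValueError("at least one notification channel name is required")
    return {
        "alertStrategy": {
            "notificationChannelStrategy": [
                {
                    "notificationChannelNames": list(channel_names),
                    # Durations are written as a number of seconds plus "s".
                    "renotifyInterval": f"{renotify_seconds}s",
                }
            ]
        }
    }
```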
To temporarily stop repeated notifications, create a
snooze
. To
prevent repeated notifications, edit the alerting policy by using the API and
remove the
NotificationChannelStrategy
object.
What's next