The NVIDIA Data Center GPU Manager integration collects key advanced GPU
metrics from DCGM, including Streaming Multiprocessor (SM) block
utilization, SM occupancy, SM pipe utilization, PCIe traffic rate, and
NVLink traffic rate. For information about the purpose and
interpretation of these metrics, see Profiling Metrics in the DCGM
feature overview. For more information about the NVIDIA Data Center GPU
Manager, see the DCGM documentation.

This integration is compatible with DCGM version 3.1 and later.

The Ops Agent collects DCGM metrics by using NVIDIA's client library,
go-dcgm.

These metrics are available for Linux systems only. Metrics are not
collected from NVIDIA GPU models K80, P100, and P4.
Prerequisites
To collect DCGM metrics, you must do the following:
Install DCGM and verify installation
You must install DCGM version 3.1 or later and ensure that it runs as a
privileged service. To install DCGM, see Installation in the DCGM
documentation.
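For example, on a Debian or Ubuntu VM, installation typically follows
the pattern below. This is a sketch only; the repository setup and
package name can vary by distribution and DCGM release, so defer to the
DCGM Installation guide for your system:

# Sketch for Debian/Ubuntu; assumes NVIDIA's package repository is already configured.
sudo apt-get update
sudo apt-get install -y datacenter-gpu-manager

# Enable and start DCGM as a privileged systemd service.
sudo systemctl --now enable nvidia-dcgm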
To verify that DCGM is running correctly, do the following:
Check the status of the DCGM service by running the following command:
sudo service nvidia-dcgm status
If the service is running, the nvidia-dcgm service is listed as
active (running). The output resembles the following:
● nvidia-dcgm.service - NVIDIA DCGM service
Loaded: loaded (/usr/lib/systemd/system/nvidia-dcgm.service; disabled; vendor preset: enabled)
Active: active (running) since Sat 2023-01-07 15:24:29 UTC; 3s ago
Main PID: 24388 (nv-hostengine)
Tasks: 7 (limit: 14745)
CGroup: /system.slice/nvidia-dcgm.service
└─24388 /usr/bin/nv-hostengine -n --service-account nvidia-dcgm
Verify that the GPU devices are found by running the following command:
dcgmi discovery --list
If devices are found, the output resembles the following:
1 GPU found.
+--------+----------------------------------------------------------------------+
| GPU ID | Device Information |
+--------+----------------------------------------------------------------------+
| 0 | Name: NVIDIA A100-SXM4-40GB |
| | PCI Bus ID: 00000000:00:04.0 |
| | Device UUID: GPU-a2d9f5c7-87d3-7d57-3277-e091ad1ba957 |
+--------+----------------------------------------------------------------------+
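Optionally, you can also spot-check that the profiling fields behind
these metrics are readable. The sketch below samples the SM activity
and SM occupancy profiling fields; the field IDs 1002 and 1003 are
taken from the DCGM field reference, and their availability depends on
your GPU model:

# Sample profiling fields 1002 (SM activity) and 1003 (SM occupancy)
# five times, then exit. Errors instead of numeric values suggest the
# GPU or DCGM build does not support these fields.
dcgmi dmon -e 1002,1003 -c 5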
Following the guide for Configuring the Ops Agent, add the required
elements to collect telemetry from your DCGM service, and restart the
agent.
Example configuration
The following commands create the configuration to collect and ingest telemetry
for DCGM and restart the Ops Agent:
# Configures Ops Agent to collect telemetry from the app and restart Ops Agent.
set -e
# Create a backup of the existing file so existing configurations are not lost.
sudo cp /etc/google-cloud-ops-agent/config.yaml /etc/google-cloud-ops-agent/config.yaml.bak
# Configure the Ops Agent.
sudo tee /etc/google-cloud-ops-agent/config.yaml > /dev/null << EOF
metrics:
  receivers:
    dcgm:
      type: dcgm
  service:
    pipelines:
      dcgm:
        receivers:
          - dcgm
EOF

sudo systemctl restart google-cloud-ops-agent
After running these commands, you can
check that the agent restarted. Run the following command and verify that
the sub-agent components "Metrics Agent" and "Logging Agent" are listed as
"active (running)":
sudo systemctl status google-cloud-ops-agent"*"
If you are using a custom service account instead of the default
Compute Engine service account, or if you have a very old Compute Engine
VM, then you might need to authorize the Ops Agent.
Configure metrics collection
To ingest metrics from DCGM, you must create a receiver for the metrics
that DCGM produces and then create a pipeline for the new receiver.
This receiver does not
support the use of multiple instances in the configuration, for example, to
monitor multiple endpoints. All such instances write to the same time series,
and Cloud Monitoring has no way to distinguish among them.
To configure a receiver for your dcgm metrics, specify the following
fields:

| Field | Default | Description |
|---|---|---|
| collection_interval | 60s | A time duration, such as 30s or 5m. |
| endpoint | localhost:5555 | Address of the DCGM service, formatted as host:port. |
| type | | This value must be dcgm. |
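For example, the following sketch overrides both defaults, polling the
DCGM hostengine on its standard port every 30 seconds; adjust the
endpoint value if your nv-hostengine listens on a different host or
port:

# Overwrite the Ops Agent configuration with a customized DCGM receiver.
sudo tee /etc/google-cloud-ops-agent/config.yaml > /dev/null << EOF
metrics:
  receivers:
    dcgm:
      type: dcgm
      collection_interval: 30s
      endpoint: localhost:5555
  service:
    pipelines:
      dcgm:
        receivers:
          - dcgm
EOF

# Restart the agent so the new receiver takes effect.
sudo systemctl restart google-cloud-ops-agent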
What is monitored
The following table provides the list of metrics that the Ops Agent collects
from the DCGM service. Not all metrics are available for all GPU models.
Metrics are not collected from NVIDIA GPU models K80, P100, and P4.
| Metric type | Kind, Type | Monitored resources | Labels | Supported GPU models |
|---|---|---|---|---|
| workload.googleapis.com/dcgm.gpu.profiling.dram_utilization | GAUGE, DOUBLE | gce_instance | gpu_number, model, uuid | All except K80, P100, and P4 |
| workload.googleapis.com/dcgm.gpu.profiling.nvlink_traffic_rate | GAUGE, INT64 | gce_instance | direction, gpu_number, model, uuid | All except K80, P100, and P4 |
| workload.googleapis.com/dcgm.gpu.profiling.pcie_traffic_rate | GAUGE, INT64 | gce_instance | direction, gpu_number, model, uuid | All except K80, P100, and P4 |
| workload.googleapis.com/dcgm.gpu.profiling.pipe_utilization | GAUGE, DOUBLE | gce_instance | gpu_number, model, pipe, uuid | All except K80, P100, and P4. For L4, the pipe value fp64 is not supported. |
| workload.googleapis.com/dcgm.gpu.profiling.sm_occupancy | GAUGE, DOUBLE | gce_instance | gpu_number, model, uuid | All except K80, P100, and P4 |
| workload.googleapis.com/dcgm.gpu.profiling.sm_utilization | GAUGE, DOUBLE | gce_instance | gpu_number, model, uuid | All except K80, P100, and P4 |
In addition, the built-in configuration for the Ops Agent collects
agent.googleapis.com/gpu metrics, which are reported by the NVIDIA
Management Library (NVML). You do not need any additional configuration
in the Ops Agent to collect these metrics, but you must create your VM
with attached GPUs and install the GPU driver. For more information, see
About the gpu metrics.
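A quick way to confirm that the driver installation succeeded is to run
nvidia-smi on the VM; when the driver is healthy, it lists the attached
GPUs along with the installed driver version:

# Lists attached GPUs and the installed driver version; an error here
# usually means the GPU driver is missing or misconfigured.
nvidia-smi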
Verify the configuration
This section describes how to verify that you correctly configured the
NVIDIA DCGM receiver. It might take one or two
minutes for the Ops Agent to begin collecting telemetry.
To verify that NVIDIA DCGM metrics are being sent to
Cloud Monitoring, do the following:
- In the Google Cloud console, go to the Metrics explorer page:
  Go to Metrics explorer
  If you use the search bar to find this page, then select the result
  whose subheading is Monitoring.
- In the toolbar of the query-builder pane, select the button whose name
  is either MQL or PromQL.
- Verify that MQL is selected in the Language toggle. The language
  toggle is in the same toolbar that lets you format your query.
- Enter the following query in the editor, and then click Run query:
fetch gce_instance
| metric 'workload.googleapis.com/dcgm.gpu.profiling.sm_utilization'
| every 1m
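If you selected PromQL instead, the query below should be equivalent
under Cloud Monitoring's usual metric-name mapping, in which the metric
domain becomes a prefix ending in a colon and the remaining slashes and
dots become underscores; verify the mapped name in Metrics explorer if
it does not resolve:

# PromQL form of the same metric, assuming the standard name mapping.
workload_googleapis_com:dcgm_gpu_profiling_sm_utilization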
View dashboard
To view your NVIDIA DCGM metrics, you must have a chart or dashboard
configured.
The NVIDIA DCGM integration includes one or more dashboards for you.
Any dashboards are automatically installed after you configure the
integration and the Ops Agent has begun collecting metric data.
You can also view static previews of dashboards without
installing the integration.
To view an installed dashboard, do the following:
- In the Google Cloud console, go to the Dashboards page:
  Go to Dashboards
  If you use the search bar to find this page, then select the result
  whose subheading is Monitoring.
- Select the Dashboard List tab, and then choose the Integrations
  category.
- Click the name of the dashboard you want to view.
If you have configured an integration but the dashboard has not been
installed, then check that the Ops Agent is running. When there is no
metric data for a chart in the dashboard, installation of the dashboard fails.
After the Ops Agent begins collecting metrics, the dashboard is installed
for you.
To view a static preview of the dashboard, do the following:
- In the Google Cloud console, go to the Integrations page:
  Go to Integrations
  If you use the search bar to find this page, then select the result
  whose subheading is Monitoring.
- Click the Compute Engine deployment-platform filter.
- Locate the entry for NVIDIA DCGM and click View Details.
- Select the Dashboards tab to see a static preview. If the dashboard is
  installed, then you can navigate to it by clicking View dashboard.
For more information about dashboards in Cloud Monitoring, see
Dashboards and charts. For more information about using the
Integrations page, see Manage integrations.
DCGM limitations and pausing profiling
Concurrent usage of DCGM can conflict with usage of some other
NVIDIA developer tools, such as Nsight Systems or Nsight Compute.
This limitation applies to NVIDIA A100 and earlier GPUs. For more
information, see
Profiling Sampling Rate
in the DCGM feature overview.
When you need to use tools like Nsight Systems without significant
disruption, you can temporarily pause metrics collection and resume it
later by using the following commands:
dcgmi profile --pause
dcgmi profile --resume
When profiling is paused, none of the DCGM metrics that the Ops Agent
collects are emitted from the VM.
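One convenient pattern is to wrap the profiling session in a small
script so that metrics collection always resumes, even if the profiler
exits with an error. In this hypothetical sketch, "nsys profile
./my_app" is a placeholder for whatever Nsight Systems invocation you
actually run:

#!/bin/bash
# Pause DCGM profiling metrics while Nsight Systems runs, then resume.
set -e
dcgmi profile --pause
trap 'dcgmi profile --resume' EXIT   # resume even if the profiler fails
nsys profile ./my_app                # placeholder for your real command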
What's next
For a walkthrough on how to use Ansible to install the Ops Agent, configure a
third-party application, and install a sample dashboard, see the
Install the Ops Agent to troubleshoot third-party applications
video.