This document describes sources of diagnostic information that you can use to
identify problems in the installation or running of the Ops Agent.
Agent health checks
Version 2.25.1 introduced start-time health checks
for the Ops Agent.
When the Ops Agent starts, it performs a series of checks for conditions
that prevent the agent from running correctly. If the agent detects
one of the conditions, it logs a message describing the problem.
The Ops Agent checks for the following:
- Connectivity problems
- Availability of ports used by the agent to report metrics about itself
- Permission problems
- Availability of the APIs used by the agent to write logs or metrics
- A problem in the health-check routine itself.
For information about locating start-time errors, see Find start-time errors.
Version 2.37.0 introduced runtime health checks
for the Ops Agent.
These errors are reported to Cloud Logging and Error Reporting.
For information about locating runtime errors, see Find runtime errors.
Version 2.46.0 introduced the informational LogPingOpsAgent code. This code
does not represent an error. For more information, see
Verify successful log collection.
The following table lists each health-check code in alphabetical order and
describes what each code means. Codes that end with the string Err indicate
errors; other codes are informational.

| Health-check code | Category | Meaning | Suggestion |
|---|---|---|---|
| DLApiConnErr | Connectivity | Request to the downloads subdomain, dl.google.com, failed. | Check your internet connection and firewall rules. For more information, see Network-connectivity issues. |
| FbMetricsPortErr | Port availability | Port 20202, needed for Ops Agent self metrics, is unavailable. | Verify that port 20202 is open. For more information, see Required port is unavailable. |
| HcFailureErr | Generic | The Ops Agent health-check routine encountered an internal error. | Submit a support case from the Google Cloud console. For more information, see Getting support. |
| LogApiConnErr | Connectivity | Request to the Logging API failed. | Check your internet connection and firewall rules. For more information, see Network-connectivity issues. |
| LogApiDisabledErr | API | The Logging API is disabled in the current Google Cloud project. | Enable the Logging API. |
| LogApiPermissionErr | Permission | Service account is missing the Logs Writer role (roles/logging.logWriter). | Grant the Logs Writer role to the service account. For more information, see Agent lacks API permissions. |
| LogApiScopeErr | Permission | The VM is missing the https://www.googleapis.com/auth/logging.write access scope. | Add the https://www.googleapis.com/auth/logging.write scope to the VM. For more information, see Verify your access scopes. |
| LogApiUnauthenticatedErr | API | The current VM couldn't authenticate to the Logging API. | Verify that your credential files, VM access scopes, and permissions are set up correctly. For more information, see Authorize the Ops Agent. |
| LogPingOpsAgent | | An informational payload message written every 10 minutes to the ops-agent-health log. You can use the resulting log entries to verify that the agent is sending logs. This message is not an error. | This message is expected to appear every 10 minutes. If the message does not appear for 20 minutes or longer, then the agent might have encountered a problem. For troubleshooting information, see Troubleshoot the Ops Agent. |
| LogParseErr | Runtime | The Ops Agent was unable to parse one or more logs. | Check the configuration of any logging processors you've created. For more information, see Log-parsing errors. |
| LogPipeLineErr | Runtime | The Ops Agent's logging pipeline failed. | Verify that the agent has access to the buffer files, check for a full disk, and verify that the Ops Agent configuration is correct. For more information, see Pipeline errors. |
| MetaApiConnErr | Connectivity | Request to the GCE metadata server, which is used to query VM access scopes, OAuth tokens, and resource labels, failed. | Check your internet connection and firewall rules. For more information, see Network-connectivity issues. |
| MonApiConnErr | Connectivity | A request to the Monitoring API failed. | Check your internet connection and firewall rules. For more information, see Network-connectivity issues. |
| MonApiDisabledErr | API | The Monitoring API is disabled in the current Google Cloud project. | Enable the Monitoring API. |
| MonApiPermissionErr | Permission | Service account is missing the Monitoring Metric Writer role (roles/monitoring.metricWriter). | Grant the Monitoring Metric Writer role to the service account. For more information, see Agent lacks API permissions. |
| MonApiScopeErr | Permission | The VM is missing the https://www.googleapis.com/auth/monitoring.write access scope. | Add the https://www.googleapis.com/auth/monitoring.write scope to the VM. For more information, see Verify your access scopes. |
| MonApiUnauthenticatedErr | API | The current VM couldn't authenticate to the Monitoring API. | Verify that your credential files, VM access scopes, and permissions are set up correctly. For more information, see Authorize the Ops Agent. |
| OtelMetricsPortErr | Port availability | Port 20201, needed for Ops Agent self metrics, is unavailable. | Verify that port 20201 is open. For more information, see A required port is unavailable. |
| PacApiConnErr | Connectivity | This health-check code is unreliable and is disabled in Ops Agent version 2.46.1. | Update to Ops Agent version 2.46.1 or later. |
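For scripts that post-process the ops-agent-health log, the table above can be encoded as a small lookup. The following is a hypothetical helper, not part of the Ops Agent; the code-to-category mapping is taken directly from the table (the table leaves LogPingOpsAgent without a category, so "Informational" is an assumption here):

```python
# Hypothetical lookup derived from the health-check table above.
# Codes ending in "Err" indicate errors; other codes are informational.
HEALTH_CHECK_CATEGORIES = {
    "DLApiConnErr": "Connectivity",
    "FbMetricsPortErr": "Port availability",
    "HcFailureErr": "Generic",
    "LogApiConnErr": "Connectivity",
    "LogApiDisabledErr": "API",
    "LogApiPermissionErr": "Permission",
    "LogApiScopeErr": "Permission",
    "LogApiUnauthenticatedErr": "API",
    "LogPingOpsAgent": "Informational",  # no category in the table; assumed label
    "LogParseErr": "Runtime",
    "LogPipeLineErr": "Runtime",
    "MetaApiConnErr": "Connectivity",
    "MonApiConnErr": "Connectivity",
    "MonApiDisabledErr": "API",
    "MonApiPermissionErr": "Permission",
    "MonApiScopeErr": "Permission",
    "MonApiUnauthenticatedErr": "API",
    "OtelMetricsPortErr": "Port availability",
    "PacApiConnErr": "Connectivity",
}

def is_error_code(code: str) -> bool:
    """A code represents an error if and only if it ends with "Err"."""
    return code.endswith("Err")
```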
Find start-time errors
Starting with version 2.35.0, health-check information is written to the
ops-agent-health log by the Cloud Logging API (versions 2.33.0 and 2.34.0
use ops-agent-health-checks).
The same information is also written to a health-checks.log file as follows:
- Linux: /var/log/google-cloud-ops-agent/health-checks.log
- Windows: C:\ProgramData\Google\Cloud Operations\Ops Agent\log\health-checks.log
You can also view any health-check messages by querying the
status of the Ops Agent service:
- Linux: sudo systemctl status google-cloud-ops-agent"*"
- Windows: Get-Service google-cloud-ops-agent*
After you resolve any problems, you must restart the agent.
The health checks are run when the agent starts, so to re-run the
checks, you must restart the agent.
Find runtime errors
The runtime health checks are reported to both Cloud Logging
and Error Reporting.
If the agent failed to start but was able to report errors before failing,
you might also see start-time errors reported.
To view runtime errors from the Ops Agent in Logging, do the
following:
- In the Google Cloud console, go to the Logs Explorer page. If you use the
  search bar to find this page, then select the result whose subheading is
  Logging.
- Enter the following query and click Run query:
log_id("ops-agent-health")
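If you query these logs programmatically, the same filter can be assembled in code. The helper below is a hypothetical sketch that builds the Logging filter shown above, optionally narrowed to one VM or one health-check code; the field names match the queries used elsewhere on this page:

```python
def ops_agent_health_filter(instance_id=None, code=None):
    """Build a Cloud Logging filter for the ops-agent-health log.

    instance_id and code are optional narrowing clauses.
    """
    clauses = ['log_id("ops-agent-health")']
    if instance_id is not None:
        clauses.append('resource.type="gce_instance"')
        clauses.append(f'resource.labels.instance_id="{instance_id}"')
    if code is not None:
        clauses.append(f'jsonPayload.code="{code}"')
    # Cloud Logging treats newline-separated clauses as an AND conjunction.
    return "\n".join(clauses)
```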
To view runtime errors from the Ops Agent in Error Reporting,
do the following:
- In the Google Cloud console, go to the Error Reporting page. You can also
  find this page by using the search bar.
- To see errors from the Ops Agent, filter the errors for Ops Agent.
Verify successful log collection
Version 2.46.0 of the Ops Agent introduced the informational
LogPingOpsAgent health check. This check writes an informational message to
the ops-agent-health log every 10 minutes.
You can use the presence of these messages to verify that the Ops Agent is
writing logs by doing any of the following:
- Search for the messages by using the Logs Explorer.
- View the log_entry_count metric for the VM.
- Create an alerting policy for the log_entry_count metric.
If any of these options indicates that the log messages are not being
ingested, then the agent might have encountered a problem; for
troubleshooting information, see Troubleshoot the Ops Agent.
To check the status of the Ops Agent on a specific VM, you need the
instance ID of the VM. To find the instance ID, do the following:
- In the Google Cloud console, go to the VM instances page. If you use the
  search bar to find this page, then select the result whose subheading is
  Compute Engine.
- Click the name of a VM instance.
- On the Details tab, locate the Basic information section.
The instance ID appears as a numeric string. Use this string for the
INSTANCE_ID value in the subsequent sections.
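From inside the VM, the instance ID is also available from the GCE metadata server, which requires the Metadata-Flavor: Google header on every request. The following sketch uses only the Python standard library and works only when run on a Compute Engine VM:

```python
import urllib.request

# Well-known metadata-server endpoint that returns the numeric instance ID.
METADATA_URL = "http://metadata.google.internal/computeMetadata/v1/instance/id"
# The metadata server rejects requests that lack this header.
METADATA_HEADERS = {"Metadata-Flavor": "Google"}

def get_instance_id(timeout=5):
    """Return this VM's instance ID as a string (only works on GCE)."""
    req = urllib.request.Request(METADATA_URL, headers=METADATA_HEADERS)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.read().decode("utf-8")
```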
Search for messages by using Logs Explorer
To use Logs Explorer to search the logs of a VM for the ping messages,
do the following:
- In the Google Cloud console, go to the Logs Explorer page. If you use the
  search bar to find this page, then select the result whose subheading is
  Logging.
- To look for ping messages from the Ops Agent on a specific VM instance,
  enter the following query, replace INSTANCE_ID with the identifier for a
  Compute Engine VM, and then click Run query:
resource.type="gce_instance"
resource.labels.instance_id="INSTANCE_ID"
log_id("ops-agent-health")
jsonPayload.code="LogPingOpsAgent"
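As the health-check table notes, these pings should arrive every 10 minutes, and a gap of 20 minutes or longer suggests a problem. Given the timestamps of the matching log entries, a staleness check might look like the following (a hypothetical helper, not part of the agent):

```python
from datetime import datetime, timedelta, timezone

PING_INTERVAL = timedelta(minutes=10)  # expected cadence of LogPingOpsAgent
STALE_AFTER = timedelta(minutes=20)    # threshold suggested by the table above

def pings_are_stale(ping_times, now=None):
    """Return True if the newest ping is 20 minutes old or older, or absent."""
    if now is None:
        now = datetime.now(timezone.utc)
    if not ping_times:
        return True
    return now - max(ping_times) >= STALE_AFTER
```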
View the log_entry_count metric
To use Metrics Explorer to check the value of the log_entry_count metric
for a VM, do the following:
- In the Google Cloud console, go to the Metrics explorer page. If you use
  the search bar to find this page, then select the result whose subheading
  is Monitoring.
- In the Select a metric field, do the following:
  - Enter log entries.
  - For the Resource type, select VM Instance.
  - For the Metric category, select Logs-based metrics.
  - For the Metric, select Log entries.
  - Select Apply.
- In the Filter field, add the following filters:
  - Filter for a specific VM's instance ID:
    - Select the resource label instance_id.
    - Select the comparator = (equals).
    - Enter the INSTANCE_ID of a VM.
  - Filter for the ops-agent-health log:
    - Select the resource label log.
    - Select the comparator = (equals).
    - Select the value ops-agent-health.
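The UI selections above correspond approximately to the following Monitoring filter. This fragment is a sketch for orientation only; in the underlying log_entry_count metric, the log name is exposed as a metric label, so the exact label placement in your environment may differ:

```
metric.type="logging.googleapis.com/log_entry_count"
resource.type="gce_instance"
resource.labels.instance_id="INSTANCE_ID"
metric.labels.log="ops-agent-health"
```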
Create an alerting policy for the log_entry_count metric
To create an alerting policy that monitors the value of the
log_entry_count metric for log pings from a specific VM, do the following:
- In the Google Cloud console, go to the Alerting page. If you use the
  search bar to find this page, then select the result whose subheading is
  Monitoring.
- If you haven't created your notification channels and if you want to be
  notified, then click Edit Notification Channels and add your notification
  channels. Return to the Alerting page after you add your channels.
- From the Alerting page, select Create policy.
- In the Select a metric field, do the following:
  - Enter log entries.
  - For the Resource type, select VM Instance.
  - For the Metric category, select Logs-based metrics.
  - For the Metric, select Log entries.
  - Select Apply.
- In the Filter field, add the following filters:
  - Filter for a specific VM's instance ID:
    - Select the resource label instance_id.
    - Select the comparator = (equals).
    - Enter the INSTANCE_ID of a VM.
  - Filter for the ops-agent-health log:
    - Select the resource label log.
    - Select the comparator = (equals).
    - Select the value ops-agent-health.
- In the Transform data section, select the following:
  - For the Rolling window field, select 10 min. To detect missing log
    entries over a longer period, enter a larger value.
  - For the Rolling window function field, select delta.
- Click Next.
- The settings in the Configure alert trigger page determine when the alert
  is triggered. Complete this page with the settings in the following table.

| Field | Value |
|---|---|
| Condition type | Threshold |
| Alert trigger | Any time series violates |
| Threshold position | Below threshold |
| Threshold value | 1 |
| Advanced Options: Retest window | No retest |
- Click Next.
- Optional: To add notifications to your alerting policy, click
  Notification channels. In the dialog, select one or more notification
  channels from the menu, and then click OK.
- Optional: Update the Incident autoclose duration. This field determines
  when Monitoring closes incidents in the absence of metric data.
- Optional: Click Documentation, and then add any information that you want
  included in a notification message.
- Click Alert name and enter a name for the alerting policy.
- Click Create Policy.
For more information, see Alerting policies.
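The same policy can also be expressed in the Monitoring API's AlertPolicy schema, for example for use with gcloud alpha monitoring policies create --policy-from-file. The following JSON is a hedged sketch of the settings chosen above; the display names and INSTANCE_ID are placeholders, and you should verify the filter against your project before use:

```json
{
  "displayName": "Ops Agent log pings missing",
  "combiner": "OR",
  "conditions": [
    {
      "displayName": "log_entry_count below 1 for ops-agent-health",
      "conditionThreshold": {
        "filter": "metric.type=\"logging.googleapis.com/log_entry_count\" resource.type=\"gce_instance\" resource.labels.instance_id=\"INSTANCE_ID\" metric.labels.log=\"ops-agent-health\"",
        "comparison": "COMPARISON_LT",
        "thresholdValue": 1,
        "duration": "0s",
        "aggregations": [
          {
            "alignmentPeriod": "600s",
            "perSeriesAligner": "ALIGN_DELTA"
          }
        ]
      }
    }
  ]
}
```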
Agent diagnostics tool for VMs
The agent diagnostics tool gathers critical local debugging information from
your VMs for all of the following agents: the Ops Agent, the legacy
Logging agent, and the legacy Monitoring agent. The debugging information
includes project info, VM info, agent configuration, agent logs, and agent
service status, information that typically requires manual work to gather.
The tool also checks the local VM environment to verify that it meets
certain requirements for the agents to function properly, for example,
network connectivity and required permissions.
When filing a customer case for an agent on a VM, run the agent
diagnostics tool and attach the collected information to the case.
Providing this information reduces the time needed to troubleshoot your
support case. Before you attach the information to the support case,
redact any sensitive information like passwords.
The agent diagnostics tool must be run from inside the VM, so you will
typically need to SSH into the VM first. The following command retrieves the
agent diagnostics tool and executes it:
Linux
curl -sSO https://dl.google.com/cloudagents/diagnose-agents.sh
sudo bash diagnose-agents.sh
Windows
(New-Object Net.WebClient).DownloadFile("https://dl.google.com/cloudagents/diagnose-agents.ps1", "${env:UserProfile}\diagnose-agents.ps1")
Invoke-Expression "${env:UserProfile}\diagnose-agents.ps1"
Follow the output of the script execution to locate the files that include
the collected info. Typically you can find them in the
/var/tmp/google-agents directory on Linux and in the
$env:LOCALAPPDATA/Temp directory on Windows, unless you have customized the
output directory when running the script.
For detailed information, examine the diagnose-agents.sh script on Linux or
the diagnose-agents.ps1 script on Windows.
If an attempt to install the Ops Agent by using an Ops Agent OS policy
fails, you can use the diagnostics script described in this section for
debugging. For example, you might see one of the following cases:
- The Ops Agent installation fails when you use the Install Ops Agent for
  Monitoring and Logging checkbox to install the Ops Agent during VM
  creation.
- The agent status on the Cloud Monitoring VM instances dashboard or the
  Observability tab on a Compute Engine VM details page stays in the
  Pending state for more than 10 minutes.
A prolonged Pending status might indicate one of the following:
- A problem applying the policy.
- A problem in the actual installation of the Ops Agent.
- A connectivity problem between the VM and Cloud Monitoring.
For some of these issues, the general agent-diagnostics script and health
checks might also be helpful.
To run the policy-diagnostics script, run the following commands:
curl -sSO https://dl.google.com/cloudagents/diagnose-ui-policies.sh
bash diagnose-ui-policies.sh VM_NAME VM_ZONE
This script shows information about affected VMs and related automatic
installation policies.
Agent status
You can check the status of the Ops Agent processes on the VM to determine
if the agent is running or not.
Linux
To check the status of the Ops Agent, use the following command:
sudo systemctl status google-cloud-ops-agent"*"
Verify that the "Metrics Agent" and "Logging Agent" components are listed
as "active (running)", as shown in the following sample output (some lines
have been removed for brevity):
● google-cloud-ops-agent.service - Google Cloud Ops Agent
Loaded: loaded (/lib/systemd/system/google-cloud-ops-agent.service; enabled; vendor preset: enabled)
Active: active (exited) since Wed 2023-05-03 21:22:28 UTC; 4 weeks 0 days ago
Process: 3353828 ExecStartPre=/opt/google-cloud-ops-agent/libexec/google_cloud_ops_agent_engine -in /etc/go>
Process: 3353837 ExecStart=/bin/true (code=exited, status=0/SUCCESS)
Main PID: 3353837 (code=exited, status=0/SUCCESS)
CPU: 195ms
[...]
● google-cloud-ops-agent-opentelemetry-collector.service - Google Cloud Ops Agent - Metrics Agent
     Loaded: loaded (/lib/systemd/system/google-cloud-ops-agent-opentelemetry-collector.service; static)
     Active: active (running) since Wed 2023-05-03 21:22:29 UTC; 4 weeks 0 days ago
Process: 3353840 ExecStartPre=/opt/google-cloud-ops-agent/libexec/google_cloud_ops_agent_engine -service=ot>
Main PID: 3353855 (otelopscol)
Tasks: 9 (limit: 2355)
Memory: 65.3M
CPU: 40min 31.555s
CGroup: /system.slice/google-cloud-ops-agent-opentelemetry-collector.service
└─3353855 /opt/google-cloud-ops-agent/subagents/opentelemetry-collector/otelopscol --config=/run/g>
[...]
● google-cloud-ops-agent-fluent-bit.service - Google Cloud Ops Agent - Logging Agent
     Loaded: loaded (/lib/systemd/system/google-cloud-ops-agent-fluent-bit.service; static)
     Active: active (running) since Wed 2023-05-03 21:22:29 UTC; 4 weeks 0 days ago
Process: 3353838 ExecStartPre=/opt/google-cloud-ops-agent/libexec/google_cloud_ops_agent_engine -service=fl>
Main PID: 3353856 (google_cloud_op)
Tasks: 31 (limit: 2355)
Memory: 58.3M
CPU: 29min 6.771s
CGroup: /system.slice/google-cloud-ops-agent-fluent-bit.service
├─3353856 /opt/google-cloud-ops-agent/libexec/google_cloud_ops_agent_wrapper -config_path /etc/goo>
└─3353872 /opt/google-cloud-ops-agent/subagents/fluent-bit/bin/fluent-bit --config /run/google-clo>
[...]
● google-cloud-ops-agent-diagnostics.service - Google Cloud Ops Agent - Diagnostics
Loaded: loaded (/lib/systemd/system/google-cloud-ops-agent-diagnostics.service; disabled; vendor preset: e>
Active: active (running) since Wed 2023-05-03 21:22:26 UTC; 4 weeks 0 days ago
Main PID: 3353819 (google_cloud_op)
Tasks: 8 (limit: 2355)
Memory: 36.0M
CPU: 3min 19.488s
CGroup: /system.slice/google-cloud-ops-agent-diagnostics.service
└─3353819 /opt/google-cloud-ops-agent/libexec/google_cloud_ops_agent_diagnostics -config /etc/goog>
[...]
Windows
To check the status of the Ops Agent, use the following command:
Get-Service google-cloud-ops-agent*
Verify that the "Metrics Agent" and "Logging Agent" components are listed
as "Running", as shown in the following sample output:
Status Name DisplayName
------ ---- -----------
Running google-cloud-op... Google Cloud Ops Agent
Running google-cloud-op... Google Cloud Ops Agent - Logging Agent
Running google-cloud-op... Google Cloud Ops Agent - Metrics Agent
Running google-cloud-op... Google Cloud Ops Agent - Diagnostics
Agent self logs
If the agent fails to ingest logs to Cloud Logging, then you might have to
inspect the agent's logs locally on the VM for troubleshooting. You can
also use log rotation to manage the agent's self logs.
Linux
To inspect self logs that are written to Journald, run the following
command:
journalctl -u google-cloud-ops-agent*
To inspect the self logs that are written to the disk by the logging module, run
the following command:
vim -M /var/log/google-cloud-ops-agent/subagents/logging-module.log
Windows
To inspect self logs that are written to Windows Event Logs, run the
following command:
Get-WinEvent -FilterHashtable @{ Logname='Application'; ProviderName='google-cloud-ops-agent*' } | Format-Table -AutoSize -Wrap
To inspect the self logs that are written to the disk by the logging module, run
the following command:
notepad "C:\ProgramData\Google\Cloud Operations\Ops Agent\log\logging-module.log"
To inspect the logs from the Windows Service Control Manager for Ops Agent
services, run the following command:
Get-WinEvent -FilterHashtable @{ Logname='System'; ProviderName='Service Control Manager' } | Where-Object -Property Message -Match 'Google Cloud Ops Agent' | Format-Table -AutoSize -Wrap
View metric usage and diagnostics in Cloud Monitoring
The Cloud Monitoring Metrics Management page provides information that can
help you control the amount you spend on chargeable metrics without
affecting observability. The Metrics Management page reports the following
information:
- Ingestion volumes for both byte- and sample-based billing, across metric
domains and for individual metrics.
- Data about labels and cardinality of metrics.
- Use of metrics in alerting policies and custom dashboards.
- Rate of metric-write errors.
To view the Metrics Management page, do the following:
- In the Google Cloud console, go to the Metrics management page. If you
  use the search bar to find this page, then select the result whose
  subheading is Monitoring.
- In the toolbar, select your time window. By default, the
  Metrics Management page displays information about the metrics collected
  during the previous day.
For more information about the Metrics Management page, see
View and manage metric usage.