This document describes sources of diagnostic information that you can use to
identify problems in the installation or running of the Ops Agent.
Agent health checks
Version 2.25.1 introduced start-time health checks
for the Ops Agent.
When the Ops Agent starts, it performs a series of checks for conditions
that prevent the agent from running correctly. If the agent detects
one of the conditions, it logs a message describing the problem.
The Ops Agent checks for the following:
- Connectivity problems
- Availability of ports used by the agent to report metrics about itself
- Permission problems
- Availability of the APIs used by the agent to write logs or metrics
- A problem in the health-check routine itself.
For information about locating start-time errors, see Find start-time errors.
Version 2.37.0 introduced runtime health checks
for the Ops Agent.
These errors are reported to Cloud Logging and Error Reporting.
For information about locating runtime errors, see Find runtime errors.
Version 2.46.0 introduced the informational LogPingOpsAgent code. This code
does not represent an error. For more information, see
Verify successful log collection.
The following table lists each health-check code in alphabetical order and
describes what each code means. Codes that end with the string Err indicate
errors; other codes are informational.

| Health-check code | Category | Meaning | Suggestion |
|---|---|---|---|
| DLApiConnErr | Connectivity | Request to the downloads subdomain, dl.google.com, failed. | Check your internet connection and firewall rules. For more information, see Network-connectivity issues. |
| FbMetricsPortErr | Port availability | Port 20202, needed for Ops Agent self metrics, is unavailable. | Verify that port 20202 is open. For more information, see Required port is unavailable. |
| HcFailureErr | Generic | The Ops Agent health-check routine encountered an internal error. | Submit a support case from the Google Cloud console. For more information, see Getting support. |
| LogApiConnErr | Connectivity | Request to the Logging API failed. | Check your internet connection and firewall rules. For more information, see Network-connectivity issues. |
| LogApiDisabledErr | API | The Logging API is disabled in the current Google Cloud project. | Enable the Logging API. |
| LogApiPermissionErr | Permission | Service account is missing the Logs Writer role (roles/logging.logWriter). | Grant the Logs Writer role to the service account. For more information, see Agent lacks API permissions. |
| LogApiScopeErr | Permission | The VM is missing the https://www.googleapis.com/auth/logging.write access scope. | Add the https://www.googleapis.com/auth/logging.write scope to the VM. For more information, see Verify your access scopes. |
| LogApiUnauthenticatedErr | API | The current VM couldn't authenticate to the Logging API. | Verify that your credential files, VM access scopes, and permissions are set up correctly. For more information, see Authorize the Ops Agent. |
| LogPingOpsAgent | | An informational payload message written every 10 minutes to the ops-agent-health log. You can use the resulting log entries to verify that the agent is sending logs. This message is not an error. | This message is expected to appear every 10 minutes. If the message does not appear for 20 minutes or longer, then the agent might have encountered a problem. For troubleshooting information, see Troubleshoot the Ops Agent. |
| LogParseErr | Runtime | The Ops Agent was unable to parse one or more logs. | Check the configuration of any logging processors you've created. For more information, see Log-parsing errors. |
| LogPipeLineErr | Runtime | The Ops Agent's logging pipeline failed. | Verify that the agent has access to the buffer files, check for a full disk, and verify that the Ops Agent configuration is correct. For more information, see Pipeline errors. |
| MetaApiConnErr | Connectivity | Request to the GCE metadata server, which is used to query VM access scopes, OAuth tokens, and resource labels, failed. | Check your internet connection and firewall rules. For more information, see Network-connectivity issues. |
| MonApiConnErr | Connectivity | A request to the Monitoring API failed. | Check your internet connection and firewall rules. For more information, see Network-connectivity issues. |
| MonApiDisabledErr | API | The Monitoring API is disabled in the current Google Cloud project. | Enable the Monitoring API. |
| MonApiPermissionErr | Permission | Service account is missing the Monitoring Metric Writer role (roles/monitoring.metricWriter). | Grant the Monitoring Metric Writer role to the service account. For more information, see Agent lacks API permissions. |
| MonApiScopeErr | Permission | The VM is missing the https://www.googleapis.com/auth/monitoring.write access scope. | Add the https://www.googleapis.com/auth/monitoring.write scope to the VM. For more information, see Verify your access scopes. |
| MonApiUnauthenticatedErr | API | The current VM couldn't authenticate to the Monitoring API. | Verify that your credential files, VM access scopes, and permissions are set up correctly. For more information, see Authorize the Ops Agent. |
| OtelMetricsPortErr | Port availability | Port 20201, needed for Ops Agent self metrics, is unavailable. | Verify that port 20201 is open. For more information, see A required port is unavailable. |
| PacApiConnErr | Connectivity | This health-check code is unreliable and is disabled in Ops Agent version 2.46.1. | Update to Ops Agent version 2.46.1 or later. |
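For scripts that post-process the ops-agent-health log, the table above can be encoded as a small lookup. The following is a hypothetical helper, not part of the Ops Agent; the code-to-category mapping is taken directly from the table (the table leaves LogPingOpsAgent without a category, so "Informational" is an assumption here):

```python
# Hypothetical lookup derived from the health-check table above.
# Codes ending in "Err" indicate errors; other codes are informational.
HEALTH_CHECK_CATEGORIES = {
    "DLApiConnErr": "Connectivity",
    "FbMetricsPortErr": "Port availability",
    "HcFailureErr": "Generic",
    "LogApiConnErr": "Connectivity",
    "LogApiDisabledErr": "API",
    "LogApiPermissionErr": "Permission",
    "LogApiScopeErr": "Permission",
    "LogApiUnauthenticatedErr": "API",
    "LogPingOpsAgent": "Informational",  # no category in the table; assumed label
    "LogParseErr": "Runtime",
    "LogPipeLineErr": "Runtime",
    "MetaApiConnErr": "Connectivity",
    "MonApiConnErr": "Connectivity",
    "MonApiDisabledErr": "API",
    "MonApiPermissionErr": "Permission",
    "MonApiScopeErr": "Permission",
    "MonApiUnauthenticatedErr": "API",
    "OtelMetricsPortErr": "Port availability",
    "PacApiConnErr": "Connectivity",
}

def is_error_code(code: str) -> bool:
    """A code represents an error if and only if it ends with "Err"."""
    return code.endswith("Err")
```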
Find start-time errors
Starting with version 2.35.0, health-check information is written to the
ops-agent-health log by the Cloud Logging API (versions 2.33.0 and 2.34.0
use ops-agent-health-checks).
The same information is also written to a health-checks.log file as follows:
- Linux: /var/log/google-cloud-ops-agent/health-checks.log
- Windows: C:\ProgramData\Google\Cloud Operations\Ops Agent\log\health-checks.log
You can also view any health-check messages by querying the
status of the Ops Agent service:
- Linux: sudo systemctl status google-cloud-ops-agent"*"
- Windows: Get-Service google-cloud-ops-agent*
After you resolve any problems, you must restart the agent.
The health checks are run when the agent starts, so to re-run the
checks, you must restart the agent.
Find runtime errors
The runtime health checks are reported to both Cloud Logging
and Error Reporting.
If the agent failed to start but was able to report errors before failing,
you might also see start-time errors reported.
To view runtime errors from the Ops Agent in Logging, do the
following:
- In the Google Cloud console, go to the Logs Explorer page. If you use the
  search bar to find this page, then select the result whose subheading is
  Logging.
- Enter the following query and click Run query:
log_id("ops-agent-health")
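If you query these logs programmatically, the same filter can be assembled in code. The helper below is a hypothetical sketch that builds the Logging filter shown above, optionally narrowed to one VM or one health-check code; the field names match the queries used elsewhere on this page:

```python
def ops_agent_health_filter(instance_id=None, code=None):
    """Build a Cloud Logging filter for the ops-agent-health log.

    instance_id and code are optional narrowing clauses.
    """
    clauses = ['log_id("ops-agent-health")']
    if instance_id is not None:
        clauses.append('resource.type="gce_instance"')
        clauses.append(f'resource.labels.instance_id="{instance_id}"')
    if code is not None:
        clauses.append(f'jsonPayload.code="{code}"')
    # Cloud Logging treats newline-separated clauses as an AND conjunction.
    return "\n".join(clauses)
```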
To view runtime errors from the Ops Agent in Error Reporting,
do the following:
- In the Google Cloud console, go to the Error Reporting page. You can also
  find this page by using the search bar.
- To see errors from the Ops Agent, filter the errors for Ops Agent.
Verify successful log collection
Version 2.46.0 of the Ops Agent introduced the informational
LogPingOpsAgent health check. This check writes an informational message to
the ops-agent-health log every 10 minutes.
You can use the presence of these messages to verify that the Ops Agent is
writing logs by doing any of the following:
- Search for the messages by using the Logs Explorer.
- View the log_entry_count metric for the VM.
- Create an alerting policy for the log_entry_count metric.
If any of these options indicates that the log messages are not being
ingested, then the agent might have encountered a problem; for
troubleshooting information, see Troubleshoot the Ops Agent.
To check the status of the Ops Agent on a specific VM, you need the
instance ID of the VM. To find the instance ID, do the following:
- In the Google Cloud console, go to the VM instances page. If you use the
  search bar to find this page, then select the result whose subheading is
  Compute Engine.
- Click the name of a VM instance.
- On the Details tab, locate the Basic information section.
The instance ID appears as a numeric string. Use this string for the
INSTANCE_ID value in the subsequent sections.
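From inside the VM, the instance ID is also available from the GCE metadata server, which requires the Metadata-Flavor: Google header on every request. The following sketch uses only the Python standard library and works only when run on a Compute Engine VM:

```python
import urllib.request

# Well-known metadata-server endpoint that returns the numeric instance ID.
METADATA_URL = "http://metadata.google.internal/computeMetadata/v1/instance/id"
# The metadata server rejects requests that lack this header.
METADATA_HEADERS = {"Metadata-Flavor": "Google"}

def get_instance_id(timeout=5):
    """Return this VM's instance ID as a string (only works on GCE)."""
    req = urllib.request.Request(METADATA_URL, headers=METADATA_HEADERS)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.read().decode("utf-8")
```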
Search for messages by using Logs Explorer
To use Logs Explorer to search the logs of a VM for the ping messages,
do the following:
- In the Google Cloud console, go to the Logs Explorer page. If you use the
  search bar to find this page, then select the result whose subheading is
  Logging.
- To look for ping messages from the Ops Agent on a specific VM instance,
  enter the following query, replace INSTANCE_ID with the identifier for a
  Compute Engine VM, and then click Run query:
resource.type="gce_instance"
resource.labels.instance_id="INSTANCE_ID"
log_id("ops-agent-health")
jsonPayload.code="LogPingOpsAgent"
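As the health-check table notes, these pings should arrive every 10 minutes, and a gap of 20 minutes or longer suggests a problem. Given the timestamps of the matching log entries, a staleness check might look like the following (a hypothetical helper, not part of the agent):

```python
from datetime import datetime, timedelta, timezone

PING_INTERVAL = timedelta(minutes=10)  # expected cadence of LogPingOpsAgent
STALE_AFTER = timedelta(minutes=20)    # threshold suggested by the table above

def pings_are_stale(ping_times, now=None):
    """Return True if the newest ping is 20 minutes old or older, or absent."""
    if now is None:
        now = datetime.now(timezone.utc)
    if not ping_times:
        return True
    return now - max(ping_times) >= STALE_AFTER
```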
View the log_entry_count metric
To use Metrics Explorer to check the value of the log_entry_count metric
for a VM, do the following:
- In the Google Cloud console, go to the Metrics explorer page. If you use
  the search bar to find this page, then select the result whose subheading
  is Monitoring.
- In the Select a metric field, do the following:
  - Enter log entries.
  - For the Resource type, select VM Instance.
  - For the Metric category, select Logs-based metrics.
  - For the Metric, select Log entries.
  - Select Apply.
- In the Filter field, add the following filters:
  - Filter for a specific VM's instance ID:
    - Select the resource label instance_id.
    - Select the comparator = (equals).
    - Enter the INSTANCE_ID of a VM.
  - Filter for the ops-agent-health log:
    - Select the resource label log.
    - Select the comparator = (equals).
    - Select the value ops-agent-health.
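The UI selections above correspond approximately to the following Monitoring filter. This fragment is a sketch for orientation only; in the underlying log_entry_count metric, the log name is exposed as a metric label, so the exact label placement in your environment may differ:

```
metric.type="logging.googleapis.com/log_entry_count"
resource.type="gce_instance"
resource.labels.instance_id="INSTANCE_ID"
metric.labels.log="ops-agent-health"
```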
Create an alerting policy for the log_entry_count metric
To create an alerting policy that monitors the value of the
log_entry_count metric for log pings from a specific VM, do the following:
- In the Google Cloud console, go to the Alerting page. If you use the
  search bar to find this page, then select the result whose subheading is
  Monitoring.
- If you haven't created your notification channels and if you want to be
  notified, then click Edit Notification Channels and add your notification
  channels. Return to the Alerting page after you add your channels.
- From the Alerting page, select Create policy.
- In the Select a metric field, do the following:
  - Enter log entries.
  - For the Resource type, select VM Instance.
  - For the Metric category, select Logs-based metrics.
  - For the Metric, select Log entries.
  - Select Apply.
- In the Filter field, add the following filters:
  - Filter for a specific VM's instance ID:
    - Select the resource label instance_id.
    - Select the comparator = (equals).
    - Enter the INSTANCE_ID of a VM.
  - Filter for the ops-agent-health log:
    - Select the resource label log.
    - Select the comparator = (equals).
    - Select the value ops-agent-health.
- In the Transform data section, select the following:
  - For the Rolling window field, select 10 min. To detect missing log
    entries over a longer period, enter a larger value.
  - For the Rolling window function field, select delta.
- Click Next.
- The settings in the Configure alert trigger page determine when the alert
  is triggered. Complete this page with the settings in the following table.

| Field | Value |
|---|---|
| Condition type | Threshold |
| Alert trigger | Any time series violates |
| Threshold position | Below threshold |
| Threshold value | 1 |
| Advanced Options: Retest window | No retest |
- Click Next.
- Optional: To add notifications to your alerting policy, click
  Notification channels. In the dialog, select one or more notification
  channels from the menu, and then click OK.
- Optional: Update the Incident autoclose duration. This field determines
  when Monitoring closes incidents in the absence of metric data.
- Optional: Click Documentation, and then add any information that you want
  included in a notification message.
- Click Alert name and enter a name for the alerting policy.
- Click Create Policy.
For more information, see Alerting policies.
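The same policy can also be expressed in the Monitoring API's AlertPolicy schema, for example for use with gcloud alpha monitoring policies create --policy-from-file. The following JSON is a hedged sketch of the settings chosen above; the display names and INSTANCE_ID are placeholders, and you should verify the filter against your project before use:

```json
{
  "displayName": "Ops Agent log pings missing",
  "combiner": "OR",
  "conditions": [
    {
      "displayName": "log_entry_count below 1 for ops-agent-health",
      "conditionThreshold": {
        "filter": "metric.type=\"logging.googleapis.com/log_entry_count\" resource.type=\"gce_instance\" resource.labels.instance_id=\"INSTANCE_ID\" metric.labels.log=\"ops-agent-health\"",
        "comparison": "COMPARISON_LT",
        "thresholdValue": 1,
        "duration": "0s",
        "aggregations": [
          {
            "alignmentPeriod": "600s",
            "perSeriesAligner": "ALIGN_DELTA"
          }
        ]
      }
    }
  ]
}
```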
Agent diagnostics tool for VMs
The agent diagnostics tool gathers critical local debugging information from
your VMs for all of the following agents: the Ops Agent, the legacy
Logging agent, and the legacy Monitoring agent. The debugging information
includes project info, VM info, agent configuration, agent logs, and agent
service status, information that typically requires manual work to gather.
The tool also checks the local VM environment to verify that it meets
certain requirements for the agents to function properly, for example,
network connectivity and required permissions.
When filing a customer case for an agent on a VM, run the agent
diagnostics tool and attach the collected information to the case.
Providing this information reduces the time needed to troubleshoot your
support case. Before you attach the information to the support case,
redact any sensitive information like passwords.
The agent diagnostics tool must be run from inside the VM, so you will
typically need to SSH into the VM first. The following command retrieves the
agent diagnostics tool and executes it:
Linux
curl -sSO https://dl.google.com/cloudagents/diagnose-agents.sh
sudo bash diagnose-agents.sh
Windows
(New-Object Net.WebClient).DownloadFile("https://dl.google.com/cloudagents/diagnose-agents.ps1", "${env:UserProfile}\diagnose-agents.ps1")
Invoke-Expression "${env:UserProfile}\diagnose-agents.ps1"
Follow the output of the script execution to locate the files that include
the collected info. Typically you can find them in the
/var/tmp/google-agents directory on Linux and in the
$env:LOCALAPPDATA/Temp directory on Windows, unless you have customized the
output directory when running the script.
For detailed information, examine the diagnose-agents.sh script on Linux or
the diagnose-agents.ps1 script on Windows.
If an attempt to install the Ops Agent by using an Ops Agent OS policy
fails, you can use the diagnostics script described in this section for
debugging. For example, you might see one of the following cases:
- The Ops Agent installation fails when you use the Install Ops Agent for
  Monitoring and Logging checkbox to install the Ops Agent during VM
  creation.
- The agent status on the Cloud Monitoring VM instances dashboard or the
  Observability tab on a Compute Engine VM details page stays in the
  Pending state for more than 10 minutes.
A prolonged Pending status might indicate one of the following:
- A problem applying the policy.
- A problem in the actual installation of the Ops Agent.
- A connectivity problem between the VM and Cloud Monitoring.
For some of these issues, the general agent-diagnostics script and health
checks might also be helpful.
To run the policy-diagnostics script, run the following commands:
curl -sSO https://dl.google.com/cloudagents/diagnose-ui-policies.sh
bash diagnose-ui-policies.sh VM_NAME VM_ZONE
This script shows information about affected VMs and related automatic
installation policies.
Agent status
You can check the status of the Ops Agent processes on the VM to determine
if the agent is running or not.
Linux
To check the status of the Ops Agent, use the following command:
sudo systemctl status google-cloud-ops-agent"*"
Verify that the "Metrics Agent" and "Logging Agent" components are listed
as "active (running)", as shown in the following sample output (some lines
have been removed for brevity):
● google-cloud-ops-agent.service - Google Cloud Ops Agent
Loaded: loaded (/lib/systemd/system/google-cloud-ops-agent.service; enabled; vendor preset: enabled)
Active: active (exited) since Wed 2023-05-03 21:22:28 UTC; 4 weeks 0 days ago
Process: 3353828 ExecStartPre=/opt/google-cloud-ops-agent/libexec/google_cloud_ops_agent_engine -in /etc/go>
Process: 3353837 ExecStart=/bin/true (code=exited, status=0/SUCCESS)
Main PID: 3353837 (code=exited, status=0/SUCCESS)
CPU: 195ms
[...]
● google-cloud-ops-agent-opentelemetry-collector.service - Google Cloud Ops Agent - Metrics Agent
     Loaded: loaded (/lib/systemd/system/google-cloud-ops-agent-opentelemetry-collector.service; static)
     Active: active (running) since Wed 2023-05-03 21:22:29 UTC; 4 weeks 0 days ago
Process: 3353840 ExecStartPre=/opt/google-cloud-ops-agent/libexec/google_cloud_ops_agent_engine -service=ot>
Main PID: 3353855 (otelopscol)
Tasks: 9 (limit: 2355)
Memory: 65.3M
CPU: 40min 31.555s
CGroup: /system.slice/google-cloud-ops-agent-opentelemetry-collector.service
└─3353855 /opt/google-cloud-ops-agent/subagents/opentelemetry-collector/otelopscol --config=/run/g>
[...]
● google-cloud-ops-agent-fluent-bit.service - Google Cloud Ops Agent - Logging Agent
     Loaded: loaded (/lib/systemd/system/google-cloud-ops-agent-fluent-bit.service; static)
     Active: active (running) since Wed 2023-05-03 21:22:29 UTC; 4 weeks 0 days ago
Process: 3353838 ExecStartPre=/opt/google-cloud-ops-agent/libexec/google_cloud_ops_agent_engine -service=fl>
Main PID: 3353856 (google_cloud_op)
Tasks: 31 (limit: 2355)
Memory: 58.3M
CPU: 29min 6.771s
CGroup: /system.slice/google-cloud-ops-agent-fluent-bit.service
├─3353856 /opt/google-cloud-ops-agent/libexec/google_cloud_ops_agent_wrapper -config_path /etc/goo>
└─3353872 /opt/google-cloud-ops-agent/subagents/fluent-bit/bin/fluent-bit --config /run/google-clo>
[...]
● google-cloud-ops-agent-diagnostics.service - Google Cloud Ops Agent - Diagnostics
Loaded: loaded (/lib/systemd/system/google-cloud-ops-agent-diagnostics.service; disabled; vendor preset: e>
Active: active (running) since Wed 2023-05-03 21:22:26 UTC; 4 weeks 0 days ago
Main PID: 3353819 (google_cloud_op)
Tasks: 8 (limit: 2355)
Memory: 36.0M
CPU: 3min 19.488s
CGroup: /system.slice/google-cloud-ops-agent-diagnostics.service
└─3353819 /opt/google-cloud-ops-agent/libexec/google_cloud_ops_agent_diagnostics -config /etc/goog>
[...]
Windows
To check the status of the Ops Agent, use the following command:
Get-Service google-cloud-ops-agent*
Verify that the "Metrics Agent" and "Logging Agent" components are listed
as "Running", as shown in the following sample output:
Status Name DisplayName
------ ---- -----------
Running google-cloud-op... Google Cloud Ops Agent
Running google-cloud-op... Google Cloud Ops Agent - Logging Agent
Running google-cloud-op... Google Cloud Ops Agent - Metrics Agent
Running google-cloud-op... Google Cloud Ops Agent - Diagnostics
Agent self logs
If the agent fails to ingest logs to Cloud Logging, then you might have to
inspect the agent's logs locally on the VM for troubleshooting. You can
also use log rotation to manage the agent's self logs.
Linux
To inspect self logs that are written to Journald, run the following
command:
journalctl -u google-cloud-ops-agent*
To inspect the self logs that are written to the disk by the logging module, run
the following command:
vim -M /var/log/google-cloud-ops-agent/subagents/logging-module.log
Windows
To inspect self logs that are written to Windows Event Logs, run the
following command:
Get-WinEvent -FilterHashtable @{ Logname='Application'; ProviderName='google-cloud-ops-agent*' } | Format-Table -AutoSize -Wrap
To inspect the self logs that are written to the disk by the logging module, run
the following command:
notepad "C:\ProgramData\Google\Cloud Operations\Ops Agent\log\logging-module.log"
To inspect the logs from the Windows Service Control Manager for Ops Agent
services, run the following command:
Get-WinEvent -FilterHashtable @{ Logname='System'; ProviderName='Service Control Manager' } | Where-Object -Property Message -Match 'Google Cloud Ops Agent' | Format-Table -AutoSize -Wrap
View metric usage and diagnostics in Cloud Monitoring
The Cloud Monitoring Metrics Management page provides information that can
help you control the amount you spend on chargeable metrics without
affecting observability. The Metrics Management page reports the following
information:
- Ingestion volumes for both byte- and sample-based billing, across metric
domains and for individual metrics.
- Data about labels and cardinality of metrics.
- Use of metrics in alerting policies and custom dashboards.
- Rate of metric-write errors.
To view the Metrics Management page, do the following:
- In the Google Cloud console, go to the Metrics management page. If you
  use the search bar to find this page, then select the result whose
  subheading is Monitoring.
- In the toolbar, select your time window. By default, the
  Metrics Management page displays information about the metrics collected
  during the previous day.
For more information about the Metrics Management page, see
View and manage metric usage.