Troubleshooting Unhealthy State

Created by Cortney Roberson, Modified on Wed, 09 Nov 2022 at 04:31 PM by Cortney Roberson

This page contains troubleshooting instructions for errors you might encounter. If you can't find the answer to your question, check the Gremlin knowledge base for additional information.


Gremlin Agent

Unhealthy State

Check that you are only running attacks on active Gremlin Agents. It's possible to run an attack on a Gremlin Agent in an unhealthy state, but the attack may not complete. An unhealthy state indicates an issue with the installation or configuration of the Gremlin Agent. If you see a Gremlin Agent in an unhealthy state, or you are experiencing problems running attacks, such as receiving "Attack Interrupted" errors, refer to the Gremlin knowledge base for more information.

Troubleshooting attacks in an unhealthy state


There are several reasons a Gremlin Agent can lose communication with the Gremlin Control Plane. Common examples include:

  • A network-based attack that affected traffic to the Gremlin Control Plane. Ensure that both the Gremlin Control Plane endpoint and DNS are whitelisted.
  • A CPU attack that starved Gremlin of the compute it needs for API encryption. This is rare, but it does happen.

In the event of a LostCommunication error, the Gremlin Agent triggers its dead-man switch and ceases all attacks.

This can occur when you run a network attack on a host where a previous network attack was halted mid-attack by the user, the system, or another tool, which did not allow Gremlin to run garbage collection.

To solve, run gremlin rollback.

Failed to parse execution attribute ‘pid’ for execution < HASH_STRING >

There are two non-exclusive modes of failure that can occur with this error message:

  • The running version of Gremlin is several versions out of date
    • Update the Gremlin Agent or Docker image
  • /var/lib/gremlin/executions has become corrupt
    • Delete the file /var/lib/gremlin/executions
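
The deletion step can be sketched as the shell function below. It moves the file aside instead of removing it outright, which is our own assumption (the backup lets you inspect the corrupt file later); the path no longer exists afterwards, which is what the fix requires. The function name reset_executions and the .bak.<timestamp> suffix are hypothetical; only the default path comes from the text above.

```shell
# Move a possibly corrupt Gremlin executions file out of the way.
# The default path comes from the article; the function name and the
# ".bak.<timestamp>" backup suffix are illustrative assumptions.
reset_executions() {
    file="${1:-/var/lib/gremlin/executions}"

    if [ ! -e "$file" ]; then
        echo "nothing to do: $file does not exist"
        return 0
    fi

    backup="$file.bak.$(date +%Y%m%d%H%M%S)"
    # mv removes the original path, so Gremlin no longer sees the file.
    mv -- "$file" "$backup" || return 1
    echo "moved $file to $backup"
}
```

Calling reset_executions with no argument targets /var/lib/gremlin/executions and will likely need root privileges on most hosts.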


Run Chao in debug mode

Chao supports the GODEBUG environment variable, which can be used to enable debug features such as verbose logging of HTTP activity. You can enable verbose HTTP logs by adding the following variable to the environment section of the Chao deployment.

NOTE: Verbose logging prints sensitive information like HTTP request and response bodies. This configuration is intended to be a troubleshooting measure only, and should be removed when no longer needed.


- name: GODEBUG
  value: http2debug=2

Run Gremlin checks

You can run Gremlin's check subcommand on Kubernetes clusters to troubleshoot common configuration or compatibility issues with the environment. The following is an example Job that you can run to get gremlin check output.


apiVersion: batch/v1
kind: Job
metadata:
  name: gremlin-check
  namespace: gremlin
  labels:
    k8s-app: gremlin
    version: v1
spec:
  template:
    metadata:
      labels:
        # Label key is illustrative
        name: gremlin-check
    spec:
      restartPolicy: Never
      containers:
        - name: gremlin
          image: gremlin/gremlin
          # You can also pass subcommands (like `proxy` to check only proxy information)
          args: [ "check" ]
          # Pass the same environment you would pass to the Gremlin DaemonSet, including secrets and proxy information
          # Variable names follow Gremlin's standard agent settings; confirm them against your DaemonSet
          env:
            - name: GREMLIN_TEAM_ID
              value: my-team-id
            - name: GREMLIN_TEAM_PRIVATE_KEY_OR_FILE
              value: file:///var/lib/gremlin/cert/gremlin.key
            - name: GREMLIN_TEAM_CERTIFICATE_OR_FILE
              value: file:///var/lib/gremlin/cert/gremlin.cert
            - name: GREMLIN_IDENTIFIER
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName
            # Example proxy configuration
            # - name: https_proxy
            #   value: http://my-proxy:3128
            # - name: SSL_CERT_FILE
            #   value: /etc/gremlin/ssl/proxy-ca.pem
          volumeMounts:
            - name: docker-sock
              mountPath: /var/run/docker.sock
            - name: gremlin-state
              mountPath: /var/lib/gremlin
            - name: gremlin-logs
              mountPath: /var/log/gremlin
            - name: gremlin-cert
              mountPath: /var/lib/gremlin/cert
              readOnly: true
            # Example proxy configuration
            # - name: proxy-ca
            #   mountPath: /etc/gremlin/ssl
      volumes:
        - name: docker-sock
          hostPath:
            path: /var/run/docker.sock
        - name: gremlin-state
          hostPath:
            path: /var/lib/gremlin
        - name: gremlin-logs
          hostPath:
            path: /var/log/gremlin
        - name: gremlin-cert
          secret:
            secretName: gremlin-secret
        # Example proxy configuration
        # - name: proxy-ca
        #   configMap:
        #     name: proxy-ca
  backoffLimit: 4

Once deployed, you can get the output of gremlin check by pulling the logs of the Pod associated with the Job:


kubectl logs --follow \
  --namespace gremlin \
  $(kubectl get pods --namespace gremlin --selector=job-name=gremlin-check --output=jsonpath='{.items[*].metadata.name}')

Example output:

https_proxy : http://proxy.local:3128
http_proxy : (unset)
SSL_CERT_FILE : /etc/gremlin/ssl/proxy-ca.pem
Service Ping : OK


Non-zero exit code (137)

Docker has killed the container via kill -9. This is often attributed to out-of-memory (OOM) issues, and is most often seen when running a memory attack. Allocating more RAM to Docker usually solves the issue.
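
Exit code 137 follows the common convention that a process killed by a signal exits with 128 plus the signal number, so 137 is 128 + 9 (SIGKILL). A minimal sketch of that arithmetic, with describe_exit_code as a hypothetical helper name:

```shell
# Decode an exit code using the 128 + signal-number convention.
# The function name is illustrative and not part of any Gremlin or Docker tool.
describe_exit_code() {
    code="$1"
    if [ "$code" -gt 128 ]; then
        sig=$((code - 128))
        # kill -l <number> prints the signal name, e.g. KILL for 9.
        echo "killed by signal $sig ($(kill -l "$sig"))"
    elif [ "$code" -eq 0 ]; then
        echo "success"
    else
        echo "exited with error code $code"
    fi
}
```

To check whether the OOM killer was responsible for a specific container, docker inspect --format '{{.State.OOMKilled}}' followed by the container name reports true or false.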

Non-zero exit code (1)

  • Unable to find local credentials file: Gremlin is not configured to point to the correct credentials file, usually located in /var/lib/gremlin. Ensure the credentials files, either certificates or API keys, exist and Gremlin has read and write access to them.

  • Permission denied (os error 13): The Gremlin container does not have proper filesystem permissions. Gremlin requires write access to /var/lib/gremlin, including the ability to create new files. Check permissions on the host, and ensure write access is passed through via Docker when running the Gremlin container.
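
One way to test the write-access requirement is to try creating a file in the directory, both on the host and inside the container. The sketch below does that; check_state_dir and the probe file name are hypothetical, and the default directory is the one named above.

```shell
# Check that a directory exists and that new files can be created in it.
# The default directory comes from the article; the function name and
# probe file name are illustrative assumptions.
check_state_dir() {
    dir="${1:-/var/lib/gremlin}"

    if [ ! -d "$dir" ]; then
        echo "missing: $dir is not a directory"
        return 1
    fi

    # Creating and removing a probe file tests the effective permission,
    # including read-only mounts that a mode check alone would miss.
    probe="$dir/.write-probe.$$"
    if ( : > "$probe" ) 2>/dev/null; then
        rm -f -- "$probe"
        echo "ok: $dir is writable"
    else
        echo "denied: cannot create files in $dir"
        return 1
    fi
}
```

If the check passes on the host but fails inside the container, the volume is probably mounted read-only or not mounted at all.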

OS Error 1

This error usually relates to missing Linux capabilities, for example: Unable to inherit one or more required capabilities: cap_net_admin, cap_net_raw

Solution: Add the missing required Linux capabilities to that Docker container.

Example: docker run -it --cap-add=NET_ADMIN --cap-add=KILL --cap-add=SYS_TIME gremlin/gremlin syscheck
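
To see which capabilities a process actually holds, the Linux kernel exposes the effective capability set as a hexadecimal bitmask on the CapEff line of /proc/<pid>/status. The sketch below tests single bits in such a mask. Bit numbers 12 (CAP_NET_ADMIN) and 13 (CAP_NET_RAW) come from the Linux capability definitions; the helper names has_cap and report_net_caps are hypothetical.

```shell
# Return success if capability bit $2 is set in the hexadecimal mask $1.
has_cap() {
    [ $(( (0x$1 >> $2) & 1 )) -eq 1 ]
}

# Report the two capabilities named in the error message for a given mask.
# Bit 12 is CAP_NET_ADMIN and bit 13 is CAP_NET_RAW.
report_net_caps() {
    for entry in 12:cap_net_admin 13:cap_net_raw; do
        if has_cap "$1" "${entry%%:*}"; then
            echo "${entry#*:}: present"
        else
            echo "${entry#*:}: missing"
        fi
    done
}
```

Inside the container, the mask can be read with awk '/^CapEff/ {print $2}' /proc/self/status and passed to report_net_caps as its only argument.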
