Exit Code 137

Modified on Thu, 1 Dec, 2022 at 10:36 AM



One of Gremlin's more cryptic error messages is commonly referred to simply as a 137:


Gremlin did not exit normally: container failure: runc: exit Some(137)



This error doesn't actually come from Gremlin itself; it's the exit code returned to Gremlin by the container runtime (in this example, the OCI-compliant runc). By convention, when a process is terminated by a signal, container runtimes report an exit code of 128 plus the signal number. This disambiguates signal-induced exits from the executable's own exit codes, which are always less than 128. By this math, we can determine that 137 is actually 128 + 9. As many *nix programmers are intimately familiar, signal 9 is SIGKILL, which is delivered when the kernel (or another process) forcibly terminates a process.
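You can verify this arithmetic directly in a bash shell, where `kill -l` with a number prints the name of the corresponding signal:

```shell
# Decode an exit code like 137: subtract 128 to recover the signal number,
# then ask the shell which signal that is.
code=137
sig=$((code - 128))   # 9
kill -l "$sig"        # prints: KILL
```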


We can reproduce this locally:

Start up a long-running process:

% sleep 10000

In a second session, kill that process:

% ps axuww | grep sleep | grep -v grep | awk '{ print $2 }' | xargs kill -9

Back in the first session, we get output like:


% sleep 10000
[1] 96025 killed sleep 10000
% echo $?
137
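The same reproduction also works non-interactively as a single script (a sketch assuming bash; `wait` on a specific PID returns that child's exit status):

```shell
# Start a background process, SIGKILL it, then collect its exit status.
sleep 100 &
pid=$!
kill -9 "$pid"
wait "$pid"      # wait's return value is the child's exit status
echo $?          # prints: 137  (128 + 9 for SIGKILL)
```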


Why do 137s happen?


There are many reasons why supervising software might choose to terminate a "runaway" process. Most container runtimes have built-in memory killers (sometimes referred to as OOM killers). Supervisord will send SIGKILL if the managed process is unresponsive to other IPC. Systemd will send SIGKILL after a given timeout (if configured to do so). The most common case Gremlin users run into, however, involves Kubernetes.

Kubernetes
Kubernetes will kill a given container (and by extension any Gremlin attacks targeting that container) under a variety of conditions, including:

- The container fails its liveness probe
- The container exceeds its memory limit and is OOMKilled
- The node comes under resource pressure and the Pod is evicted
- The Pod is deleted or replaced, for example during a rolling update or scale-down

For some of the above cases, Kubernetes lets you configure how long pods are given to shut down gracefully via the terminationGracePeriodSeconds field in the Pod spec, but most commonly you will want to preserve the default behavior of Kubernetes.


This is a resilience mechanism: Kubernetes has determined that something has gone wrong with the Pod's containers and is attempting to heal automatically. When running Kubernetes attacks, Gremlin runs in the same Linux namespaces as its attack targets, so when Kubernetes initiates this resilience mechanism, it kills Gremlin along with your targets.


Debugging


While killing and restarting the impacted container is likely the desired outcome, it's still nice to confirm that's what actually happened. There are a couple of places you can look. First, the containers themselves:


% kubectl get pods -n $NAMESPACE $TARGET_POD -o go-template='{{range .status.containerStatuses}}{{printf "%d : %s\n" .lastState.terminated.exitCode .lastState.terminated.reason}}{{end}}'
129 : Error
137 : Error
137 : Error

This shows us the most recent exit code for each container in this pod (there are three containers in this example). The first exited with 129 (I ran a Shutdown Gremlin: 128 + 1, i.e. SIGHUP), and the sibling containers were terminated by Kubernetes, thus receiving 137s.
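If you find yourself translating these codes back to signals often, a small helper can do the arithmetic for you (a sketch assuming bash; `explain_exit` is a hypothetical function name, not part of any tool):

```shell
# Hypothetical helper: map a container exit code back to the signal
# that caused it (codes > 128 mean "terminated by signal code - 128").
explain_exit() {
  if [ "$1" -gt 128 ]; then
    echo "terminated by SIG$(kill -l $(($1 - 128)))"
  else
    echo "exited with code $1"
  fi
}

explain_exit 137   # prints: terminated by SIGKILL
explain_exit 129   # prints: terminated by SIGHUP
```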


Next, we can look at events stored in Kubernetes:

% kubectl get event -n $NAMESPACE --field-selector involvedObject.name=$TARGET_POD
62s Normal Pulling pod/examplePod-759cbc656d-85thp Pulling image "gremlin/example-service:1.6.1894"
61s Normal Created pod/examplePod-759cbc656d-85thp Created container example-service
61s Normal Started pod/examplePod-759cbc656d-85thp Started container example-service
36s Warning Unhealthy pod/examplePod-759cbc656d-85thp Startup probe failed: Get "http://10.244.49.7:8080/v1/probes/readiness": dial tcp 10.244.49.7:8080: connect: connection refused
61s Normal Pulled pod/examplePod-759cbc656d-85thp Successfully pulled image "gremlin/example-service:1.6.1894" in 867.166177ms
Here we see an example of a Pod that comes up and is initially unhealthy because its startup probe failed.

Finally, we can always examine the status of the Pod itself to make sure it's in a good state post-attack. When Kubernetes takes a Pod offline, it will typically either restart its containers or replace the Pod with another, so our primary concern is simply that we're back in a healthy state.

% kubectl get pods -n $NAMESPACE $TARGET_POD
NAME                          READY   STATUS    RESTARTS   AGE
examplePod-759cbc656d-85thp   3/3     Running   4          20h
