One of the more cryptic error messages Gremlin produces is often referred to simply as a 137:
Gremlin did not exit normally: container failure: runc: exit Some(137)
This doesn't actually come from Gremlin itself. It is the exit code returned to Gremlin by the container runtime (in this example, the OCI-compliant runc). By convention, when a process is terminated by a signal, shells and container runtimes report its exit code as 128 plus the signal number. This disambiguates a signal-induced exit from the exit code of the executable itself (which will normally be < 128). By this math, we can determine that the 137 here is actually 128 + 9. As many *nix programmers are intimately familiar, 9 is the signal number of SIGKILL, which a process receives when the kernel (or a supervisor) forces it to exit. We can reproduce this locally:
Start up a process
% sleep 10000
In a second session we kill the process
% ps axuww | grep sleep | grep -v grep | awk '{ print $2 }' | xargs kill -9
and we get an output in the first session like:
% sleep 10000
[1] 96025 killed sleep 10000
% echo $?
137
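The same experiment and arithmetic can be run in a single session. The following is a minimal sketch, assuming a POSIX-style shell such as bash or zsh; `describe_exit` is a hypothetical helper name used only for illustration and is not part of Gremlin or any container runtime.

```shell
# describe_exit is a hypothetical helper: it splits an exit code into the
# 128 offset and the signal number, then asks the shell for the signal name.
describe_exit() {
  code=$1
  if [ "$code" -gt 128 ]; then
    signal=$((code - 128))
    echo "exit $code = 128 + $signal (SIG$(kill -l "$signal"))"
  else
    echo "exit $code (process exited on its own)"
  fi
}

# Reproduce a 137 without a second terminal: start a background process,
# send it SIGKILL, and collect its exit code with wait.
sleep 10000 &
pid=$!
kill -9 "$pid"
code=0
wait "$pid" || code=$?
describe_exit "$code"
```

On a typical Linux or macOS shell the last line prints `exit 137 = 128 + 9 (SIGKILL)`. Calling `describe_exit 129` reports SIGHUP in the same way.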
Why do 137s happen?
There are many reasons why some supervising software might choose to terminate a "runaway" process. Most container runtimes have built-in memory killers (sometimes referred to as OOMKillers). Supervisord will send SIGKILL if the managed process does not respond to its stop signal in time. Systemd will send SIGKILL after a given timeout (if configured as such). The most common problem Gremlin users typically run into is related to Kubernetes.
Kubernetes
Kubernetes will kill a given container (and by extension any Gremlin attacks targeting that container) under a variety of conditions:
Liveness Probe Failures
Breach of Resource Limits
Preemption or Eviction of Pods (note: your Pod might get Evicted because of the behavior of an unrelated Pod)
OOMKillers
For some of the above cases, Kubernetes lets you configure how long a Pod is given to shut down before it is forcibly killed, via the terminationGracePeriodSeconds field in the Pod spec, but most commonly you want to preserve the current behavior of Kubernetes.
This is a resilience mechanism. Kubernetes has determined that something has gone wrong with the Pod or its containers and is attempting to heal automatically. When running Kubernetes attacks, Gremlin runs in the same Linux namespaces as your attack targets, so when Kubernetes initiates this resilience mechanism, it kills Gremlin along with your targets.
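Before running an attack, it can help to see which of these settings apply to the target. The following is a rough sketch, assuming `kubectl` is configured for the cluster; `show_kill_settings` is a hypothetical helper name, while the field paths it reads (`spec.terminationGracePeriodSeconds` and each container's `resources.limits`) are standard Pod spec fields.

```shell
# show_kill_settings is a hypothetical helper name, used for illustration.
show_kill_settings() {
  namespace=$1
  pod=$2
  # Grace period between the termination signal and the forced SIGKILL.
  kubectl get pod "$pod" -n "$namespace" \
    -o jsonpath='{.spec.terminationGracePeriodSeconds}{"\n"}'
  # Per-container resource limits; breaching a memory limit leads to an OOM kill.
  kubectl get pod "$pod" -n "$namespace" \
    -o jsonpath='{range .spec.containers[*]}{.name}{": "}{.resources.limits}{"\n"}{end}'
}
```

It would be invoked as `show_kill_settings "$NAMESPACE" "$TARGET_POD"`. A container that prints no limits has none set in its spec.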
Debugging
While killing and restarting the impacted container is likely the desirable outcome, it's still nice to be able to confirm that's what happened. There are a few places you can look. First, you can look at the containers themselves:
% kubectl get pods -n $NAMESPACE $TARGET_POD -o go-template='{{range .status.containerStatuses}}{{printf "%d : %s\n" .lastState.terminated.exitCode .lastState.terminated.reason}}{{end}}'
129 : Error
137 : Error
137 : Error
This shows us the most recent exit code for each container in this pod (there are 3 containers in my example). The first one exited with 129 (I ran a shutdown gremlin: 128 + 1) and the sister containers were terminated by Kubernetes, thus received 137s.
Next, we can look at events stored in Kubernetes:
% kubectl get event -n $NAMESPACE --field-selector involvedObject.name=$TARGET_POD
62s   Normal    Pulling     pod/examplePod-759cbc656d-85thp   Pulling image "gremlin/example-service:1.6.1894"
61s   Normal    Created     pod/examplePod-759cbc656d-85thp   Created container example-service
61s   Normal    Started     pod/examplePod-759cbc656d-85thp   Started container example-service
36s   Warning   Unhealthy   pod/examplePod-759cbc656d-85thp   Startup probe failed: Get "http://10.244.49.7:8080/v1/probes/readiness": dial tcp 10.244.49.7:8080: connect: connection refused
61s   Normal    Pulled      pod/examplePod-759cbc656d-85thp   Successfully pulled image "gremlin/example-service:1.6.1894" in 867.166177ms
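On a busy Pod the event list can get long. One possible way to narrow it down, sketched here under the assumption that `kubectl` is configured for the cluster, is to add `type=Warning` to the field selector and sort by last-seen time. `warning_events` is a hypothetical helper name.

```shell
# warning_events is a hypothetical helper name, used for illustration.
# It lists only Warning events for one pod, ordered by last-seen time.
warning_events() {
  namespace=$1
  pod=$2
  kubectl get event -n "$namespace" \
    --field-selector "involvedObject.name=$pod,type=Warning" \
    --sort-by=.lastTimestamp
}
```

It would be invoked as `warning_events "$NAMESPACE" "$TARGET_POD"`.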
Here we see an example of a Pod that comes up and is initially unhealthy because its startup probe, which targets the readiness endpoint, is failing.
Finally, we can always examine the status of the Pod itself to make sure it's in a good state post-attack. When Kubernetes takes a Pod offline, it will often either restart it or replace it with another Pod, so our primary concern is that we're back in a healthy state.
% kubectl get pods -n $NAMESPACE $TARGET_POD
NAME                          READY   STATUS    RESTARTS   AGE
examplePod-759cbc656d-85thp   3/3     Running   4          20h
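This check can also be scripted. The sketch below assumes `kubectl` is configured for the cluster and that the Pod was restarted in place rather than replaced under a new name; `check_recovery` is a hypothetical helper name and the 120s timeout is an arbitrary choice.

```shell
# check_recovery is a hypothetical helper name, used for illustration.
check_recovery() {
  namespace=$1
  pod=$2
  # Block until the pod reports Ready, or give up after two minutes.
  kubectl wait --for=condition=Ready "pod/$pod" -n "$namespace" --timeout=120s
  # Print each container's restart count.
  kubectl get pod "$pod" -n "$namespace" \
    -o jsonpath='{range .status.containerStatuses[*]}{.name}{": "}{.restartCount}{"\n"}{end}'
}
```

It would be invoked as `check_recovery "$NAMESPACE" "$TARGET_POD"`. A restart count that rose during the attack is consistent with Kubernetes having killed and restarted the container.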