Testing Failure of Cloud Provider Services

Modified on Thu, 13 Jun, 2024 at 4:24 PM

When deploying Gremlin, an important consideration is how to test the failure of cloud provider dependencies. In reliability engineering with Gremlin, object storage and managed database services (such as AWS S3 and AWS RDS) are treated as dependencies: the underlying hosts are not under your control, so your ability to test them directly is limited.

The correct question to ask when testing is "how does my application behave when a given dependency fails?"

At the application level, you can realistically simulate a failure state by running network experiments against the URL of the service in question. You can simulate a full outage (blackhole), service degradation (latency or packet loss), or DNS failover.
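While a network experiment is running, it can help to confirm from the application host how the dependency endpoint actually responds. The following is a minimal sketch, not part of Gremlin: the function name `probe_dependency`, the status labels, and the default port and timeout are all assumptions chosen for illustration. It only uses the Python standard library to classify a TCP connection attempt.

```python
import socket
import time


def probe_dependency(host, port=443, timeout=2.0):
    """Try a TCP connection and classify the result (hypothetical helper).

    Returns a (status, elapsed_seconds) tuple. The status labels are
    illustrative, not Gremlin terminology.
    """
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            status = "reachable"
    except socket.gaierror:
        # Name resolution failed, as you might see during a DNS experiment.
        status = "dns_failure"
    except socket.timeout:
        # No response within the timeout, as you might see when traffic is dropped.
        status = "timeout"
    except OSError:
        # Any other socket error, for example connection refused.
        status = "connection_error"
    return status, time.monotonic() - start
```

Calling `probe_dependency` with your service's hostname before and during an experiment gives a quick comparison. The elapsed time is also a rough indicator of added latency, although it only measures connection setup, not a full request.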

In serverless deployments where you have no access to the underlying infrastructure (for example, AWS Lambda or AWS Fargate), Gremlin recommends using Failure Flags, our application-level fault injection solution. Add these flags to the application calls to your dependencies (e.g. RDS or S3). If you prefer, you can also inject custom exceptions or error conditions from within your application.
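As an illustration of injecting a custom exception from within the application, here is a minimal hand-rolled sketch. It does not use the Failure Flags SDK. The decorator `fault_point`, the exception `InjectedDependencyError`, the `FAULT_<NAME>` environment variable convention, and the functions `load_report` and `get_report` are all hypothetical names invented for this example.

```python
import os
import time
from functools import wraps


class InjectedDependencyError(Exception):
    """Raised in place of a real dependency error during an experiment."""


def fault_point(name):
    """Wrap a dependency call so a fault can be switched on via environment.

    Hypothetical convention: FAULT_<NAME>=error raises an exception,
    FAULT_<NAME>=latency:<seconds> delays the call.
    """
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            mode = os.environ.get(f"FAULT_{name.upper()}", "")
            if mode == "error":
                raise InjectedDependencyError(f"injected failure for {name}")
            if mode.startswith("latency:"):
                time.sleep(float(mode.split(":", 1)[1]))
            return func(*args, **kwargs)
        return wrapper
    return decorator


@fault_point("s3")
def load_report(key):
    # Placeholder for a real object storage read through your SDK client.
    return f"contents of {key}"


def get_report(key):
    # The behavior under test: what the caller sees when the dependency fails.
    try:
        return load_report(key)
    except InjectedDependencyError:
        return "report temporarily unavailable"
```

With `FAULT_S3` unset, `get_report` returns the normal result. Setting `FAULT_S3=error` exercises the fallback path. In a real application the `except` clause would also need to catch the errors your SDK raises, so that the injected and genuine failures follow the same path.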