Dealing with a catastrophe

posted 12.09.2018

AWS was preparing customers in us-east-1 for Hurricane Florence, and it reminded me a bit of all the disastrous scenarios you can encounter when using AWS (and how much easier they are to mitigate than if you were running servers on-premise).

Scenario 1: hardware failure

One of the main advantages of cloud infra for me is not worrying about physical infrastructure at all. I recall a talk at an internal Rocket Internet Summit a few years ago: some engineers spent a 30-minute slot on how they found a bottlenecked network link between their redis instances, which had been affecting them for weeks. It felt crazy to spend time running around a datacenter instead of building ... products.

In general, AWS will swap out any bad underlying hosts or disks without you knowing (I assume it's some sort of robot arm today). There are a few practical implications though:

EC2
Run everything in ASGs.
You'll get used to instances being terminated, and with a bit of work you won't notice the difference between scaling down after a peak and an unhealthy instance being terminated.
One pattern I see underused is running single EC2 instances in an ASG as well, with min and max set to 1. If the instance gets terminated, the ASG spins up a replacement without any manual action. If you mutate server state predictably (e.g. Capistrano deployments), you can also bake AMI snapshots to make the replacement completely seamless.
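To make the "ASG of one" pattern concrete, here's a minimal boto3 sketch; the group name, launch template and subnets are placeholders rather than a real setup:

    import boto3

    autoscaling = boto3.client("autoscaling")

    # An ASG with min = max = desired = 1: if the instance dies, a replacement
    # is launched automatically into one of the listed subnets.
    autoscaling.create_auto_scaling_group(
        AutoScalingGroupName="single-worker",         # hypothetical name
        MinSize=1,
        MaxSize=1,
        DesiredCapacity=1,
        LaunchTemplate={
            "LaunchTemplateName": "single-worker-lt", # assumed to already exist
            "Version": "$Latest",
        },
        VPCZoneIdentifier="subnet-aaa,subnet-bbb",    # placeholder subnets in two AZs
        HealthCheckType="EC2",
        HealthCheckGracePeriod=300,
    )
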
RDS
For production workloads, Multi-AZ is expensive, but also a must-have. In our experience a failover means errors for up to 20 seconds. Be careful: 90% of default connection pool configs in database client libs can't handle the DNS change after a failover (1, 2). After fixing these issues across all our apps and frameworks, we feel comfortable enough to trigger failovers during live operations.
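A Multi-AZ failover can be triggered on demand by rebooting with ForceFailover, which is handy for testing those connection pools before a real outage does it for you. A minimal boto3 sketch, with a placeholder instance identifier:

    import boto3

    rds = boto3.client("rds")

    # Rebooting with ForceFailover=True promotes the standby in the other AZ,
    # producing the same DNS flip your client libraries must survive for real.
    rds.reboot_db_instance(
        DBInstanceIdentifier="prod-main-db",  # hypothetical identifier
        ForceFailover=True,
    )
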
Elasticache
Same as for RDS: double-check whether your redis client library can handle the failover gracefully. AWS allows you to trigger a manual failover at any time.
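That manual trigger is the TestFailover API, which promotes a replica in a single node group so you can watch how your client reacts; a minimal boto3 sketch with placeholder IDs:

    import boto3

    elasticache = boto3.client("elasticache")

    # Promotes a replica in the given shard of the replication group.
    elasticache.test_failover(
        ReplicationGroupId="prod-sessions",  # hypothetical replication group
        NodeGroupId="0001",                  # shard to fail over
    )
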

Scenario 2: AZ failure

In the Well-Architected Framework whitepapers, you'll find a lot of talk about availability zones. In the interface, AWS does a great job of giving the impression that you have automatic datacenter redundancy, but you still need to be aware of how it all works. The first effect you might notice is that you start accruing inter-AZ data transfer charges, which I often see as a source of frustration with EC2.

It's hard to detect an AZ failure in real time, and there is no reliable API telling you which services in which AZ are not working properly. Some issues also manifest as connectivity problems between specific AZs. For that reason, we took a more manual emergency runbook approach, which allows the on-call engineer to follow a set of steps to recover in case of a prolonged outage (a rough sketch of the first two steps follows the list):

  • Modify auto-scaling groups to exclude the AZ we think is misbehaving
  • Drain and decommission all instances already running within that AZ
  • Manually fail over all Multi-AZ RDS databases whose master is in the affected AZ
  • Do the same manual failover for Elasticache clusters
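
A rough boto3 sketch of the first two steps, assuming one subnet per AZ and placeholder names throughout; in reality the on-call engineer reviews each group before running anything like this:

    import boto3

    BAD_AZ = "us-east-1c"                      # the AZ we believe is misbehaving
    HEALTHY_SUBNETS = "subnet-aaa,subnet-bbb"  # placeholder subnets outside that AZ

    autoscaling = boto3.client("autoscaling")

    for group in autoscaling.describe_auto_scaling_groups()["AutoScalingGroups"]:
        name = group["AutoScalingGroupName"]

        # Step 1: stop launching new capacity into the bad AZ by narrowing
        # the subnets the group is allowed to use.
        autoscaling.update_auto_scaling_group(
            AutoScalingGroupName=name,
            VPCZoneIdentifier=HEALTHY_SUBNETS,
        )

        # Step 2: recycle instances already running there; keeping the desired
        # capacity unchanged means replacements come up in the healthy AZs.
        for instance in group["Instances"]:
            if instance["AvailabilityZone"] == BAD_AZ:
                autoscaling.terminate_instance_in_auto_scaling_group(
                    InstanceId=instance["InstanceId"],
                    ShouldDecrementDesiredCapacity=False,
                )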

One tough test for us was stateful, cluster-forming applications like Akka. Its default cluster-formation heartbeats expect your instances to run within a single physical datacenter, and cross-DC network jitter completely throws them off. By using the kubernetes API as a source of truth (etcd seems to be much more capable of handling cross-AZ quorum maintenance), we were able to implement a very reliable split-brain resolver that handles the connectivity failures I mentioned above.
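
The resolver itself lives inside our Akka application, but the core idea translates into a short Python sketch: ask the kubernetes API which pods should be members, and only let the side that still reaches a majority of them survive. The namespace, label selector and probe port below are placeholder assumptions:

    import socket

    from kubernetes import client, config

    def reachable(pod_ip, port=2552, timeout=1.0):
        """Crude liveness probe: can we open a TCP connection to the peer?"""
        try:
            with socket.create_connection((pod_ip, port), timeout=timeout):
                return True
        except OSError:
            return False

    def should_stay_up(namespace="prod", label_selector="app=worker"):
        # The API server (backed by etcd, which handles cross-AZ quorum for us)
        # is the source of truth for who ought to be in the cluster.
        config.load_incluster_config()
        pods = client.CoreV1Api().list_namespaced_pod(
            namespace, label_selector=label_selector
        )
        members = [p.status.pod_ip for p in pods.items if p.status.pod_ip]
        seen = [ip for ip in members if reachable(ip)]
        # Survive only if this side still sees a strict majority of the
        # expected membership; the minority side shuts itself down.
        return len(seen) * 2 > len(members)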

Scenario 3: Regional failure

We're not Netflix, yet :)