Almost five years have passed since I wrote about joining foodpanda in Berlin. During those five years, so many incredible things have happened. I now have two kids. I moved from Prague mostly focused on writing code, and now I am leaving Berlin as a full-time manager. I was also lucky to pick one of the most exciting industries in Berlin. Already in 2015 we felt last-mile delivery would explode, and its challenging nature kept me in Berlin for far longer than I originally anticipated.
Since June I've been travelling back and forth between the two cities. Seeing the contrast twice a week highlights all the things my wife and I have grown to love in Berlin.
In the spring we went for a short vacation to the Baltic and stayed mostly in Denmark. There we realized it has even more of what we appreciate about Berlin: the focus on quality of life. Ecology is a no-brainer, the food in supermarkets is always of great quality, and city planning optimizes for common happiness. I see it more and more as a core German trait (a recent one, perhaps?), rather than the common stereotypes of meticulousness, love of order, or punctuality. Copenhagen and the Danes are on a whole other level. But hopefully Prague is also slowly rising on this north-south axis, coming closer to the other two year after year.
My talk from the 2019 AWS Summit in Berlin. I was talking about how we adapted our applications to running solely on Spot instances. Apart from the great cost saving effect, we've seen a great improvement in the resiliency of all our workloads.
AWS was preparing customers in us-east-1 for Hurricane Florence, and it reminded me a bit of all the disastrous scenarios you can encounter when using AWS (and how you can mitigate them 100% better than if you were running servers on-premise).
Scenario 1: hardware failure
One of the main advantages of cloud infra for me is not worrying about physical infrastructure at all. I recall a talk at an internal Rocket Internet Summit a few years ago. Some engineers gave a 30-minute talk on how they found a bottleneck network link between their Redis instances, which had been affecting them for weeks. It felt crazy to spend time running around a datacenter instead of building ... products.
In general AWS will swap out any bad underlying hosts or disks without you knowing (I assume it's some sort of robot arms today). You will see a few practical implications though:
- Run everything in ASGs.
You'll get used to instances being terminated, and with a bit of work you won't recognize the difference between scaling down after peak and an unhealthy instance being terminated.
One pattern I see underused is running your single EC2 instances in an ASG as well, with min/max set to 1. In case the instance gets terminated, the ASG will spin up a replacement without manual action. If you mutate the server state predictably (e.g. Capistrano deployments), you can also make AMI snapshots to make it completely seamless.
- For production RDS workloads, Multi-AZ is expensive, but also a must-have. In our experience you will see errors for up to 20 seconds during a failover. Be careful: 90% of default connection pool configs in database client libs can't handle a DNS change after failover (1, 2). After fixing these issues across all our apps and frameworks, we feel comfortable enough to trigger failovers during live operations.
- Same as for RDS, double check whether your Redis client library can handle the failover gracefully. AWS allows you to trigger a manual failover at any time.
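The client-side fix behind the last two bullets usually comes down to throwing away the broken connection and opening a fresh one, so the endpoint hostname gets resolved again after the failover. Here is a minimal sketch of that idea, not tied to any particular database or Redis library. `connect` and `operation` are hypothetical callables you would supply yourself, and the attempt count and delay are assumed values loosely sized to the failover window mentioned above.

```python
import time


def call_with_reconnect(connect, operation, attempts=5, delay=5.0, sleep=time.sleep):
    """Run operation(conn), reconnecting from scratch after connection errors.

    connect:   hypothetical factory that opens a NEW connection. It should
               resolve the endpoint hostname on every call instead of reusing
               a cached IP address.
    operation: hypothetical callable that takes the connection and does the work.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    conn = None
    last_error = None
    for attempt in range(attempts):
        try:
            if conn is None:
                conn = connect()
            return operation(conn)
        except OSError as exc:  # ConnectionError and DNS errors are OSError subclasses
            last_error = exc
            conn = None  # drop the stale connection (a real client would also close it)
            if attempt < attempts - 1:
                sleep(delay)
    raise last_error
```

With a real client you would also want the pool itself to validate connections before handing them out, which is a library-specific setting and is left out here.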
Scenario 2: AZ failure
In the Well-Architected Framework whitepapers, you'll find a lot of discussion about availability zones. In the interface, AWS does a great job of giving the impression that you have automatic datacenter redundancy, but you still need to be aware of how it all works. The first effect you might notice is that you will start accruing inter-AZ data transfer charges, and I often see this as a source of frustration with EC2.
It's hard to detect an AZ failure in real time and there is no reliable API telling you which services in which AZ are not working properly. Some issues also manifest as connectivity issues between specific AZs. For that reason, we took a more manual emergency rulebook approach, which would allow the on-call engineer to follow steps to recover in case of a prolonged outage:
- Modify auto-scaling policies to exclude the AZ we think is misbehaving
- Drain and decommission all instances already running within that AZ
- Manually fail over all Multi-AZ RDS databases where the master is in the affected AZ
- Do the same manual failover for ElastiCache clusters
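Parts of this rulebook can be prepared as a small planning helper for the on-call engineer to review before acting. The sketch below only computes what to change and calls no AWS APIs. The function names are hypothetical, the subnet-to-AZ mapping is assumed to be collected beforehand, and the database records are assumed to use the field names returned by the RDS DescribeDBInstances API.

```python
def subnets_outside_az(subnet_to_az, bad_az):
    """Return the comma-separated subnet list an ASG should keep using."""
    keep = sorted(s for s, az in subnet_to_az.items() if az != bad_az)
    if not keep:
        raise ValueError("refusing to remove every subnet from the group")
    return ",".join(keep)


def databases_to_fail_over(db_instances, bad_az):
    """Pick Multi-AZ RDS instances whose primary currently runs in the bad AZ."""
    return [
        db["DBInstanceIdentifier"]
        for db in db_instances
        if db.get("MultiAZ") and db.get("AvailabilityZone") == bad_az
    ]


def build_plan(subnet_to_az, db_instances, bad_az):
    """Collect the emergency steps into one reviewable dictionary."""
    return {
        "asg_vpc_zone_identifier": subnets_outside_az(subnet_to_az, bad_az),
        "rds_failovers": databases_to_fail_over(db_instances, bad_az),
    }
```

With boto3, the first value could be passed to `update_auto_scaling_group(VPCZoneIdentifier=...)` and each database identifier to `reboot_db_instance(..., ForceFailover=True)`. Keeping the plan separate from the execution lets a human sanity-check it first, which matters when the diagnosis of the failing AZ is itself a guess.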
One tough test for us was stateful cluster-forming applications, like Akka. Its default cluster formation heartbeats expect your instances to run within a single physical datacenter, and cross-DC network jitter completely throws it off. By using the Kubernetes API as a source of truth (etcd seems to be much more capable of handling cross-AZ quorum maintenance) we were able to implement a very reliable split-brain resolver handling the connectivity failures I mentioned above.
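Our actual resolver is not reproduced here, so what follows is only a sketch of the general idea under stated assumptions: cluster members are identified by pod name, the list of live pods has already been fetched from the Kubernetes API, and the `decide` function and its return shape are hypothetical.

```python
def decide(self_name, unreachable, live_pods):
    """Decide what one node should do about members it cannot reach.

    unreachable: names of cluster members this node has lost contact with.
    live_pods:   names of pods the Kubernetes API currently reports as running
                 (assumed to be fetched elsewhere).
    Returns a tuple (down_self, members_to_down).
    """
    live = set(live_pods)
    if self_name not in live:
        # Kubernetes no longer lists this node, so it should leave the cluster.
        return True, []
    # Down only the members Kubernetes has really removed. Members that still
    # exist are probably behind a temporary network problem, so keep them.
    return False, sorted(m for m in unreachable if m not in live)
```

The point of the design is that the decision no longer depends on which side of a network partition a node happens to sit on: every node consults the same external record instead of its own heartbeat view.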
Scenario 3: Regional failure
We're not Netflix, yet :)