Lessons learned from Monzo’s Kubernetes outage
Thu 3 May 2018
Monzo engineers had a rough day in October last year, when the online bank suffered a total outage lasting an hour and 21 minutes. At KubeCon in Copenhagen, Oliver Beattie, Monzo’s head of engineering, talked through the outage and discussed some of the lessons he learned from living through every engineer’s worst-case scenario.
A full breakdown of the Monzo outage can be found on its community forums, where Beattie published the details shortly after it occurred. In brief, the problem related to an incompatibility between the specific versions of Linkerd and Kubernetes that Monzo was using.
Linkerd is a tool from Buoyant that Monzo uses to allow microservices to ‘talk’ to each other. In this instance, Linkerd was found to be sending requests to IP addresses that no longer existed. When engineers tried to restart the Linkerd pods to fix this, they found that the pods wouldn’t start at all.
This brings us to one of Beattie’s major lessons. Citing Netflix’s Simian Army, one of the best-known toolkits for chaos engineering, Beattie advocated for chaos engineering as a way of discovering problems before they have a real-life impact.
Chaos engineering is ‘the discipline of experimenting on a distributed system in order to build confidence in the system’s capability to withstand turbulent conditions in production.’ Had Monzo used this, Beattie said, they “would have caught the issue before it became an issue.”
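As a rough illustration of the idea, and not a description of Monzo’s or Netflix’s tooling, a chaos experiment can be sketched as deliberately injecting faults into a dependency and then measuring whether callers still cope. Everything in the sketch below (the `FlakyService` wrapper, the retry helper, the 30% failure rate) is a hypothetical assumption chosen for the example.

```python
import random


class FlakyService:
    """Hypothetical wrapper that makes a fraction of calls fail on purpose."""

    def __init__(self, call, failure_rate, rng):
        self.call = call
        self.failure_rate = failure_rate
        self.rng = rng
        self.injected = 0  # number of faults injected so far

    def __call__(self, arg):
        # rng.random() returns a float in [0, 1), so a rate of 1.0 always fails
        if self.rng.random() < self.failure_rate:
            self.injected += 1
            raise ConnectionError("injected fault")
        return self.call(arg)


def call_with_retries(service, arg, attempts=5):
    """The behaviour under test: retry a failing dependency a few times."""
    last_error = None
    for _ in range(attempts):
        try:
            return service(arg)
        except ConnectionError as err:
            last_error = err
    raise last_error


def run_experiment(failure_rate=0.3, requests=1000, seed=42):
    """Return (faults injected, requests that still failed after retries)."""
    rng = random.Random(seed)  # seeded so the experiment is repeatable
    service = FlakyService(lambda x: x * 2, failure_rate, rng)
    failures = 0
    for i in range(requests):
        try:
            call_with_retries(service, i)
        except ConnectionError:
            failures += 1
    return service.injected, failures


if __name__ == "__main__":
    injected, failures = run_experiment()
    print(f"injected {injected} faults, {failures} requests failed after retries")
```

The point of such an experiment is the comparison between the two numbers: if many faults are injected but few requests ultimately fail, the system is absorbing the turbulence; if the two track each other closely, a weakness has been found before real traffic finds it.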
One successful aspect of this fairly large failure for Monzo was that customers were still able to make payments. Although it was an extremely “hairy” day for the team, Beattie noted that there were only two customer complaints. This is, he acknowledged, “not ideal”, but it is fair to say it could have been worse.
Payments kept working during the outage thanks to another borrowed concept – this time taken from security – that of ‘defence in depth.’ By building multiple layers of redundancy that can compensate for each other if one fails, Beattie noted, Monzo had taken an idea usually associated with security and applied it to reliability.
Because of this, customers couldn’t see transactions in their app but crucially they could make payments. “It really paid off to have built that,” said Beattie, who also argued that this is a lesson that can apply much more broadly than just in payments.
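The talk as reported here does not describe how Monzo’s layers are built, so the following is only a minimal sketch of the general pattern, under the assumption that a simpler backup path can answer when the main one cannot. The names `primary`, `fallback`, `Unavailable`, and the 5000 pence limit are all hypothetical.

```python
class Unavailable(Exception):
    """Raised by a layer that cannot answer right now."""


def authorise(payment, layers):
    """Try each layer in order and return (layer name, decision)
    from the first layer that is able to answer."""
    for name, handler in layers:
        try:
            return name, handler(payment)
        except Unavailable:
            continue  # this layer is down, fall through to the next one
    return "none", False  # every layer is down: decline rather than guess


def primary(payment):
    # Stand-in for the full platform, simulated here as being down
    raise Unavailable("primary platform is down")


def fallback(payment):
    # Stand-in for a deliberately simple backup rule (an assumption):
    # approve small payments only
    return payment["amount_pence"] <= 5000


if __name__ == "__main__":
    layers = [("primary", primary), ("fallback", fallback)]
    print(authorise({"amount_pence": 1200}, layers))
```

The design choice worth noting is that the backup layer does less than the primary one. It does not need to support every feature, such as showing transactions in an app; it only needs to keep the most critical operation working while the main path is repaired.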
A final lesson, and one that Beattie says runs through the culture at Monzo, is the need to embrace transparency. Transparency is the “default position” at Monzo, he said, and he encouraged other organisations to think in the same way. Because he chose to publish details of the outage online, he said, the team was able to dig much further into the root causes of the problem than it would have if the investigation had remained internal.
“Companies are very good at talking about their successes,” Beattie finished by saying, “but not their failures, and I think we can all learn from each other’s experiences, good and bad.”