Scaling infrastructure through Covid – How Dev and Ops can work even closer together
Thu 26 Nov 2020 | Ed Hoppitt
Scaling existing systems is a key component of the coronavirus challenge. Cloud-native patterns, practices, and technologies can help
The COVID-19 crisis has brought many challenges and changes for IT teams. There are the immediate business challenges of keeping employees connected and productive – and supporting remote working over a VPN designed to handle a tiny fraction of the workforce, for example.
Then, there are the wider commercial challenges of serving customers digitally at an unprecedented scale. Supermarkets, for example, are facing exponential growth in online ordering. If there was ever a time when you needed a real-time view of store inventory to support click and collect, now is that time.
Finally, there are the challenges of quickly delivering new digital experiences to support navigating the crisis itself: the Government and the NHS need a real-time view of ICU beds and ventilator availability, as well as contact tracing to lift shielding orders and assess cases, and financial institutions need to support a range of customers through the economic fallout.
Scaling existing systems is hard. Scaling existing systems while also introducing new applications that need to be scaled nationwide almost immediately can feel impossible. Cloud-native patterns, practices, and technologies can help. But for most organisations those patterns, practices, and technologies aren’t pervasive yet.
So what can you do in this moment of crisis? After all, architectures can’t make a U-turn. But as we’ve seen throughout history, human behaviours can, especially in times of crisis. To scale and get to production faster, both developers and ops managers must contribute to closer and more efficient cooperation, now and in the future.
Hurdles that often stand in the way of closer cooperation
Before exploring ways developers and operators can help each other, we need to understand what currently gets in the way of scaling and speed to market. There are many things. Some of the bigger, systemic reasons often come down to:
- Mismatch of goals and objectives. Operations teams are often aligned and measured around cost and stability. Maintaining uptime and reducing costs are the name of the game. While those are important, they don’t mesh well with change, and scaling events and new features or applications all introduce change into the environment. Meanwhile, developers are often more aligned to business goals, and may be measured on feature velocity. In other words, they are all about change, and sometimes stability is an afterthought, or worse, it’s “someone else’s problem.”
- Mismatch of dev and prod environments. When different infrastructure and tools are used in production versus development, you end up with a language barrier between dev and ops. “Worked in dev. Ops problem now” is the classic refrain that stems from this dynamic. This slows transitions into production because new issues crop up in new environments. It also slows troubleshooting during incidents, often leading to longer mean time to resolution (MTTR).
Tips on how developers can help ops managers
Ops teams carry the responsibility for uptime, and have historically spent a lot of time making sure that infrastructure is available. In a cloud-native world, this gets turned on its head, as infrastructure is assumed to be unreliable. But ops teams can’t do much to the code itself. Developers, however, can make their code more uptime friendly and scalable:
- Ensure the “observability” of the code. Instrumenting your code from the get-go pays off in the long run. Spring has made this relatively easy with Spring Boot Actuator and Micrometer. The faster the reasons for downtime are identified, the higher the uptime. Using common tooling for monitoring and observability also helps bridge the dev/prod mismatch.
- Ensure that the application can be restarted. If there are 15 steps and a strict order of operations to start your application, this is going to complicate uptime and scaling. If it takes 30 minutes to boot all the logical layers in your application, moving your application to healthy infrastructure in the event of a failure will be painful. The faster and simpler it is to restart your application, the more resilient it can be to failures. This is a fundamental assumption in using a system like Kubernetes, which automates starts and restarts in a continuous reconciliation loop. This one factor of fast restarts allows ops teams to take advantage of powerful automation.
- Ensure stateless processes wherever you can. This is a classic 12-factor principle, but it makes a big difference in scaling. Storing state in the application introduces all kinds of complexity in managing data. Data in distributed systems is a hard problem. The more you can run as stateless, the more you can isolate the data tier and its complexities.
By thinking about how to make code more uptime friendly and scalable, developers help ops teams with their goals and objectives. This helps to bridge the mismatch that often occurs between these groups. It also helps deliver a better customer experience, particularly during this crisis when customers depend more on digital services.
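To make the stateless-process tip a little more concrete, here is a minimal Python sketch. Everything in it is hypothetical and chosen for illustration: `ExternalStore` is an in-memory stand-in for a real backing service (a database or cache), and `BasketService` is an invented example service. The point it tries to show is that when the service keeps no per-user data of its own, any replica, or a freshly restarted one, can pick up the next request.

```python
from dataclasses import dataclass, field


@dataclass
class ExternalStore:
    """Hypothetical stand-in for a backing service (database, cache, etc.)."""
    _data: dict = field(default_factory=dict)

    def get(self, key, default=None):
        return self._data.get(key, default)

    def put(self, key, value):
        self._data[key] = value


class BasketService:
    """Hypothetical service that holds no per-user state in the process."""

    def __init__(self, store):
        self.store = store

    def add_item(self, user_id, item):
        # Read the current basket from the store, copy it, append, write back.
        basket = list(self.store.get(user_id, []))
        basket.append(item)
        self.store.put(user_id, basket)
        return basket


store = ExternalStore()
instance_a = BasketService(store)
instance_b = BasketService(store)  # a second replica, or a restarted process

instance_a.add_item("user-1", "milk")
basket = instance_b.add_item("user-1", "bread")  # served by a different instance
```

Because the basket lives in the store rather than in `instance_a`, losing or replacing that instance loses nothing, which is what lets an orchestrator restart or scale instances freely. In a real system the store would be a networked service, and concurrent writes would need more care than this sketch takes.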
Tips on how ops managers can help developers
If developers can help ops teams by taking on some of their concerns, the same goes for operators helping developers. However, developers’ needs don’t fit quite as neatly into a single pattern such as ensuring uptime.
- Speed the path to production. Developers may be all over the map with the problems they are trying to solve and the tools they are bringing to bear. But nobody likes waiting: waiting for a dev environment, waiting for a change window, waiting for any number of checkpoints to get code into production. Look for ways to reduce the waiting game for developers.
- Get into a listening rhythm. Infrastructure teams often spend most of their time minding, well, the infrastructure. Ops teams can get a better sense of what would help developers by spending more time with those teams. Scheduling office hours, dropping in on dev team meetings, and asking developers for feedback are some quick changes to start those conversations. If ops teams are spending 10% of their time with developers, try to get it up to 30%. If it’s 30, get it to 50.
- Find ways to say yes. Ask any parent of a toddler and they’ll tell you that “no” is an easier word to say. It’s almost a human instinct rooted in a need to deal with chaos, complexity, and threats to our basic needs. And it burns bridges faster than fire. When developer requests come in, instead of being the Department of No, slow down to see if you can find a way to say yes. Getting to the root of why something is asked for is a helpful way to identify ways to say “yes.”
Closer cooperation for faster, more agile responses
For companies that have been on the path to cloud-native, the pandemic has proved the value of the difficult changes they’ve been investing in. From launching new applications in days, to scaling 10x without breaking a sweat, these patterns, practices, and technologies are paying off. They can’t be turned on overnight, but there are small steps that individuals can take to make scaling, uptime, and speed to market more seamless. COVID-19’s challenges act as a catalyst for what is already required and will be required even more in the future: closer cooperation between developers and ops managers.