Systems development is like building a house. Everyone involved, including the future occupants, has a role to play. There are construction managers, architects and bricklayers. They all contribute to ensuring that the house gets built. Corne van Dyk, JUMO’s Head of Systems Development, is like our ‘property developer’ who oversees it all, but everyone has a hand in it before moving in.
I always describe SysDev as the plumbers of the business — the people who look after some of the most basic but essential infrastructure. At JUMO, the systems development team is responsible for all the underlying platforms and tools that other teams use, which includes everything from site reliability to user access to anything that doesn’t neatly fit into software development or data engineering.
A few years ago, we recognised that the number of systems we were supporting had gotten out of hand. Over the years, we had accumulated a lot of tools. This was mostly due to the growth phase the business was in. We’d employed a large number of new people, and many new joiners had their personal tooling preferences.
We tend to want to just keep everyone happy by allowing them to use the tools they like and are used to.
But of course, this resulted in substantial overlap in a number of areas. For example, at one point we had eight different systems dealing with logging, monitoring and alerting. This became untenable for SysDev, who had to support all these tools — never mind the fact that troubleshooting wide-scale incidents was next to impossible.
In addition, our compute platform (Docker Swarm at the time) was mostly homegrown and became hard to maintain. On several occasions we had to rebuild it from scratch and redeploy all services because of some strange hiccup. System knowledge was concentrated in a few people, causing them undue strain while putting the entire business at risk.
A while back, our team adopted the semi-flippant tagline ‘helping you, help yourself’ and we would quote this to people at every opportunity. In reality though, we were very much in the middle of everything, and starting to become a blocker. Teams were waiting on SysDev to create resources/tools for them, or to rebuild a system.
In the mix
As I mentioned, we had a sprawl of systems and tooling in various technology areas. Some of these are listed below:
Into the unknown
Admitting you have a problem is the first step to recovery, right? Once we had accepted the unpleasant situation we found ourselves in, we set some high-level goals:
- Consolidate the various technologies into one, or at most two, solutions.
- Choose solutions that show a level of futureproofing.
- Buy over build, unless homegrown has significant merit. (This doesn’t necessarily imply commercial solutions.)
- Make the team tagline “helping you, help yourself” a reality.
We also introduced an RFC (Request For Comments) process where the problem statement as well as possible and preferred solutions could be discussed in a public forum. This allowed all interested parties to give input. We also did technical evaluations to ensure that the preferred systems would meet our requirements.
Finally, we chose AWS EKS (compute platform), Terraform (infrastructure-as-code), Jenkins & Spinnaker (CI/CD) and Datadog (logging and monitoring). For completeness, I’ll mention that we also decided to standardise our development language (Kotlin) and our application databases (AWS RDS Postgres on Aurora).
JUMO was planning for expansion into new territories where regulation required us to deploy a dedicated set of infrastructure and applications. We took the opportunity to implement a standardisation process. During the second half of 2019, SysDev spent many late nights putting version one into place and helped the development teams deploy their services onto the new infrastructure in early 2020.
I’ll freely admit that the initial versions of our various technology implementations were rough around the edges, but ultimately they worked, did what they were supposed to, and we’ve since been through several improvement cycles.
A bright new day
Did we succeed in what we set out to do?
I’ll say this loud and clear: we haven’t had a significant outage on our compute platform since migrating to EKS. This is in part due to using AWS managed services, but also because the technology is much better understood.
We have improved observability, as all the logs, monitors and alarms are located in a single place.
We have a more controlled and secure environment now, as infrastructure creation follows the same engineering principles of code review and pipelined deployments. We can enforce mandatory security controls, such as encryption and multi-AZ configurations, using Terraform modules. We’ve also removed the ability for humans to create infrastructure, by giving CRUD permissions only to the pipeline.
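As an illustration only, the two ideas above (mandatory controls baked into shared modules, and creation rights held only by the pipeline) can be sketched in a few lines of Python. This is not JUMO's actual Terraform code: `DatabaseRequest`, `violations`, `can_create_infrastructure` and the `"pipeline"` principal name are all hypothetical, and a real setup would express these rules in Terraform modules and IAM policies rather than in application code.

```python
from __future__ import annotations

from dataclasses import dataclass

# Hypothetical name for the only identity allowed to change infrastructure.
PIPELINE_PRINCIPAL = "pipeline"


@dataclass(frozen=True)
class DatabaseRequest:
    """Hypothetical input a team might pass to a shared database module."""

    name: str
    storage_encrypted: bool = True  # secure by default
    multi_az: bool = True           # resilient by default


def violations(request: DatabaseRequest) -> list[str]:
    """Return the mandatory controls that the request fails to meet."""
    problems = []
    if not request.storage_encrypted:
        problems.append(f"{request.name}: storage encryption must be enabled")
    if not request.multi_az:
        problems.append(f"{request.name}: multi-AZ must be enabled")
    return problems


def can_create_infrastructure(principal: str) -> bool:
    """Only the deployment pipeline may create, update or delete resources."""
    return principal == PIPELINE_PRINCIPAL
```

A request that keeps the defaults passes with no violations, while one that switches off multi-AZ is rejected before anything is created. The point of the sketch is that the secure choice is the default and opting out is caught in review or in the pipeline, not after deployment.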
We now run systems where knowledge and skills are widely spread across the team, avoiding the situation where just a few individuals know what’s going on. This has many benefits such as growth and learning opportunities for team members and increased inclusion, as no-one is seen as having exclusive rights to ‘the cool toys’.
Most importantly, SysDev has stopped being a blocker for other teams by, yes, “helping you, help yourself”.