If you’re interested in this topic (agree or disagree), we’d love to have you join the community.
Why Kubernetes as the Foundation of Modern On-prem
Kubernetes matters. It is creating a new era of software development and deployment: an era with a common implementation of a sophisticated solution to a complex problem. The primary and secondary effects of that change will be profound. Over the last 5 years, Kubernetes has emerged as the canonical implementation of the patterns that allow modern software to be truly reliable and scalable. We’ll quickly examine the recent context that led to these patterns emerging, the origins of Kubernetes, and the ecosystem that has grown up around it.
Modern Reliability Challenges
First, it is important to remember that the web wasn’t always as reliable as we know it today. Most of the early “web-scale” companies (including Amazon & eBay) struggled with determining how to operate their applications at a level of global scale and availability that hadn’t been required of software before. Site-wide outages were common and software updates often required planned downtime (some legacy SaaS companies still do this today). For many years the best solution was to buy expensive, physical servers and throw bodies (sysadmins and DBAs) at the problem. As VMs and IaaS became the common infrastructure choice, tools like Chef and Puppet emerged to facilitate automated configuration management. This allowed diligent teams with an extreme focus on writing scalable applications to succeed. However, the patterns weren’t perfect. Many apps still had reliability problems.
Another memorable example of the reliability issues that many early web-scale companies faced was Twitter’s infamous “fail whale.” Early Twitter users will recall how often the “fail whale” icon appeared as the service attempted to keep up with staggering user growth. Service interruptions like this at Twitter weren’t the result of their team not knowing how to write great software; they occurred because only a handful of people at the time knew how to write and operate truly reliable software at global scale.
Google’s Solution for Reliability
While much of the world was struggling with downtime and fail whales, the team at Google was taking a different approach to creating a scalable architecture. Rather than trying to scale services vertically on mainframe servers, they opted for a distributed system on commodity hardware. Initially, this exacerbated reliability issues, as it combined the standard challenges of distributed systems (e.g. networking, load balancing, consistency, availability, data partitioning performance) with an increased likelihood of node failure from commodity hardware. It became apparent fairly quickly that they could not scale this system without writing software to manage it, so they got to work and eventually created Borg. In doing so (and in developing iterations and improvements over the course of many years), they established the patterns and internal primitives of container orchestration and scheduling.
As the system matured, the underlying primitives became increasingly important and even spawned an internal ecosystem of tools developed at Google (e.g. Borgmon and Spanner). These tools furthered the standardization of application architecture, allowing even more reliable services to be offered. Google eventually published a book on these tools, techniques, and the organizational processes that put them into practice, titled Site Reliability Engineering.
From an application development perspective, the important thing about Borg was that the SRE team that developed and managed Borg would take over responsibility for operating the applications that ran on it. For an application developer, this meant that as long as they deployed their apps with Borg and followed all the requisite patterns (telemetry output, fault recovery, etc.), they were able to hand off significant operational responsibility to the SRE team. As long as they used the patterns and primitives that Borg prescribed, they could focus on application development, not operations (ironically, this meant that they had to understand the patterns and primitives of operations, but they wouldn’t have to perform the operations tasks themselves).
From Google to Everyone Else
Twitter was one of the first companies to realize that Google was onto something with Borg, and they wanted to mimic it. When one of the founders of the OSS project Mesos came to visit their office, they saw that Mesos had similar concepts for container orchestration and scheduling. Eventually Mesos and the ecosystem of tools Twitter built around it (like their service mesh) allowed Twitter to provide the globally available and reliable service that we are all familiar with today (and to officially retire the fail whale). Since 2011, Facebook has been working on a similar proprietary system called Tupperware, which they have only recently started to discuss publicly. It takes a container-centric approach to cluster management and has many components that map onto Kubernetes.
Then, in 2013, Docker kicked off the container craze and Google quickly realized that they could compete better with AWS if containers overtook VMs. In mid-2014, they open sourced Kubernetes, a next-generation version of Borg (for a very detailed look at the origins, check out How Kubernetes came to rule the world). It didn’t take long for many other vendors in the industry (CoreOS, Red Hat, Pivotal, IBM, Microsoft, and eventually Amazon, Mesosphere and Docker) to embrace Kubernetes as the de facto platform for cloud-native applications and start offering a managed or supported version of it. Simultaneously, organizations large and small began adopting Kubernetes as the platform on which their stack would be built. These end-users were ripping out both custom and fragmented commercial solutions for deploying their applications and replacing what they could with Kubernetes and its ecosystem. For these organizations, the benefits were clear: one less problem to solve, one more way to standardize their processes on proven tools, less vendor lock-in, and the promise of continually improving tools.
Ubiquity Creates an Ecosystem and New Opportunities
During Kubernetes’ ascendance, countless “secondary vendors” recognized that the neutrality and momentum of Kubernetes made it the perfect ecosystem with which to align. These secondary vendors weren’t selling managed or supported versions of Kubernetes itself. Rather, they were capitalizing on the impacts of Kubernetes adoption. Everything in the infrastructure layer and SDLC had to have a container story and, eventually, a Kubernetes-native integration: networking, storage, testing, monitoring, CI, security, compliance, etc. (check out the CNCF Landscape to grasp the breadth of this ecosystem). As is the case with any new platform shift, the change opened the door for startups to compete on new fronts with incumbent software vendors whose technologies were mainly focused on VMs. This has driven significant venture capital interest, acquisitions, pivots, rebrands and internal investments from incumbents. It feeds a virtuous cycle in which improved tools and components speed end-user adoption, which in turn opens up more market opportunity and investment from vendors.
Moving beyond the primary Kubernetes vendors and the secondary vendors reinventing various parts of the SDLC to be Kubernetes-first, the success of this shift will have a variety of downstream impacts and business value for Kubernetes end-users:
- Instead of teams building alternative implementations of similar patterns, they can now focus on adding value on top of the existing primitives and move up a layer.
- Developers can now join a company and expect a similar set of patterns and primitives for software development (similar to the value offered by having a common toolset of GitHub, Jira, Slack, etc.).
- Software is more portable and reusable with much of the operational knowledge baked into the Kubernetes manifests. This matters for internally developed components, open source software, and commercial off-the-shelf (COTS) software (a trend we’ve coined “Modern On-prem”).
- More reliable software means less downtime for the services we all use day-to-day.
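To make the point about operational knowledge being baked into manifests more concrete, here is a minimal sketch of a Kubernetes Deployment manifest built as a plain Python dictionary and printed as JSON (which Kubernetes accepts alongside YAML). The application name, image, port, and the `/healthz` and `/ready` endpoints are hypothetical placeholders, and `build_deployment` is an illustrative helper, not part of any Kubernetes library. The replica count, health probes, and resource requests are the kind of operational decisions that travel with the software.

```python
import json


def build_deployment(name, image, replicas=3, port=8080):
    """Return a Deployment manifest as a dict (illustrative helper)."""
    labels = {"app": name}
    container = {
        "name": name,
        "image": image,
        "ports": [{"containerPort": port}],
        # Restart the container if it stops answering (fault recovery).
        "livenessProbe": {
            "httpGet": {"path": "/healthz", "port": port},  # hypothetical endpoint
            "periodSeconds": 10,
        },
        # Only route traffic to the container once it reports ready.
        "readinessProbe": {
            "httpGet": {"path": "/ready", "port": port},  # hypothetical endpoint
            "periodSeconds": 5,
        },
        # Tell the scheduler how much capacity the container needs.
        "resources": {"requests": {"cpu": "100m", "memory": "128Mi"}},
    }
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name, "labels": dict(labels)},
        "spec": {
            "replicas": replicas,  # how many copies to keep running
            "selector": {"matchLabels": dict(labels)},
            "template": {
                "metadata": {"labels": dict(labels)},
                "spec": {"containers": [container]},
            },
        },
    }


if __name__ == "__main__":
    # Placeholder name and image, chosen only for illustration.
    manifest = build_deployment("example-web", "example.com/example-web:1.0.0")
    print(json.dumps(manifest, indent=2))
```

Whoever receives this manifest, whether an internal team or a customer running the software on their own cluster, gets the vendor's assumptions about scaling and health checking along with the container image.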
Ultimately, it isn’t necessarily that Kubernetes matters. What matters is that the industry has unified on a single open platform to power the future of developing and deploying reliable software at scale. We’re still in the early innings of the impact that this will have, but all the early signs point to it being a tectonic shift.