Provisioning a Kubernetes Cluster
Our recommendation is to automate the setup of 2-4 Kubernetes clusters as managed services from one of the major cloud providers: GKE, AKS, or EKS. If your organization has a large datacenter, you’ll also find that the large “hybrid” solutions are pretty solid: PKS, VKS, OpenShift (but we’re not going into detail on those). We believe that IaaS-managed Kubernetes clusters give organizations the best foundation for managing the Kubernetes runtime (though running your own Kubernetes with kubeadm or kops isn’t out of the question).
Setup & Automation
To start, you’ll need some experience with your IaaS provider of choice (for Kubernetes, many folks love Google Cloud), and you’ll already need an account, a project, a VPC, etc. We also suggest setting this up from the beginning with automation; Terraform is the most common tool for the job. Regardless of whether you are building your own cluster or using a managed service, the more Terraform you use, the more maintainable your cluster will be.
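As a sketch of what this automation might look like, here is a minimal Terraform configuration for a managed GKE cluster. The project ID, region, and cluster name are placeholders, and a real setup would also define networking, node pools, and state storage.

```hcl
# Hypothetical minimal managed cluster (GKE) defined in Terraform.
# Project, region, and names are placeholders -- adjust for your environment.
provider "google" {
  project = "my-project-id"
  region  = "us-central1"
}

resource "google_container_cluster" "staging" {
  name               = "staging"
  location           = "us-central1"
  initial_node_count = 3
}
```

An equivalent resource exists for each provider (e.g. `aws_eks_cluster`, `azurerm_kubernetes_cluster`), so the same workflow carries across clouds.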
From there, we suggest a workflow with multiple environments, each in its own Kubernetes cluster. With a staging and a production cluster, you can run acceptance and validation tests in staging before deploying to production. You could do all of this in one cluster for application-level UAT, but a separate staging environment is especially useful when deploying new backing services (like networking) or upgrading the underlying Kubernetes version (e.g. 1.14 -> 1.15).
Any production-grade Kubernetes cluster will need to implement some amount of observability. The pillars of observability are logging, monitoring and tracing.
Generally, we’ll want to take stdout & stderr from each pod and push them to a centralized SIEM. This is a fairly standardized practice and doesn’t require a lot of explanation. Tools like Fluentd and Beats are handy OSS components for exporting logs. Most commercial log aggregators include an agent that runs in your cluster to collect and ship log data.
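A common pattern for this is to run the log collector as a DaemonSet so one agent on each node picks up container stdout/stderr from the node’s filesystem. The sketch below assumes Fluentd shipping to Elasticsearch; the image tag, namespace, and destination are illustrative, and a production setup would add RBAC and output configuration.

```yaml
# Illustrative sketch: Fluentd as a DaemonSet collecting each node's
# container logs (written under /var/log) for shipping to a central store.
# Image and namespace are placeholders -- adjust to your log backend.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluentd
  namespace: logging
spec:
  selector:
    matchLabels:
      app: fluentd
  template:
    metadata:
      labels:
        app: fluentd
    spec:
      containers:
        - name: fluentd
          image: fluent/fluentd-kubernetes-daemonset:v1-debian-elasticsearch
          volumeMounts:
            - name: varlog
              mountPath: /var/log
      volumes:
        - name: varlog
          hostPath:
            path: /var/log
```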
Monitoring, visualization, and alerting formats aren’t quite as standardized. Prometheus is the clear leader in the Kubernetes ecosystem, but StatsD still has the strongest enterprise adoption. In order to consume metrics from a wide variety of applications, it may be necessary to support multiple metric formats to start. Prometheus adapters like the Prometheus statsd_exporter can help bridge this gap.
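One way to bridge the formats is to run statsd_exporter as a sidecar: the application keeps emitting StatsD, and the sidecar re-exposes those metrics in Prometheus format. The sketch below is an assumption-laden example; the app image and ports are illustrative (9125 and 9102 are the exporter’s default StatsD ingest and metrics ports).

```yaml
# Hypothetical sketch: statsd_exporter as a sidecar so a StatsD-emitting
# app can be scraped by Prometheus. App image and names are illustrative.
apiVersion: v1
kind: Pod
metadata:
  name: legacy-app
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "9102"
spec:
  containers:
    - name: app
      image: example.com/legacy-app:1.0   # sends StatsD to localhost:9125
    - name: statsd-exporter
      image: prom/statsd-exporter:latest
      ports:
        - containerPort: 9102   # Prometheus metrics endpoint
        - containerPort: 9125   # StatsD ingest (UDP)
          protocol: UDP
```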
Tracing is, at its core, mostly about discovering where time is spent in source code, so it could be considered out of scope for Modern On-prem. That said, applications that emit distributed tracing data can offer a lot of insight when diagnosing cascading failures in a system. The CNCF OpenTracing documentation has a good list of OpenTracing implementations available today.
You’ll need to work with application vendors to get clear direction on the thresholds they usually alert on. Generally, applications will ship with standard readiness and liveness probes, but these are usually not enough in a distributed application. You’ll want to ensure you have application-level health checks tracking metrics that can be derived from disparate processes.
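For reference, the standard per-container probes look like the sketch below; the endpoint paths, port, and timings are assumptions that a vendor would normally specify.

```yaml
# Illustrative readiness/liveness probes; paths, port, and timings are
# placeholder values -- vendors typically document the real ones.
apiVersion: v1
kind: Pod
metadata:
  name: example-app
spec:
  containers:
    - name: app
      image: example.com/app:1.0
      readinessProbe:
        httpGet:
          path: /healthz/ready
          port: 8080
        initialDelaySeconds: 5
        periodSeconds: 10
      livenessProbe:
        httpGet:
          path: /healthz/live
          port: 8080
        initialDelaySeconds: 15
        periodSeconds: 20
```

These probes tell the kubelet whether a single container is healthy; the application-level checks discussed above have to aggregate signals across many such processes.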
When deploying a Modern On-prem application, having OTS developer-defined, pre-baked healthchecks and alerting can rapidly accelerate time-to-recovery, and frees your operators from having to become experts in the software being deployed. Installing operators and other runtime plugins to facilitate this will enable both 1st- and 3rd-party development teams to ship more reliable, operable software.
Bundled Alerting with the Prometheus Operator
When you deploy a tool like Prometheus Operator, alerting thresholds can be bundled with each application as Kubernetes YAML, enabling application vendors to codify their recommended alerting thresholds.
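A vendor-shipped rule might look like the following PrometheusRule sketch. The metric name, threshold, and labels are illustrative assumptions; the `release` label shown is a common way Prometheus Operator installations select which rules to load.

```yaml
# Sketch of a vendor-bundled alerting threshold as a PrometheusRule CRD.
# Metric, threshold, and labels are illustrative.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: example-app-alerts
  labels:
    release: prometheus
spec:
  groups:
    - name: example-app
      rules:
        - alert: HighErrorRate
          expr: rate(http_requests_total{status=~"5.."}[5m]) > 0.05
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Error rate above vendor-recommended threshold"
```

Because the rule is just Kubernetes YAML, it can ship in the same bundle as the application’s Deployments and Services.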
Composite Health Checks: Application CRD
In a similar vein, the sig-apps Application CRD can define a distributed health check that determines whether an entire distributed application is healthy.
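As a sketch, an Application resource groups the components that make up one logical application so their aggregate status can be reported in one place. Names and component kinds below are illustrative.

```yaml
# Sketch of the sig-apps Application CRD grouping an app's components so
# aggregate health can be surfaced in one object; names are illustrative.
apiVersion: app.k8s.io/v1beta1
kind: Application
metadata:
  name: example-app
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: example-app
  componentKinds:
    - group: apps
      kind: Deployment
    - group: ""
      kind: Service
  descriptor:
    type: example-app
    version: "1.0"
```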