Observability

Observability in Modern On-prem Applications

As a vendor, it’s important to ensure an application is highly operable from an observability standpoint. Many SaaS and on-prem software teams are familiar with these concepts, but delivering Modern On-prem applications means more than just exposing metrics, logs, and traces. When the team operating the software (the end user) is not expert in its operation, it’s vital to focus on delivering actionable insights.

Bundled Alerting with Prometheus Operator

When end users deploy a tool like Prometheus Operator, vendors can bundle alerting rules with an application as Kubernetes YAML (for example, as PrometheusRule resources), codifying their recommended alerting thresholds. This is a vast improvement over just exposing metrics, or even publishing “recommended thresholds” in user-facing documentation.
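As a rough illustration of what a bundled alert could look like, the sketch below builds a PrometheusRule manifest in Python and prints it as JSON, which Kubernetes accepts alongside YAML. The application name, the `http_requests_total` metric, the 5% threshold, and the `build_prometheus_rule` helper are all assumptions made for the example, not recommendations from this document.

```python
import json


def build_prometheus_rule(app_name: str, error_rate_threshold: float) -> dict:
    """Return a PrometheusRule manifest carrying one vendor-recommended alert."""
    # Hypothetical metric name; substitute one the application really exports.
    expr = (
        f'sum(rate(http_requests_total{{app="{app_name}",code=~"5.."}}[5m]))'
        f' / sum(rate(http_requests_total{{app="{app_name}"}}[5m]))'
        f" > {error_rate_threshold}"
    )
    return {
        "apiVersion": "monitoring.coreos.com/v1",
        "kind": "PrometheusRule",
        "metadata": {
            "name": f"{app_name}-alerts",
            "labels": {"app": app_name},
        },
        "spec": {
            "groups": [
                {
                    "name": f"{app_name}.rules",
                    "rules": [
                        {
                            "alert": "HighErrorRate",
                            "expr": expr,
                            # Fire only if the condition holds for 10 minutes.
                            "for": "10m",
                            "labels": {"severity": "warning"},
                            "annotations": {
                                "summary": f"{app_name} is returning an elevated rate of 5xx responses",
                            },
                        }
                    ],
                }
            ]
        },
    }


if __name__ == "__main__":
    # Illustrative values only: "example-app" and a 5% error-rate threshold.
    print(json.dumps(build_prometheus_rule("example-app", 0.05), indent=2))
```

Depending on how the end user's Prometheus resource selects rules, the manifest's labels may need to match that selector before the alert is picked up.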

Composite Health Checks: Application CRD

In a similar vein, the sig-apps Application CRD can define a distributed health check that determines whether an entire distributed application is healthy. This check can be included in the CRD, with or without the operator, as a way of codifying which services in an application need to be up for it to be considered “healthy”.
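The idea of a composite health check can be sketched independently of any particular CRD schema. The example below is a minimal illustration in Python, not the Application CRD's own API: the `ComponentStatus` type, its fields, and the component names are hypothetical. An application counts as healthy only when every component marked as required has at least one ready replica.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ComponentStatus:
    """Observed state of one service in the application (hypothetical shape)."""
    name: str
    ready_replicas: int
    required: bool = True


def unhealthy_components(components: list[ComponentStatus]) -> list[str]:
    """Names of required components that currently have no ready replicas."""
    return [c.name for c in components if c.required and c.ready_replicas <= 0]


def application_healthy(components: list[ComponentStatus]) -> bool:
    """The application is healthy only if no required component is down."""
    return not unhealthy_components(components)


if __name__ == "__main__":
    observed = [
        ComponentStatus("api", ready_replicas=2),
        ComponentStatus("database", ready_replicas=0),
        ComponentStatus("report-worker", ready_replicas=0, required=False),
    ]
    print(application_healthy(observed), unhealthy_components(observed))
```

Marking a component as optional keeps a degraded but usable application from being reported as down, which is one way a vendor might encode what “healthy” means for their product.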

Disconnected troubleshooting

One of the core selling points of Modern On-prem is that end customers won’t need to become experts in each off-the-shelf (OTS) application they’re deploying. When things go wrong, exposing metrics, logs, and traces to an end customer is sometimes not sufficient for them to self-diagnose the issue. In these cases, vendors must give their users a simple way to export the relevant information into a diagnostic bundle that can be sent to the vendor’s team for disconnected troubleshooting.
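A diagnostic bundle can be as simple as a compressed archive of already-collected text artifacts. The sketch below shows one possible shape using Python's standard `tarfile` module; the `build_bundle` helper, the `bundle/` prefix, and the artifact names are assumptions for illustration, and gathering the logs and configuration in the first place is left out.

```python
import io
import tarfile
import time


def build_bundle(artifacts: dict[str, str], out_path: str) -> list[str]:
    """Write text artifacts into a gzip-compressed tar archive.

    `artifacts` maps a relative file name to its text content.
    Returns the archive member names in the order they were written.
    """
    written = []
    with tarfile.open(out_path, "w:gz") as bundle:
        # Sort so the archive layout is stable between runs.
        for name, text in sorted(artifacts.items()):
            data = text.encode("utf-8")
            info = tarfile.TarInfo(name=f"bundle/{name}")
            info.size = len(data)
            info.mtime = int(time.time())
            bundle.addfile(info, io.BytesIO(data))
            written.append(info.name)
    return written


if __name__ == "__main__":
    # Hypothetical artifacts; a real tool would collect these from the cluster.
    members = build_bundle(
        {"logs/app.log": "started\n", "config/app.yaml": "replicas: 3\n"},
        "support-bundle.tar.gz",
    )
    print(members)
```

Because the result is a single file, the end customer can inspect it before sending it and can transfer it by whatever channel their environment allows.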

Telemetry in Modern On-prem Applications

Most commercial software vendors need to get some amount of telemetry back from their Modern On-prem deployments. The end customer can generally disable this data, but many choose to leave it on because it gives the vendor crucial insights that contribute to their mutual success. At the start, this is often usage data (DAU, WAU, etc.) for core product features. It often evolves to include application health information and insights into the underlying resources and infrastructure where the application is deployed.
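One way to keep telemetry opt-out friendly is to make the enabled flag the first thing the reporting code checks. The following is a minimal sketch under assumed names: the `build_telemetry_payload` helper and the payload fields are hypothetical, and actually sending the payload is left out.

```python
import json
from typing import Optional


def build_telemetry_payload(
    enabled: bool,
    daily_active_users: int,
    weekly_active_users: int,
    app_version: str,
) -> Optional[str]:
    """Return a JSON usage report, or None when the customer has opted out."""
    if not enabled:
        # Respect the opt-out before any data is assembled.
        return None
    payload = {
        "app_version": app_version,
        "usage": {"dau": daily_active_users, "wau": weekly_active_users},
    }
    return json.dumps(payload, sort_keys=True)


if __name__ == "__main__":
    print(build_telemetry_payload(True, 12, 40, "1.2.3"))
    print(build_telemetry_payload(False, 12, 40, "1.2.3"))
```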

Join the Community

If you’re interested in this topic (agree or disagree), we’d love to have you join the community.