Disconnected Troubleshooting

Whether you’re delivering to an internet-connected cloud VPC or a fully airgapped datacenter, it’s important to develop competency in disconnected troubleshooting: the ability to troubleshoot an application when its maintainers don’t have direct access to the underlying infrastructure.

Why not connected troubleshooting

Software vendors will often allow users to deploy on-prem, on the condition that the vendor receives persistent shell/kubectl access to the infrastructure. As discussed in the Vendor FAQ, it’s inadvisable to rely on the ability to directly install and troubleshoot on-prem applications. While doing so would remove the burden of installation and maintenance from the end customer, it would also require them to place a lot more trust in the software vendor, undermining the core data security benefits of Modern On-prem delivery, particularly for high-risk data.

Declarative Disconnected Troubleshooting

One of the core selling points of Modern On-prem is that end customers won’t need to become experts in each OTS application they’re deploying. When things go wrong, exposing metrics, logs, and traces to an end customer sometimes won’t be enough for them to self-diagnose the issue. In these cases, vendors should give users a simple way to export the relevant information into a diagnostic bundle to send back to the vendor team. Tools used for this should have the following characteristics:

  • End users should be able to click a single button in a web UI or run a single shell command to export all relevant debugging information, including logs, metrics, and Kubernetes object descriptions.

  • Exported data should be redacted automatically. For example, environment variables and secrets that appear to contain passwords or tokens should be scrubbed from the bundle (see the sketch after this list).

  • When generating a diagnostic bundle, the default behavior should be to only expose this bundle to the end customer. While it’s valuable to enable an end customer to easily send a bundle to a vendor team, it’s almost always preferable to give end users the opportunity to review it and perform their own redaction before sharing it.
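
To make the redaction requirement concrete, the following is a minimal sketch of a declarative redaction spec, written in the style of the troubleshoot.sh Redactor kind. The field names are recalled from that project’s format, and the regex and placeholder value are illustrative assumptions rather than a schema to copy verbatim.

    # Illustrative redaction spec, modeled loosely on the troubleshoot.sh
    # Redactor kind; the regex and placeholder value below are assumptions
    # for the sketch, not production-ready rules.
    apiVersion: troubleshoot.sh/v1beta2
    kind: Redactor
    metadata:
      name: default-redactions
    spec:
      redactors:
        - name: scrub-credential-looking-values
          removals:
            # Mask anything shaped like password=..., token: ..., secret=...
            regex:
              - redactor: '(?i)(password|token|secret)\s*[:=]\s*\S+'
        - name: scrub-known-literals
          removals:
            # Literal strings (e.g. a known license key) to strip wherever
            # they appear in the bundle.
            values:
              - REPLACE_WITH_KNOWN_SECRET_VALUE

Shipping a default set of redactions like this makes bundles safer to share by default, while still leaving end users the opportunity to review and further redact a bundle before sending it.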

Doing this well will drastically reduce the amount of time spent supporting enterprise users, and will minimize the amount of back-and-forth when debugging with the end user.

Developing the capability to evolve this troubleshooting process as new failure modes are discovered means building or leveraging a declarative framework. While a bash script that runs a bunch of kubectl commands might do in a pinch, such scripts can quickly become unwieldy and difficult to maintain. As problems occur, vendors will want a way to evolve these tools, adding new commands and files to be collected as part of the bundle. Having app developers maintain a structured YAML or JSON file that defines what to collect scales more cleanly than having them collaborate on a bash/python/ruby/nodejs script. The most common things to collect in a KOTS application are listed below, followed by a sketch of what such a spec can look like:

  • Collecting kubectl get -o json output for Deployments, Services, DaemonSets, etc. in an application

  • Collecting stdout/stderr logs from application containers

  • Copying files from inside application pods

  • Capturing the stdout of commands executed inside application containers (e.g. pinging an internal healthcheck endpoint, or generating a diagnostic dump from an embedded database)

  • Performing HTTP and gRPC requests against service endpoints from outside the pod network (to test service/ingress routing)
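
As a concrete illustration, here is a minimal collection spec in the style of the troubleshoot.sh SupportBundle format, covering each of the items above. The namespaces, label selectors, paths, commands, and URLs are hypothetical placeholders for an application called my-app, and the field names are recalled from the troubleshoot format rather than quoted from its documentation.

    # Minimal collection spec in the style of a troubleshoot.sh SupportBundle.
    # Namespaces, selectors, paths, commands, and URLs are hypothetical
    # placeholders for an application called "my-app".
    apiVersion: troubleshoot.sh/v1beta2
    kind: SupportBundle
    metadata:
      name: my-app
    spec:
      collectors:
        - clusterInfo: {}        # Kubernetes version and cluster metadata
        - clusterResources: {}   # Deployments, Services, DaemonSets, etc. as JSON
        - logs:                  # stdout/stderr from application containers
            selector:
              - app=my-app
            namespace: my-app
        - copy:                  # files copied out of application pods
            selector:
              - app=my-app
            namespace: my-app
            containerPath: /var/log/my-app
        - exec:                  # a command run inside an application container
            name: db-diagnostics
            selector:
              - app=my-app-postgres
            namespace: my-app
            command: ["pg_isready"]
            timeout: 10s
        - http:                  # an HTTP request against a service endpoint
            collectorName: api-health
            get:
              url: http://my-app.my-app.svc.cluster.local:8080/healthz

Because the spec is data rather than a script, adding a new log source or diagnostic command when a new failure mode shows up is a small, reviewable change instead of another branch in a shared shell script.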

Analyzing diagnostic information

While having a diagnostic bundle is valuable on its own, the best way to improve the efficiency of disconnected troubleshooting is to leverage tooling that analyzes the collected information. Rather than poring through log files every time something goes wrong, application vendors should focus on building tools that detect and surface common problems. Such problems range from widespread issues like under-provisioned hardware or a lack of disk space all the way to application-specific failures signaled by particular warning messages. An analysis framework can search through logs and detect every common problem a support team has ever encountered, ensuring that time is not wasted re-investigating problems that have already been resolved in the past. When properly implemented, analysis tools can vastly reduce not only the time-to-recovery, but also the amount of support resources teams need to deploy in general.
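
Continuing the YAML spec style from the previous section, analyzers for the two examples above (under-provisioned hardware and a known application-specific warning) might look something like the sketch below. It follows the shape of troubleshoot.sh analyzers, but the thresholds, file path, regex, and messages are illustrative assumptions.

    # Sketch of analyzers layered on top of the collected bundle; they live
    # in the same SupportBundle spec as the collectors. Thresholds, paths,
    # regexes, and messages are illustrative assumptions.
    apiVersion: troubleshoot.sh/v1beta2
    kind: SupportBundle
    metadata:
      name: my-app
    spec:
      analyzers:
        - nodeResources:          # flag under-provisioned hardware
            checkName: node-memory
            outcomes:
              - fail:
                  when: "min(memoryCapacity) < 8Gi"
                  message: Every node needs at least 8 GiB of memory.
              - pass:
                  message: All nodes have sufficient memory.
        - textAnalyze:            # surface a known application-specific failure
            checkName: db-connection-errors
            fileName: my-app/*.log
            regex: "connection refused"
            outcomes:
              - fail:
                  when: "true"
                  message: The application logs show it could not reach its database.
              - pass:
                  when: "false"
                  message: No database connection errors found in the logs.

Each new support escalation that gets root-caused can be encoded as another analyzer, so the same problem never has to be diagnosed by hand twice.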

Implementation

Such an analysis framework really shines when it can be embedded into an application, allowing the end user to self-serve through common problems without installing additional tools. For this reason, vendors should build their internal analysis tooling with the same languages and frameworks their core team uses to develop the application(s). For example, if an app frontend is written in React, any internal visualization tools for troubleshooting should also be built in React, so that teams have the option to embed the same diagnostic UI in the end user’s application UI. For a Kubernetes-based implementation, tools like troubleshoot and kots can provide deep collection and analysis with minimal work from engineering teams.

Benefits

Fully leveraging an automated analysis framework on top of diagnostic collection empowers end customers and tier 1 support teams to resolve more issues without escalating. Furthermore, vendors may find that building such capabilities for disconnected troubleshooting allows them to diagnose issues even faster than when direct access to instances is provided.

Join the Community

If you’re interested in this topic (agree or disagree), we’d love to have you join the community.