Vendor Reliability Questionnaire

Goal: Produce a basic questionnaire that an enterprise can send to an application vendor that wants to deliver a Modern On-prem application to it. The questionnaire is designed to resemble the Vendor Security Questionnaires in use today.

Prerequisites & Provisioning

Questions in this section are designed to help the enterprise understand the requirements of the application and whether it can be deployed to existing hardware or new “standard” hardware (the enterprise defines “standard”), or whether it requires anything unusual for the enterprise. This covers hardware, non-bundled software, and training on systems and processes.

  1. Does the application require specialized hardware?
  2. Does the application run on Linux, Windows, or both?
  3. Will the application run on a security-hardened Linux?
  4. Can the application run on any cloud provider? List the cloud providers it can run on.
  5. Does the application require specific cloud provider resources (Redshift, ECS, etc.)?
  6. Can the application run on Kubernetes?
  7. Can the application run in any namespace?
  8. What RBAC permissions will be required outside of the namespace?

Installing & Configuring

Questions in this section get a little more specific about how the application is delivered. This section is designed to help the enterprise understand what the vendor is responsible for and what the enterprise will be responsible for. Additionally, it helps the enterprise start to understand how they will be able to edit the installable assets to make them compatible with existing systems (networking, workflows, etc.).

  1. What methods can be used to add secrets at configuration time?
  2. What secret stores does the app support at runtime?
  3. Does the application create its own TLS certs?
  4. Will the application run with certs generated from an enterprise CA?
  5. How does the application handle renewal of TLS certs?
  6. If the application runs on Kubernetes, is the Kubernetes YAML exposed for editing or patching?
  7. Is the SecurityContext configurable?
  8. Is it possible to tune liveness and readiness probes?
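
To make the secrets questions above concrete, here is one shape a vendor's answer might take. This is a minimal sketch under stated assumptions, not the behavior of any particular product: the `load_secret` helper, the `<NAME>_FILE` convention (a path to a mounted secret file, as with a Kubernetes Secret volume), and the `DB_PASSWORD` name are all hypothetical.

```python
import os
from pathlib import Path


def load_secret(name, environ=None):
    """Return the secret `name`, preferring a mounted file over an env var.

    Hypothetical convention: if NAME_FILE is set, it is the path of a file
    holding the secret; otherwise NAME holds the value directly.
    """
    env = os.environ if environ is None else environ
    file_path = env.get(f"{name}_FILE")
    if file_path:
        # Drop the trailing newline that editors and `echo` tend to add.
        return Path(file_path).read_text(encoding="utf-8").rstrip("\n")
    if name in env:
        return env[name]
    raise KeyError(f"secret {name!r} not provided via {name} or {name}_FILE")
```

An enterprise that rotates secrets through mounted files would want to know whether the application re-reads the file or only reads it at startup; a lookup like this one reads the value each time it is called.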

Updating & Upgrading

Questions in this section help the enterprise understand the cadence and process (i.e., effort) involved with each update. Answers about update frequency and delivery mechanism build on the earlier questions; the goal here is to help the enterprise plan for the resources required to keep the software up to date.

  1. Does the application support zero-downtime updates?
  2. How will updates be applied?
  3. How are update notifications handled (for example, through a Git pull request)?
  4. How are container images delivered?
  5. Can images be retagged and pushed to a local registry for security scanning?

Operating

Questions in this section help the enterprise understand the day-to-day operational tasks they will be responsible for managing. All software requires some routine operational tasks, and some can be delivered as automated tasks, while some are manual efforts. When planning for resources required to run software, the enterprise should budget for any routine (daily, weekly, monthly) manual tasks that require intervention to keep the system running properly.

  1. What regular tasks are required to maintain steady-state operation of the system?
  2. What is required for database maintenance?
  3. How do you facilitate user cleanup?
  4. Are there caches that must be purged manually?

Inter-Service Communication

  1. Do all internal API interfaces require authentication and authorization? Do communications use mTLS? Do they rely on a service mesh?
  2. Does the application follow the principle of least privilege (for example, always creating a dedicated user and never running as root)?

Testing

  1. What automation scripts can be provided with the application to load test after installation?
  2. How is an installation verified to be running properly?
  3. What automated post-installation conformance testing tools are provided?

High Availability

  1. Does the application support multi-region deployments?
  2. Does the application support multi-cloud deployments?
  3. Which components cannot be clustered or rely on a single instance?

Observability (Monitoring, Logging, Tracing)

  1. What logging destinations (file, stdout/stderr, etc.) does the application support?
  2. What volume of logging data is generated by the application?
  3. Does the application support structured logs?
  4. Does the application support unstructured logs?
  5. Are secrets, PII, and other sensitive information excluded from logs?
  6. What’s the default log level of the components?
  7. Can the log level be changed without restarting the application?
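
Several of the logging questions above (structured logs, exclusion of sensitive fields, changing the log level without a restart) can be illustrated together. The following is a sketch using Python's standard `logging` module, not a description of how any vendor implements logging. The logger name `vendor_app`, the `fields` attribute, and the `SENSITIVE_KEYS` list are assumptions chosen for the example.

```python
import io
import json
import logging

SENSITIVE_KEYS = {"password", "token", "ssn"}  # illustrative, not exhaustive


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record):
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Extra fields arrive via `extra={"fields": {...}}`; redact sensitive ones.
        for key, value in getattr(record, "fields", {}).items():
            payload[key] = "[REDACTED]" if key.lower() in SENSITIVE_KEYS else value
        return json.dumps(payload, sort_keys=True)


def build_logger(stream, level=logging.INFO):
    """Create a logger that writes JSON lines to `stream`."""
    logger = logging.getLogger("vendor_app")  # hypothetical logger name
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(level)
    logger.propagate = False
    return logger


def set_level(logger, level_name):
    """Change verbosity on the live logger; the process keeps running."""
    logger.setLevel(level_name.upper())


buf = io.StringIO()
log = build_logger(buf)
log.debug("not emitted at the default INFO level")
set_level(log, "debug")
log.debug("login", extra={"fields": {"user": "alice", "password": "hunter2"}})
print(buf.getvalue(), end="")
```

How `set_level` gets triggered in production (a signal, an admin endpoint, a watched config file) is deployment-specific and is exactly what question 7 asks the vendor to describe.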

Storage & State

  1. Does the application support bring-your-own (BYO) state (database, object storage, block storage)?
  2. How is the database configured (db engine parameters)?
  3. How are schema migrations applied to the database during upgrades? Can they be rolled back? Are they idempotent? Do they require downtime?
  4. How are data migrations handled in the application?
  5. Where is all data stored?
  6. What data leaves the deployed system and is sent to third parties (including the vendor)?
  7. Does the application require shared access to block storage?
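
The question about idempotent schema migrations can be made concrete with a small example. This is a sketch of one common approach (a version-tracking table), using SQLite only because it ships with Python. The `schema_migrations` table name and the two migrations are hypothetical, and a real product's migration tooling may work quite differently.

```python
import sqlite3

# Hypothetical migrations as (version, SQL) pairs.
MIGRATIONS = [
    (1, "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL)"),
    (2, "ALTER TABLE users ADD COLUMN email TEXT"),
]


def migrate(conn):
    """Apply pending migrations in order and return the versions applied."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS schema_migrations (version INTEGER PRIMARY KEY)"
    )
    done = {row[0] for row in conn.execute("SELECT version FROM schema_migrations")}
    applied = []
    for version, sql in MIGRATIONS:
        if version in done:
            continue  # already recorded, so re-running is a no-op
        with conn:  # commits on success, rolls back if an exception escapes
            # Open the transaction explicitly so the schema change and its
            # bookkeeping row are committed (or rolled back) together.
            conn.execute("BEGIN")
            conn.execute(sql)
            conn.execute(
                "INSERT INTO schema_migrations (version) VALUES (?)", (version,)
            )
        applied.append(version)
    return applied


conn = sqlite3.connect(":memory:")
print(migrate(conn))  # first run applies both migrations
print(migrate(conn))  # second run finds nothing to do
```

Running `migrate` twice is safe because applied versions are recorded, which is the property the enterprise is asking about. Rollback and downtime behavior are separate concerns that this sketch does not address.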

Troubleshooting

  1. How will you troubleshoot the system without remote access if something isn’t working properly?
  2. Can recent diagnostic information be collected if an error or downtime event is reported?

Networking

  1. What ingress options are required?
  2. Which services require ingress, or IP addresses or DNS names that are accessible from outside the cluster?
  3. Does the application require a subdomain or can it be deployed on a path?
  4. What external endpoints are required to be accessible for the application to start?
  5. What external endpoints are required to be accessible for the application to run?
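
Once the vendor has listed the external endpoints required to start and run, the enterprise can check reachability from inside its own network before installing. The following is a minimal sketch that only tests whether a TCP connection can be opened. It does not validate TLS, proxies, or application-level responses, and the `check_endpoints` helper is an illustration rather than a vendor-supplied tool.

```python
import socket


def check_endpoints(endpoints, timeout=3.0):
    """Return {"host:port": bool} based on a plain TCP connect to each endpoint."""
    results = {}
    for host, port in endpoints:
        key = f"{host}:{port}"
        try:
            # A successful connect means the port is reachable from this machine.
            with socket.create_connection((host, port), timeout=timeout):
                results[key] = True
        except OSError:
            # Covers refused connections, timeouts, and DNS failures.
            results[key] = False
    return results
```

It would be called with the vendor's actual list, for example `check_endpoints([("registry.vendor.example", 443)])`, where the hostname shown here is a placeholder.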

Join the Community

If you’re interested in this topic (agree or disagree), we’d love to have you join the community.