Running applications in production can be tricky. This post proposes an opinionated checklist for going to production with a web service (i.e. an application exposing an HTTP API) on Kubernetes.
General
- Application’s name, description, purpose, and owning team are clearly documented (e.g. in a central application registry or wiki)
- Application’s criticality level was defined (e.g. “tier 1” if the app is highly critical for the business)
- Development team has sufficient knowledge/experience with the technology stack
- Responsible 24/7 on-call team is identified and informed
- Go-Live plan exists incl. steps for potential rollback
Application
- Application’s code repository (git) has clear instructions on how to develop, how to configure, and how to contribute changes (important for emergency fixes)
- Code dependencies are pinned (i.e. hotfix changes do not accidentally pull in new libraries)
- All relevant code is instrumented with OpenTracing or OpenTelemetry
- OpenTracing/OpenTelemetry semantic conventions are followed (incl. additional company conventions)
- All outgoing HTTP calls have a defined timeout
- HTTP connection pools are configured with sane values according to expected traffic
- Thread pools and/or non-blocking async code is correctly implemented/configured
- Database connection pools are sized correctly
- Retries and retry policies (e.g. backoff with jitter) are implemented for calls to dependent services (see the retry sketch after this list)
- Circuit breakers are implemented
- Fallbacks for circuit breakers are defined according to business requirements
- Load shedding / rate limiting mechanisms are implemented (could be part of provided infrastructure)
- Application metrics are exposed for collection (e.g. to be scraped by Prometheus)
- Application logs go to stdout/stderr
- Application logs follow good practices (e.g. structured logging, meaningful messages), log levels are clearly defined, and debug logging is disabled for production by default, with the option to turn it on (see the logging sketch after this list)
- Application container crashes on fatal errors (i.e. it does not enter some unrecoverable state or deadlock)
- Application design/code was reviewed by a senior/principal engineer
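The timeout and retry items above can be illustrated with a minimal Python sketch using the requests library; the endpoint URL, timeout values, and retry parameters are illustrative assumptions, not recommendations from this checklist:

```python
import random
import time

import requests

# Hypothetical downstream endpoint -- replace with your real dependency.
DOWNSTREAM_URL = "https://orders.example.internal/api/orders"

def fetch_orders(max_attempts: int = 3, base_delay: float = 0.2) -> dict:
    """Call a dependency with a hard timeout and exponential backoff plus jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            # (connect, read) timeouts in seconds -- never make an outgoing call without one
            response = requests.get(DOWNSTREAM_URL, timeout=(1.0, 2.0))
            response.raise_for_status()
            return response.json()
        except (requests.Timeout, requests.ConnectionError, requests.HTTPError):
            if attempt == max_attempts:
                raise  # out of retries: surface the error (or let a circuit breaker trip)
            # full jitter avoids synchronized retry storms against a struggling dependency
            time.sleep(random.uniform(0, base_delay * 2 ** attempt))
```

In a real service you would typically not retry 4xx responses, and calls like this would sit behind a circuit breaker with a business-approved fallback.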
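For the logging items, here is a small sketch of structured JSON logging to stdout with a switch for debug logging; the DEBUG environment variable and the field names are assumptions:

```python
import json
import logging
import os
import sys

class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""
    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "time": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })

handler = logging.StreamHandler(sys.stdout)  # logs go to stdout, not to files
handler.setFormatter(JsonFormatter())

# Debug logging stays off in production unless explicitly enabled.
level = logging.DEBUG if os.getenv("DEBUG") == "true" else logging.INFO
logging.basicConfig(level=level, handlers=[handler])

logging.getLogger("checkout").info("order created")
```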
Security & Compliance
- Application can run as an unprivileged user (non-root); worker nodes use an immutable operating system such as CoreOS
- Application does not require a writable container filesystem (i.e. can be mounted read-only)
- HTTP requests are authenticated and authorized (e.g. using OAuth; see the token validation sketch after this list)
- Mechanisms to mitigate Denial of Service (DoS) attacks are in place (e.g. ingress rate limiting, WAF)
- A security audit was conducted
- Automated vulnerability checks for code / dependencies are in place
- Processed data is understood, classified (e.g. PII), and documented
- Threat model was created and risks are documented
- Other applicable organizational rules and compliance standards are followed
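To make the authentication item concrete, here is a sketch of validating OAuth 2.0 bearer tokens in a Flask service using PyJWT; the key location, audience, algorithm, and route are assumptions for illustration:

```python
import jwt  # PyJWT
from flask import Flask, abort, g, request

app = Flask(__name__)

# Public key of the token issuer; how it is obtained (e.g. from a JWKS endpoint) is
# deployment-specific and outside the scope of this sketch.
PUBLIC_KEY = open("/etc/keys/issuer.pem").read()

@app.before_request
def require_valid_token():
    """Reject requests that do not carry a valid bearer token."""
    auth = request.headers.get("Authorization", "")
    if not auth.startswith("Bearer "):
        abort(401)
    try:
        # Verifies signature, expiry, and audience.
        g.token_claims = jwt.decode(
            auth[len("Bearer "):],
            PUBLIC_KEY,
            algorithms=["RS256"],
            audience="my-service",
        )
    except jwt.InvalidTokenError:
        abort(401)

@app.get("/orders")
def list_orders():
    return {"orders": []}
```

Authorization (which claims or scopes allow which operations) would be checked on top of this, according to the application's business rules.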
CI/CD
- Automated code linting is run on every change
- Automated tests are part of the delivery pipeline
- No manual operations are needed for production deployments
- All relevant team members can deploy and rollback
- Production deployments have smoke tests and optionally automatic rollbacks (see the smoke test sketch after this list)
- Lead time from code commit to production is fast (e.g. 15 minutes or less including test runs)
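A smoke test can be as simple as a script that the pipeline runs right after a deployment and whose non-zero exit code triggers a rollback; the URLs below are hypothetical:

```python
"""Minimal post-deployment smoke test: exit non-zero so the pipeline can roll back."""
import sys

import requests

BASE_URL = "https://my-service.example.com"   # hypothetical service URL
CHECKS = ["/health", "/api/orders?limit=1"]   # health plus one key business route

def main() -> int:
    for path in CHECKS:
        try:
            response = requests.get(BASE_URL + path, timeout=5)
        except requests.RequestException as exc:
            print(f"FAIL {path}: {exc}")
            return 1
        if response.status_code != 200:
            print(f"FAIL {path}: HTTP {response.status_code}")
            return 1
        print(f"OK   {path}")
    return 0

if __name__ == "__main__":
    sys.exit(main())
```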
Kubernetes
- Development team is trained in Kubernetes topics and knows relevant concepts
- Kubernetes manifests use the latest API version (e.g. apps/v1 for Deployment)
- Container runs as non-root and uses a read-only filesystem
- A proper Readiness Probe was defined (see the blog post about Readiness/Liveness Probes)
- No Liveness Probe is used, or there is a clear rationale to use a Liveness Probe (see the blog post about Readiness/Liveness Probes)
- Kubernetes deployment has at least two replicas
- A Pod Disruption Budget was defined (or is automatically created, e.g. by pdb-controller)
- Horizontal autoscaling (HPA) is configured if adequate
- Memory and CPU requests are set according to performance/load tests
- Memory limit equals memory requests (to avoid memory overcommit)
- CPU limits are not set or impact of CPU throttling is well understood
- Application is correctly configured for the container environment (e.g. JVM heap, single-threaded runtimes, runtimes not container-aware)
- Single application process runs per container
- Application can handle graceful shutdown and rolling updates without disruptions (see this blog post and the shutdown sketch at the end of this section)
- Pod Lifecycle Hook (e.g. “sleep 20” in preStop) is used if the application does not handle graceful termination
- All required Pod labels are set (e.g. “application”, “component”, “environment”)
- Application is set up for high availability: pods are spread across failure domains (AZs, default behavior for cross-AZ clusters) and/or application is deployed to multiple clusters
- Kubernetes Service uses the right label selector for pods (e.g. not only matches the “application” label, but also “component” and “environment” for future extensibility)
- There are no anti-affinity rules defined, unless really required (pods are spread across failure domains by default)
- Optional: Tolerations are used as needed (e.g. to bind pods to a specific node pool)
See also this curated checklist of Kubernetes production best practices.
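To illustrate the readiness probe and graceful shutdown items, here is a standard-library Python sketch; the port, paths, and the 20-second drain delay are assumptions, and a real service would normally use its web framework's shutdown hooks instead:

```python
"""Sketch: fail the readiness probe on SIGTERM, keep serving until traffic drains."""
import signal
import threading
import time
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

shutting_down = threading.Event()

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/health/ready":
            # Report unready once SIGTERM arrives so the pod is removed from the Service.
            self.send_response(503 if shutting_down.is_set() else 200)
            self.end_headers()
        else:
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b"hello\n")

server = ThreadingHTTPServer(("0.0.0.0", 8080), Handler)

def handle_sigterm(signum, frame):
    shutting_down.set()
    # Give load balancers / kube-proxy time to stop routing new traffic (comparable to a
    # "sleep 20" preStop hook), then stop accepting connections and exit.
    threading.Thread(target=lambda: (time.sleep(20), server.shutdown())).start()

signal.signal(signal.SIGTERM, handle_sigterm)
server.serve_forever()
```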
Monitoring
- Metrics for the Four Golden Signals are collected
- Application metrics are collected (e.g. via Prometheus scraping; see the sketch after this list)
- Backing data store (e.g. PostgreSQL database) is monitored
- SLOs are defined
- Monitoring dashboards (e.g. Grafana) exist (could be automatically set up)
- Alerting rules are defined based on impact, not potential causes
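As an example for the first two items, here is a sketch that exposes request counts and latencies with the Prometheus Python client; metric names, labels, and the port are assumptions:

```python
"""Sketch: expose golden-signal style metrics for Prometheus to scrape."""
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter("http_requests_total", "Total HTTP requests", ["path", "status"])
LATENCY = Histogram("http_request_duration_seconds", "Request latency in seconds", ["path"])

def handle_request(path: str) -> None:
    with LATENCY.labels(path=path).time():       # latency
        time.sleep(random.uniform(0.01, 0.1))    # placeholder for real work
    REQUESTS.labels(path=path, status="200").inc()  # traffic and error rate by status code

if __name__ == "__main__":
    start_http_server(9100)  # serves /metrics for Prometheus
    while True:
        handle_request("/api/orders")
```

Dashboards and impact-based alerting rules (e.g. on error rate and latency SLOs) would then be built on top of these metrics.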
Testing
- Breaking points were tested (system/chaos test)
- A load test reflecting the expected traffic pattern was performed (see the sketch after this list)
- Backup and restore of the data store (e.g. PostgreSQL database) was tested
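A load test that mimics the expected traffic mix can be sketched with Locust (one possible tool, not prescribed by this checklist); the routes, payload, and 9:1 read/write ratio are assumptions:

```python
"""Sketch of a Locust load test approximating the expected traffic pattern."""
from locust import HttpUser, between, task

class ShopUser(HttpUser):
    # Think time between requests, tuned to resemble real user behaviour.
    wait_time = between(1, 3)

    @task(9)
    def browse_orders(self):
        # Assume roughly 90% of production traffic is reads.
        self.client.get("/api/orders?limit=10")

    @task(1)
    def create_order(self):
        self.client.post("/api/orders", json={"item": "sku-123", "quantity": 1})
```

Run it against a production-like environment, e.g. locust -f loadtest.py --host https://staging.example.com, and compare the results with the configured resource requests and autoscaling settings.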
24/7 On-Call
- All relevant 24/7 personnel are informed about the go-live (e.g. other teams, SREs, or other roles like incident commanders)
- 24/7 on-call team has sufficient knowledge about the application and business context
- 24/7 on-call team has necessary production access (e.g. kubectl, kube-web-view, application logs)
- 24/7 on-call team has expertise to troubleshoot production issues with the tech stack (e.g. JVM)
- 24/7 on-call team is trained and confident to perform standard operations (scale up, rollback, etc.)
- Runbooks are defined for application-specific incident handling
- Runbooks for overload scenarios have pre-approved business decisions (e.g. what customer feature to disable to reduce load)
- Monitoring alerts to page the 24/7 on-call team are set up
- Automatic escalation rules are in place (e.g. page next level after 10 minutes without acknowledgement)
- Process for conducting postmortems and disseminating incident learnings exists
- Regular application/operational reviews are conducted (e.g. looking at SLO breaches)