This is the exact checklist and decision log I use to upgrade production EKS with zero customer impact. It covers control plane and node groups, disruption controls, traffic management, observability, and rollbacks.
Objectives
- Keep SLOs and error budgets intact during upgrades
- Zero downtime perceived by end users
- Reversible at every stage
- Auditable and repeatable via IaC + CI
Pre‑flight: What to Check Before You Touch Anything
- Kubernetes Version Skew
- Control plane N, nodes N or N−1. Kubelet must not be newer than API server.
- Core add‑ons (VPC CNI, CoreDNS, kube-proxy) must be compatible with target.
- Supported Targets
- EKS upgrades the control plane one minor version at a time. Starting from N−2 means two sequential hops, each tested; if you're further behind, plan and validate every hop.
- Cluster Health Baseline
- All nodes Ready, PDB violations = 0, no CrashLoopBackOff, API latency within SLO (see the baseline sweep sketch after this list).
- Save current metrics and dashboards for an A/B diff after the upgrade.
- Capacity Headroom
- Ensure 20–30% spare to absorb surges during drains/rollouts.
- Disruption Controls
- PodDisruptionBudgets (PDB) exist for critical apps.
- Deployments use RollingUpdate with maxSurge (>=1 or >=25%) and maxUnavailable=0 where strict zero downtime is required.
- Maintenance Window & Comms
- Stakeholders notified. Feature freezes where needed. Incident channel ready.
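A minimal baseline sweep can be scripted with plain kubectl; this is a sketch only, and the SLO/latency comparison still comes from your dashboards:
# Sketch: quick health sweep before touching anything (kubectl context must point at the cluster)
kubectl get nodes --no-headers | awk '$2 != "Ready" {print "NotReady:", $1}'
# Any pods crash-looping anywhere?
kubectl get pods -A --no-headers | grep CrashLoopBackOff || echo "no crashloops"
# API server readiness detail
kubectl get --raw='/readyz?verbose'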
Architecture Choices for Zero Downtime
- Blue/Green Node Groups (Recommended)
- Create a new managed node group (or Karpenter provisioner) on target AMI + kubelet.
- Shift workloads gradually with cordon+drain or node taints (see the taint sketch after this list).
- Tear down old after verification. Rollback = shift traffic back.
- In‑Place Rolling (Works, less flexible)
- Update Launch Template to new AMI, set max surge, replace nodes gradually.
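For the blue/green shift via taints mentioned above, a minimal sketch; the node group name ng-blue and the taint key are assumptions, and EKS managed node groups label their nodes with eks.amazonaws.com/nodegroup:
# Stop new pods from scheduling onto the old group without evicting anything yet
kubectl taint nodes -l eks.amazonaws.com/nodegroup=ng-blue upgrade=pending:NoSchedule
# Running pods move later, node by node, via cordon+drain (Step 2)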
Step 1: Control Plane Upgrade
- Read compatibility matrix (EKS release notes).
- Bump the EKS control plane to the target version via console/CLI/IaC (Terraform aws_eks_cluster.version); a CLI sketch follows this step.
- Wait for Active; verify API stability:
kubectl version
kubectl get --raw='/readyz?verbose'
- Update core add‑ons (managed add-ons preferred):
- VPC CNI, CoreDNS, kube-proxy to versions compatible with target.
Control plane is HA; this step is normally hitless.
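A minimal sketch of this step with the AWS CLI; cluster name, target version, and add-on version are placeholders (the same version bump can live in Terraform as aws_eks_cluster.version):
# Request the control plane upgrade and wait for the cluster to return to ACTIVE
aws eks update-cluster-version --name prod-cluster --kubernetes-version 1.30
aws eks wait cluster-active --name prod-cluster
# Check which add-on builds are compatible with the target, then upgrade them
aws eks describe-addon-versions --kubernetes-version 1.30 --addon-name coredns
aws eks update-addon --cluster-name prod-cluster --addon-name coredns --addon-version <compatible-version> --resolve-conflicts PRESERVE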
Step 2: Data Plane Strategy (Blue/Green)
- Create a new node group ng-green with the target AMI (Bottlerocket/AL2) and the same labels/taints as the current ng-blue.
- Scale up the new group to baseline capacity + surge buffer.
- Cordon and drain old nodes incrementally (a loop sketch follows this step):
kubectl cordon <node>
kubectl drain <node> --ignore-daemonsets --delete-emptydir-data --grace-period=60 --timeout=10m
- Observe rollouts
- PDB respected, HPA targets stable, SLOs intact.
- Traffic management
- For Ingress (NLB/ALB/Ingress-Nginx): ensure enough replicas across zones before draining.
- For gRPC/HTTP apps: verify readiness gates and preStop hooks (sleep 5–15s) to drain connections.
- After workload steady state on green, scale old group to 0 and delete.
Rollback: re‑scale old group or revert Launch Template and re‑drain in reverse.
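A sketch of the incremental drain; the pause length is an assumption to let rollouts settle, and ng-blue refers to the eks.amazonaws.com/nodegroup label that EKS puts on managed nodes:
# Drain the blue group one node at a time, honoring PDBs via the eviction API
for node in $(kubectl get nodes -l eks.amazonaws.com/nodegroup=ng-blue -o name); do
  kubectl cordon "$node"
  kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data --grace-period=60 --timeout=10m
  # Let rollouts, HPA, and SLO dashboards settle before the next node
  sleep 120
done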
Step 3: Add‑ons & CRDs
- CNI (and its metrics helper): upgrade first, alongside the control plane step.
- Observability Stack: Prometheus Operator, Grafana, OpenTelemetry Collector – upgrade charts with compatibility pinning.
- Service Mesh (if any): Follow mesh‑specific skew and drain guides (Linkerd/Istio).
- CRDs: Apply new CRDs before updating controllers.
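A hedged sketch of that ordering with Helm; the chart, release, namespace, and CRD path are placeholders:
# Apply the new CRDs first so the upgraded controller finds the schema it expects
kubectl apply --server-side -f charts/my-operator/crds/
# Then upgrade the controller itself, pinned to an explicit chart version
helm upgrade my-operator example-repo/my-operator --version 1.2.3 -n operators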
Disruption & Rollout Controls
- PodDisruptionBudgets example:
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb
spec:
  minAvailable: "80%"
  selector:
    matchLabels:
      app: web
- Deployment strategy example:
strategy:
  type: RollingUpdate
  rollingUpdate:
    maxUnavailable: 0
    maxSurge: "25%"
- Graceful termination in pod spec:
spec:
  terminationGracePeriodSeconds: 30   # pod-level
  containers:
    - name: web
      lifecycle:                      # container-level preStop hook
        preStop:
          exec:
            command: ["/bin/sh", "-c", "sleep 10"]
- Node drain safety
- Respect PDBs; leave --disable-eviction=false (the default) so drains use the eviction API and honor PDBs.
- Exempt DaemonSets; ensure they tolerate both old and new nodes.
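Before each drain wave it is worth confirming that no PDB is already at zero allowed disruptions, since such a PDB will stall the drain; a small sketch:
# PDBs with ALLOWED DISRUPTIONS = 0 will block eviction-based drains
kubectl get pdb -A
kubectl get pdb -A -o jsonpath='{range .items[?(@.status.disruptionsAllowed==0)]}{.metadata.namespace}/{.metadata.name}{"\n"}{end}'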
Observability & Validation
- Before snapshot
- Error rates, p95/p99 latency, 5xx/4xx, Kafka lag, queue depth, HPA metrics.
- During
- Watch kube_events_total, container_ready, workqueue_depth, and Ingress/NLB target health.
- After
- Compare baseline; run smoke tests for core use cases.
Example smoke test (GitHub Actions):
- name: Smoke test homepage
  run: |
    code=$(curl -s -o /dev/null -w "%{http_code}" https://example.com/health)
    if [ "$code" != "200" ]; then exit 1; fi
Rollback Plans
- Control plane: EKS does not support automatic downgrade; rollback = restore cluster from backup, or keep a parallel cluster for blue/green cluster cutover.
- Data plane: Scale old node group back up and drain in reverse.
- Add‑ons: Reapply previous chart versions.
Always keep infra code versioned (Git) and artifact versions pinned for atomic revert.
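A data-plane rollback sketch with eksctl; cluster name, group names, and the node count are placeholders:
# Bring the old (blue) group back to capacity
eksctl scale nodegroup --cluster prod-cluster --name ng-blue --nodes 6
# Then cordon the green nodes and drain them in reverse, exactly as in Step 2
kubectl cordon --selector=eks.amazonaws.com/nodegroup=ng-green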
Automation with Terraform & CI
- Terraform modules
- Pin aws_eks_cluster.version; keep a separate module for managed node groups with a distinct Launch Template/AMI.
- Pipelines
- Stages: plan → approvals → control plane → add‑ons → node groups → app canaries → full drain → post‑checks.
- GitOps Add‑on Upgrades
- Use Argo CD/Flux with helm chart pinning; upgrade CRDs first.
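A hedged Argo CD Application sketch showing chart pinning; repo URL, chart, namespaces, and the pinned version are placeholders:
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: kube-prometheus-stack
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://prometheus-community.github.io/helm-charts
    chart: kube-prometheus-stack
    targetRevision: 58.1.0   # example pin; bump deliberately as part of the upgrade PR
  destination:
    server: https://kubernetes.default.svc
    namespace: monitoring
  syncPolicy:
    syncOptions:
      - ServerSideApply=true   # helps with very large CRDs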
Mermaid overview:
flowchart TD
A(Plan & Health Baseline) --> B(Control Plane Upgrade)
B --> C(Add-on Upgrades)
C --> D(New Node Group - Green)
D --> E(Cordon & Drain Old - Blue)
E --> F(Observability & Smoke Tests)
F --> G(Delete Old Group)
F -->|Rollback| D
Edge Cases & Gotchas
- Jobs & CronJobs: ensure backoff limits and concurrency policies don’t spike during drains.
- StatefulSets: set PDB carefully; shard awareness for DB/brokers; consider maintenance mode per shard.
- Local emptyDir: contents are removed when draining with --delete-emptydir-data; verify app behavior.
- Privileged pods / DaemonSets: CNI, CSI, and log agents must support both old and new kernels.
- Version Skew with Ingress Controllers: upgrade controller image and CRDs before draining nodes.
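Two small sketches for the StatefulSet and CronJob gotchas above; all names and the image are placeholders:
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: kafka-pdb
spec:
  maxUnavailable: 1            # never take more than one broker at a time
  selector:
    matchLabels:
      app: kafka
---
apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-report
spec:
  schedule: "0 2 * * *"
  concurrencyPolicy: Forbid    # avoid overlapping runs if a drain delays a job
  jobTemplate:
    spec:
      backoffLimit: 2
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: report
              image: example/report:1.0.0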
Final Checklist
- Control plane upgraded and stable
- Managed add‑ons on compatible versions
- New node group (green) created and receiving traffic
- Old group drained without PDB violations
- SLOs unchanged pre vs post
- Runbook updated; tickets closed with artifacts and dashboards
TL;DR Runbook
- Bump control plane → wait green
- Upgrade VPC CNI, CoreDNS, kube-proxy
- Create green node group on target AMI
- Cordon+drain blue nodes gradually (respect PDB)
- Verify metrics, run smoke tests
- Delete blue group
- Post‑checks, document, tag infra