    EKS In‑Place Upgrades with Zero Downtime: A Senior Engineer’s Playbook

    October 1, 2025
    Tags: kubernetes, eks, devops, sre, platform-engineering, zero-downtime, upgrade

    This is the exact checklist and decision log I use to upgrade production EKS with zero customer impact. It covers control plane and node groups, disruption controls, traffic management, observability, and rollbacks.

    Objectives

    • Keep SLOs and error budgets intact during upgrades
    • Zero downtime perceived by end users
    • Reversible at every stage
    • Auditable and repeatable via IaC + CI

    Pre‑flight: What to Check Before You Touch Anything

    • Kubernetes Version Skew
      • Control plane N, nodes N or N−1. Kubelet must not be newer than API server.
      • Core add‑ons (VPC CNI, CoreDNS, kube-proxy) must be compatible with target.
    • Supported Targets
      • EKS moves the control plane one minor version per upgrade, so staying within N−1 or N−2 of the target keeps the path short. If you’re on N−3 or older, plan multiple hops and test each one.
    • Cluster Health Baseline
      • All Ready nodes, PDB violations = 0, no CrashLoopBackOff, API latency < SLO.
      • Save current metrics and dashboards for A/B diff after upgrade.
    • Capacity Headroom
      • Ensure 20–30% spare to absorb surges during drains/rollouts.
    • Disruption Controls
      • PodDisruptionBudgets (PDB) exist for critical apps.
      • Deployments use RollingUpdate with surge (>=1 or >=25%) and maxUnavailable=0 for strict zero-downtime where applicable.
    • Maintenance Window & Comms
      • Stakeholders notified. Feature freezes where needed. Incident channel ready.
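
    A minimal shell sketch of the baseline checks above (adapt selectors and thresholds to your environment; empty output from the first two commands is what you want):

    # Nodes that are not Ready (or are cordoned)
    kubectl get nodes --no-headers | awk '$2 != "Ready"'
    # Pods stuck in CrashLoopBackOff anywhere in the cluster
    kubectl get pods -A --no-headers | grep -i crashloop
    # PDBs with ALLOWED DISRUPTIONS of 0 will block drains later
    kubectl get pdb -A
    # Record current versions for the skew check
    kubectl version
    kubectl get nodes -o custom-columns=NAME:.metadata.name,KUBELET:.status.nodeInfo.kubeletVersion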

    Architecture Choices for Zero Downtime

    • Blue/Green Node Groups (Recommended)
      • Create a new managed node group (or Karpenter provisioner) on target AMI + kubelet.
      • Shift workloads gradually with cordon+drain or node taints.
      • Tear down old after verification. Rollback = shift traffic back.
    • In‑Place Rolling (Works, less flexible)
      • Update Launch Template to new AMI, set max surge, replace nodes gradually.

    Step 1: Control Plane Upgrade

    1. Read compatibility matrix (EKS release notes).
    2. Bump EKS control plane to target in the console/CLI/IaC (Terraform aws_eks_cluster.version).
    3. Wait for Active; verify API stability:
      • kubectl version (the --short flag was removed in recent kubectl releases)
      • kubectl get --raw='/readyz?verbose'
    4. Update core add‑ons (managed add-ons preferred):
      • VPC CNI, CoreDNS, kube-proxy to versions compatible with target.

    Control plane is HA; this step is normally hitless.
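
    The same step as a hedged AWS CLI sketch (cluster name and versions are illustrative; the equivalent change can be made in Terraform):

    CLUSTER=my-cluster
    TARGET=1.30   # EKS upgrades the control plane one minor version per hop

    aws eks update-cluster-version --name "$CLUSTER" --kubernetes-version "$TARGET"
    aws eks wait cluster-active --name "$CLUSTER"
    kubectl get --raw='/readyz?verbose'

    # Pick add-on versions compatible with the target, then update managed add-ons
    aws eks describe-addon-versions --addon-name coredns --kubernetes-version "$TARGET" \
      --query 'addons[].addonVersions[].addonVersion' --output text
    aws eks update-addon --cluster-name "$CLUSTER" --addon-name vpc-cni    --addon-version <compatible-version>
    aws eks update-addon --cluster-name "$CLUSTER" --addon-name coredns    --addon-version <compatible-version>
    aws eks update-addon --cluster-name "$CLUSTER" --addon-name kube-proxy --addon-version <compatible-version>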


    Step 2: Data Plane Strategy (Blue/Green)

    1. Create new node group ng-green with target AMI (Bottlerocket/AL2) and same labels/taints as current ng-blue.
    2. Scale up new group to baseline capacity + surge buffer.
    3. Cordon and drain old nodes incrementally:
      • kubectl cordon <node>
      • kubectl drain <node> --ignore-daemonsets --delete-emptydir-data --grace-period=60 --timeout=10m
    4. Observe rollouts
      • PDB respected, HPA targets stable, SLOs intact.
    5. Traffic management
      • For Ingress (NLB/ALB/Ingress-Nginx): ensure enough replicas across zones before draining.
      • For gRPC/HTTP apps: verify readiness gates and preStop hooks (sleep 5–15s) to drain connections.
    6. After workload steady state on green, scale old group to 0 and delete.

    Rollback: re‑scale old group or revert Launch Template and re‑drain in reverse.
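
    A sketch of the blue/green move with eksctl and kubectl (cluster name, node group names, sizes, and AMI family are placeholders; a Karpenter NodePool on the target AMI achieves the same effect):

    # 1. Create the green node group on the target AMI/kubelet
    eksctl create nodegroup --cluster my-cluster --name ng-green \
      --node-ami-family Bottlerocket --nodes 6 --nodes-min 6 --nodes-max 10

    # 2. Cordon all blue nodes so new pods schedule onto green
    kubectl cordon -l eks.amazonaws.com/nodegroup=ng-blue

    # 3. Drain blue nodes one at a time, letting PDBs throttle evictions
    for node in $(kubectl get nodes -l eks.amazonaws.com/nodegroup=ng-blue \
        -o jsonpath='{.items[*].metadata.name}'); do
      kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data \
        --grace-period=60 --timeout=10m
      sleep 120   # pause and check SLO dashboards between nodes
    done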


    Step 3: Add‑ons & CRDs

    • VPC CNI (and cni-metrics-helper, if you run it): upgrade first, alongside the control plane.
    • Observability Stack: Prometheus Operator, Grafana, OpenTelemetry Collector – upgrade charts with compatibility pinning.
    • Service Mesh (if any): Follow mesh‑specific skew and drain guides (Linkerd/Istio).
    • CRDs: Apply new CRDs before updating controllers.
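
    A hedged sketch of the CRDs-before-controllers rule with Helm (the ingress-nginx chart and the crds/ path are illustrative; some charts ship CRDs separately or skip them on upgrade):

    # Apply the new CRDs first
    kubectl apply --server-side -f crds/

    # Then upgrade the controller with the chart version pinned
    helm upgrade ingress-nginx ingress-nginx/ingress-nginx \
      --namespace ingress-nginx --version <pinned-chart-version> \
      --reuse-values --atomic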

    Disruption & Rollout Controls

    • PodDisruptionBudgets example:
    apiVersion: policy/v1
    kind: PodDisruptionBudget
    metadata:
      name: web-pdb
    spec:
      minAvailable: 80%
      selector:
        matchLabels:
          app: web
    
    • Deployment strategy example:
    strategy:
      type: RollingUpdate
      rollingUpdate:
        maxUnavailable: 0
        maxSurge: 25%
    
    • Graceful termination (terminationGracePeriodSeconds sits at the pod level; the preStop hook belongs on the container):
    terminationGracePeriodSeconds: 30
    containers:
    - name: web
      lifecycle:
        preStop:
          exec:
            command: ["/bin/sh", "-c", "sleep 10"]
    
    • Node drain safety
      • Respect PDBs; leave --disable-eviction at its default (false) so drains go through the Eviction API and honor PDBs.
      • Exempt DaemonSets; ensure they tolerate both old/new nodes.
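
    The rollout and disruption settings above can be audited up front; a hedged sketch assuming jq is installed (maxUnavailable may be a percentage string, which is also flagged):

    # Deployments whose strategy permits unavailable pods during drains
    kubectl get deploy -A -o json | jq -r \
      '.items[] | select(.spec.strategy.rollingUpdate.maxUnavailable != 0)
       | "\(.metadata.namespace)/\(.metadata.name): maxUnavailable=\(.spec.strategy.rollingUpdate.maxUnavailable)"'

    # PDBs that currently allow zero disruptions (these will stall kubectl drain)
    kubectl get pdb -A -o json | jq -r \
      '.items[] | select(.status.disruptionsAllowed == 0)
       | "\(.metadata.namespace)/\(.metadata.name)"'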

    Observability & Validation

    • Before snapshot
      • Error rates, p95/p99 latency, 5xx/4xx, Kafka lag, queue depth, HPA metrics.
    • During
      • Watch kube_events_total, container_ready, workqueue_depth, Ingress/NLB target health.
    • After
      • Compare baseline; run smoke tests for core use cases.

    Example smoke test (GitHub Actions):

    - name: Smoke test homepage
      run: |
        code=$(curl -s -o /dev/null -w "%{http_code}" https://example.com/health)
        if [ "$code" != "200" ]; then exit 1; fi
    

    Rollback Plans

    • Control plane: EKS does not support downgrading a cluster; rollback means restoring from backup or cutting over to a parallel (blue/green) cluster kept on the previous version.
    • Data plane: Scale old node group back up and drain in reverse.
    • Add‑ons: Reapply previous chart versions.

    Always keep infra code versioned (Git) and artifact versions pinned for atomic revert.
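
    A sketch of the data plane rollback, assuming managed node groups named ng-blue/ng-green and illustrative sizes:

    # Bring the old (blue) node group back to capacity
    aws eks update-nodegroup-config --cluster-name my-cluster --nodegroup-name ng-blue \
      --scaling-config minSize=3,maxSize=10,desiredSize=6
    aws eks wait nodegroup-active --cluster-name my-cluster --nodegroup-name ng-blue

    # Then cordon green and drain it back onto blue, exactly as in Step 2 but in reverse
    kubectl cordon -l eks.amazonaws.com/nodegroup=ng-green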


    Automation with Terraform & CI

    • Terraform modules
      • Pin aws_eks_cluster.version, separate module for managed node groups with distinct LT/AMI.
    • Pipelines
      • Stages: plan → approvals → control plane → add‑ons → node groups → app canaries → full drain → post‑checks.
    • GitOps Add‑on Upgrades
      • Use Argo CD/Flux with helm chart pinning; upgrade CRDs first.

    Mermaid overview:

    flowchart TD
      A(Plan & Health Baseline) --> B(Control Plane Upgrade)
      B --> C(Add-on Upgrades)
      C --> D(New Node Group - Green)
      D --> E(Cordon & Drain Old - Blue)
      E --> F(Observability & Smoke Tests)
      F --> G(Delete Old Group)
      F -->|Rollback| D
    

    Edge Cases & Gotchas

    • Jobs & CronJobs: ensure backoff limits and concurrency policies don’t spike during drains.
    • StatefulSets: set PDB carefully; shard awareness for DB/brokers; consider maintenance mode per shard.
    • Local emptyDir: drained with --delete-emptydir-data; verify app behavior.
    • Privileged/DaemonSets: CNI, CSI, log agents must support both old/new kernels.
    • Version Skew with Ingress Controllers: upgrade controller image and CRDs before draining nodes.
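
    One concrete check for the emptyDir gotcha above: list pods that mount emptyDir volumes before draining, since --delete-emptydir-data discards that data (a sketch assuming jq is available):

    kubectl get pods -A -o json | jq -r \
      '.items[] | select(any(.spec.volumes[]?; .emptyDir != null))
       | "\(.metadata.namespace)/\(.metadata.name)"'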

    Final Checklist

    • Control plane upgraded and stable
    • Managed add‑ons on compatible versions
    • New node group (green) created and receiving traffic
    • Old group drained without PDB violations
    • SLOs unchanged pre vs post
    • Runbook updated; tickets closed with artifacts and dashboards

    TL;DR Runbook

    1. Bump control plane → wait green
    2. Upgrade VPC CNI, CoreDNS, kube-proxy
    3. Create green node group on target AMI
    4. Cordon+drain blue nodes gradually (respect PDB)
    5. Verify metrics, run smoke tests
    6. Delete blue group
    7. Post‑checks, document, tag infra