This is the exact checklist and decision log I use to upgrade production EKS with zero customer impact. It covers control plane and node groups, disruption controls, traffic management, observability, and rollbacks.
Objectives
- Keep SLOs and error budgets intact during upgrades
- Zero downtime perceived by end users
- Reversible at every stage
- Auditable and repeatable via IaC + CI
Pre‑flight: What to Check Before You Touch Anything
- Kubernetes Version Skew
- Control plane N, nodes N or N−1. Kubelet must not be newer than API server.
- Core add‑ons (VPC CNI, CoreDNS, kube-proxy) must be compatible with target.
- Supported Targets
- EKS upgrades the control plane one minor version at a time. Starting from N−2 means two sequential hops, each tested; if you're further behind, plan and validate every hop.
- Cluster Health Baseline
- All nodes Ready, PDB violations = 0, no CrashLoopBackOff, API latency within SLO (see the baseline sweep sketch after this list).
- Save current metrics and dashboards for an A/B diff after the upgrade.
- Capacity Headroom
- Ensure 20–30% spare to absorb surges during drains/rollouts.
- Disruption Controls
- PodDisruptionBudgets (PDB) exist for critical apps.
- Deployments use RollingUpdate with maxSurge (>=1 or >=25%) and maxUnavailable=0 where strict zero downtime is required.
- Maintenance Window & Comms
- Stakeholders notified. Feature freezes where needed. Incident channel ready.
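A minimal baseline sweep can be scripted with plain kubectl; this is a sketch only, and the SLO/latency comparison still comes from your dashboards:
# Sketch: quick health sweep before touching anything (kubectl context must point at the cluster)
kubectl get nodes --no-headers | awk '$2 != "Ready" {print "NotReady:", $1}'
# Any pods crash-looping anywhere?
kubectl get pods -A --no-headers | grep CrashLoopBackOff || echo "no crashloops"
# API server readiness detail
kubectl get --raw='/readyz?verbose'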
Architecture Choices for Zero Downtime
- Blue/Green Node Groups (Recommended)
- Create a new managed node group (or Karpenter provisioner) on target AMI + kubelet.
- Shift workloads gradually with cordon+drain or node taints (see the taint sketch after this list).
- Tear down old after verification. Rollback = shift traffic back.
- In‑Place Rolling (Works, less flexible)
- Update Launch Template to new AMI, set max surge, replace nodes gradually.
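For the blue/green shift via taints mentioned above, a minimal sketch; the node group name ng-blue and the taint key are assumptions, and EKS managed node groups label their nodes with eks.amazonaws.com/nodegroup:
# Stop new pods from scheduling onto the old group without evicting anything yet
kubectl taint nodes -l eks.amazonaws.com/nodegroup=ng-blue upgrade=pending:NoSchedule
# Running pods move later, node by node, via cordon+drain (Step 2)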
Step 1: Control Plane Upgrade
- Read compatibility matrix (EKS release notes).
- Bump the EKS control plane to the target version via console/CLI/IaC (Terraform aws_eks_cluster.version); a CLI sketch follows this step.
- Wait for Active; verify API stability:
kubectl version
kubectl get --raw='/readyz?verbose'
- Update core add‑ons (managed add-ons preferred):
- VPC CNI, CoreDNS, kube-proxy to versions compatible with target.
Control plane is HA; this step is normally hitless.
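A minimal sketch of this step with the AWS CLI; cluster name, target version, and add-on version are placeholders (the same version bump can live in Terraform as aws_eks_cluster.version):
# Request the control plane upgrade and wait for the cluster to return to ACTIVE
aws eks update-cluster-version --name prod-cluster --kubernetes-version 1.30
aws eks wait cluster-active --name prod-cluster
# Check which add-on builds are compatible with the target, then upgrade them
aws eks describe-addon-versions --kubernetes-version 1.30 --addon-name coredns
aws eks update-addon --cluster-name prod-cluster --addon-name coredns --addon-version <compatible-version> --resolve-conflicts PRESERVE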
Step 2: Data Plane Strategy (Blue/Green)
- Create a new node group ng-green with the target AMI (Bottlerocket/AL2) and the same labels/taints as the current ng-blue.
- Scale up the new group to baseline capacity + surge buffer.
- Cordon and drain old nodes incrementally (a loop sketch follows this step):
kubectl cordon <node>
kubectl drain <node> --ignore-daemonsets --delete-emptydir-data --grace-period=60 --timeout=10m
- Observe rollouts
- PDB respected, HPA targets stable, SLOs intact.
- Traffic management
- For Ingress (NLB/ALB/Ingress-Nginx): ensure enough replicas across zones before draining.
- For gRPC/HTTP apps: verify readiness gates and preStop hooks (sleep 5–15s) to drain connections.
- After workload steady state on green, scale old group to 0 and delete.
Rollback: re‑scale old group or revert Launch Template and re‑drain in reverse.
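A sketch of the incremental drain; the pause length is an assumption to let rollouts settle, and ng-blue refers to the eks.amazonaws.com/nodegroup label that EKS puts on managed nodes:
# Drain the blue group one node at a time, honoring PDBs via the eviction API
for node in $(kubectl get nodes -l eks.amazonaws.com/nodegroup=ng-blue -o name); do
  kubectl cordon "$node"
  kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data --grace-period=60 --timeout=10m
  # Let rollouts, HPA, and SLO dashboards settle before the next node
  sleep 120
done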
Step 3: Add‑ons & CRDs
- CNI (and its metrics helper): upgrade first, alongside the control plane step.
- Observability Stack: Prometheus Operator, Grafana, OpenTelemetry Collector – upgrade charts with compatibility pinning.
- Service Mesh (if any): Follow mesh‑specific skew and drain guides (Linkerd/Istio).
- CRDs: Apply new CRDs before updating controllers.
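A hedged sketch of that ordering with Helm; the chart, release, namespace, and CRD path are placeholders:
# Apply the new CRDs first so the upgraded controller finds the schema it expects
kubectl apply --server-side -f charts/my-operator/crds/
# Then upgrade the controller itself, pinned to an explicit chart version
helm upgrade my-operator example-repo/my-operator --version 1.2.3 -n operators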
Disruption & Rollout Controls
- PodDisruptionBudgets example:
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb
spec:
  minAvailable: "80%"
  selector:
    matchLabels:
      app: web
- Deployment strategy example:
strategy:
  type: RollingUpdate
  rollingUpdate:
    maxUnavailable: 0
    maxSurge: "25%"
- Graceful termination in pod spec:
spec:
  terminationGracePeriodSeconds: 30   # pod-level
  containers:
    - name: web
      lifecycle:                      # container-level preStop hook
        preStop:
          exec:
            command: ["/bin/sh", "-c", "sleep 10"]
- Node drain safety
- Respect PDBs; leave --disable-eviction=false (the default) so drains use the eviction API and honor PDBs.
- Exempt DaemonSets; ensure they tolerate both old and new nodes.
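Before each drain wave it is worth confirming that no PDB is already at zero allowed disruptions, since such a PDB will stall the drain; a small sketch:
# PDBs with ALLOWED DISRUPTIONS = 0 will block eviction-based drains
kubectl get pdb -A
kubectl get pdb -A -o jsonpath='{range .items[?(@.status.disruptionsAllowed==0)]}{.metadata.namespace}/{.metadata.name}{"\n"}{end}'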
Observability & Validation
- Before snapshot
- Error rates, p95/p99 latency, 5xx/4xx, Kafka lag, queue depth, HPA metrics.
- During
- Watch kube_events_total, container_ready, workqueue_depth, and Ingress/NLB target health.
- After
- Compare baseline; run smoke tests for core use cases.
Example smoke test (GitHub Actions):
- name: Smoke test homepage
  run: |
    code=$(curl -s -o /dev/null -w "%{http_code}" https://example.com/health)
    if [ "$code" != "200" ]; then exit 1; fi
Rollback Plans
- Control plane: EKS does not support automatic downgrade; rollback = restore cluster from backup, or keep a parallel cluster for blue/green cluster cutover.
- Data plane: Scale old node group back up and drain in reverse.
- Add‑ons: Reapply previous chart versions.
Always keep infra code versioned (Git) and artifact versions pinned for atomic revert.
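A data-plane rollback sketch with eksctl; cluster name, group names, and the node count are placeholders:
# Bring the old (blue) group back to capacity
eksctl scale nodegroup --cluster prod-cluster --name ng-blue --nodes 6
# Then cordon the green nodes and drain them in reverse, exactly as in Step 2
kubectl cordon --selector=eks.amazonaws.com/nodegroup=ng-green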
Automation with Terraform & CI
- Terraform modules
- Pin aws_eks_cluster.version; keep a separate module for managed node groups with a distinct Launch Template/AMI.
- Pipelines
- Stages: plan → approvals → control plane → add‑ons → node groups → app canaries → full drain → post‑checks.
- GitOps Add‑on Upgrades
- Use Argo CD/Flux with helm chart pinning; upgrade CRDs first.
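A hedged Argo CD Application sketch showing chart pinning; repo URL, chart, namespaces, and the pinned version are placeholders:
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: kube-prometheus-stack
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://prometheus-community.github.io/helm-charts
    chart: kube-prometheus-stack
    targetRevision: 58.1.0   # example pin; bump deliberately as part of the upgrade PR
  destination:
    server: https://kubernetes.default.svc
    namespace: monitoring
  syncPolicy:
    syncOptions:
      - ServerSideApply=true   # helps with very large CRDs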
Mermaid overview:
flowchart TD
A(Plan & Health Baseline) --> B(Control Plane Upgrade)
B --> C(Add-on Upgrades)
C --> D(New Node Group - Green)
D --> E(Cordon & Drain Old - Blue)
E --> F(Observability & Smoke Tests)
F --> G(Delete Old Group)
F -->|Rollback| D
Edge Cases & Gotchas
- Jobs & CronJobs: ensure backoff limits and concurrency policies don’t spike during drains.
- StatefulSets: set PDB carefully; shard awareness for DB/brokers; consider maintenance mode per shard.
- Local emptyDir: contents are removed when draining with --delete-emptydir-data; verify app behavior.
- Privileged pods / DaemonSets: CNI, CSI, and log agents must support both old and new kernels.
- Version Skew with Ingress Controllers: upgrade controller image and CRDs before draining nodes.
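Two small sketches for the StatefulSet and CronJob gotchas above; all names and the image are placeholders:
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: kafka-pdb
spec:
  maxUnavailable: 1            # never take more than one broker at a time
  selector:
    matchLabels:
      app: kafka
---
apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-report
spec:
  schedule: "0 2 * * *"
  concurrencyPolicy: Forbid    # avoid overlapping runs if a drain delays a job
  jobTemplate:
    spec:
      backoffLimit: 2
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: report
              image: example/report:1.0.0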
Final Checklist
- Control plane upgraded and stable
- Managed add‑ons on compatible versions
- New node group (green) created and receiving traffic
- Old group drained without PDB violations
- SLOs unchanged pre vs post
- Runbook updated; tickets closed with artifacts and dashboards
TL;DR Runbook
- Bump control plane → wait green
- Upgrade VPC CNI, CoreDNS, kube-proxy
- Create green node group on target AMI
- Cordon+drain blue nodes gradually (respect PDB)
- Verify metrics, run smoke tests
- Delete blue group
- Post‑checks, document, tag infra