The Million-Dollar Typo
A developer forgets to set a memory limit on a new service. During a spike in traffic, the service consumes all the memory on its node, causing the node to crash and taking down dozens of other critical services with it. This isn't a hypothetical scenario; it's a real and costly problem in the Kubernetes world.
Humans make mistakes. Relying on wiki pages, manual reviews, and tribal knowledge to enforce best practices is a losing battle. We needed automated, preventative controls—guardrails that make it impossible to do the wrong thing.
The Gatekeeper: How Admission Controllers Work
This is where a Kubernetes admission controller like Kyverno comes in. Kyverno runs as a dynamic admission webhook: the API server forwards each matching create, update, or delete request to it after authentication and authorization, but before the object is persisted. It acts as a gatekeeper, validating requests against a set of policies you define.
Here’s the flow:
sequenceDiagram
    participant Dev as Developer
    participant API as K8s API Server
    participant Kyverno as Kyverno Webhook
    Dev->>API: `kubectl apply -f deployment.yaml`
    API->>Kyverno: Validate Request
    alt Policy Pass
        Kyverno-->>API: Allow
        API-->>Dev: Success
    else Policy Fail
        Kyverno-->>API: Deny (with reason)
        API-->>Dev: Error: 'image using :latest tag is not allowed'
    end
Policy-as-Code in Action
With Kyverno, policies are just simple Kubernetes resources. This means you can store them in Git, version them, and manage them with the same GitOps tools you use for your applications.
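As a rough sketch of what "policies in Git" can look like, the manifest below bundles policy files with Kustomize so that a GitOps tool can sync them like any other application. The `policies/` directory and the choice of Kustomize are assumptions for illustration; the two file names are the ones used by the policy examples in this post.

```yaml
# policies/kustomization.yaml (hypothetical path and layout)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  # Each entry is a plain Kyverno ClusterPolicy manifest stored next to this file.
  - require-requests-and-limits.yaml
  - disallow-latest-tag.yaml
```

Because the policies are ordinary manifests, a change to a guardrail goes through the same pull request and review process as any other cluster change.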
Here are a few of the essential policies we implemented:
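1. Require Resource Requests and Limits: This policy keeps a single resource-hungry application from starving or crashing its node. It rejects any Pod in which a container does not explicitly set CPU and memory requests and limits. As written, the pattern checks spec.containers only, so init containers are not covered.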
# require-requests-and-limits.yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-requests-limits
spec:
  validationFailureAction: enforce
  rules:
    - name: check-for-requests-and-limits
      match:
        resources:
          kinds:
            - Pod
      validate:
        message: "CPU and memory requests and limits are required."
        pattern:
          spec:
            containers:
              - resources:
                  requests:
                    memory: "?*"
                    cpu: "?*"
                  limits:
                    memory: "?*"
                    cpu: "?*"
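To make the pattern concrete, here is a minimal Pod that should pass this policy. The Pod name, image, and resource values are hypothetical placeholders, not figures from our cluster.

```yaml
# compliant-pod.yaml (hypothetical example; names and values are placeholders)
apiVersion: v1
kind: Pod
metadata:
  name: example-api
spec:
  containers:
    - name: app
      image: registry.example.com/example-api:1.4.2
      resources:
        # All four fields must be present and non-empty to satisfy the "?*" patterns.
        requests:
          cpu: "250m"
          memory: "256Mi"
        limits:
          cpu: "500m"
          memory: "512Mi"
```

In a Kyverno pattern, `?` matches exactly one character and `*` matches zero or more, so `"?*"` means "any non-empty value". Removing any one of the four fields above should cause the request to be rejected with the policy's message.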
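2. Disallow the :latest Image Tag: The :latest tag is mutable, so the same manifest can pull different images over time, which makes deployments unpredictable. This policy rejects it and forces developers to pin a specific version tag. Note that an image reference with no tag at all also resolves to :latest, and the pattern below does not match that case.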
# disallow-latest-tag.yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: disallow-latest-tag
spec:
  validationFailureAction: enforce
  rules:
    - name: check-for-latest-tag
      match:
        resources:
          kinds:
            - Pod
      validate:
        message: "Using the ':latest' tag is not allowed. Please use a specific version."
        pattern:
          spec:
            containers:
              - image: "!*:latest"
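One gap worth illustrating: an image reference with no tag (for example `nginx`) is pulled as `:latest` but does not match `*:latest`, so the rule above lets it through. A companion policy along the following lines could close that gap. This is a sketch modeled on the same structure as the policy above; the policy name, rule name, and message are assumptions, and it starts in audit mode so it can be observed before it blocks anything.

```yaml
# require-image-tag.yaml (illustrative companion policy, not part of the original setup)
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-image-tag
spec:
  validationFailureAction: audit
  rules:
    - name: check-for-explicit-tag
      match:
        resources:
          kinds:
            - Pod
      validate:
        message: "An explicit image tag is required."
        pattern:
          spec:
            containers:
              # Every container image must contain a ':' separator.
              - image: "*:*"
```

The `*:*` pattern is a simple wildcard check, so a reference such as `localhost:5000/app` (registry port, no tag) would still satisfy it. Treat it as a first line of defense rather than a complete validation of image references.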
From Audit to Enforcement
A key feature of Kyverno is the ability to deploy policies in audit mode first. With validationFailureAction set to audit, violating requests are still admitted but are recorded in policy reports, so you can see what would have been blocked without actually breaking anyone's workflow. Once you're confident the policy is correct, you can flip validationFailureAction to enforce to activate the guardrail.
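As a sketch of that rollout, the only field that needs to differ between the two phases is `validationFailureAction`. The example reuses the :latest policy from earlier in this post; the `background` setting is an assumption added here to show how existing resources can be included in reports.

```yaml
# disallow-latest-tag.yaml, audit phase (sketch; change "audit" to "enforce" when ready)
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: disallow-latest-tag
spec:
  # audit: admit the request and record the violation instead of rejecting it.
  validationFailureAction: audit
  # Assumption: also scan resources that already exist in the cluster.
  background: true
  rules:
    - name: check-for-latest-tag
      match:
        resources:
          kinds:
            - Pod
      validate:
        message: "Using the ':latest' tag is not allowed. Please use a specific version."
        pattern:
          spec:
            containers:
              - image: "!*:latest"
```

While the policy is in audit mode, violations should show up in Kyverno's policy reports, which can typically be listed with `kubectl get policyreport -A`. Field names and value casing have changed across Kyverno releases (newer versions expect `Audit` and `Enforce`), so check the documentation for the version you run.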
The Impact: A Safer, More Stable Cluster
- Prevented Incidents: We completely eliminated entire classes of production incidents tied to resource starvation and unpredictable image pulls.
- Cost Savings: Enforcing resource limits stopped runaway pods from wasting CPU and memory, leading to a noticeable reduction in our cloud bill.
- Developer Education: The immediate feedback from Kyverno taught developers best practices right at the source, improving the quality of their configurations over time.
By codifying our best practices into automated policies, we built a more resilient, secure, and cost-effective platform.