Kubernetes CrashLoopBackOff: A Root-Cause Playbook

CrashLoopBackOff is one of the most searched Kubernetes errors, and one of the most misunderstood. The first thing to know is that it is not a root cause. It is Kubernetes telling you that a container keeps starting and exiting, and that the kubelet keeps restarting it with an increasing back-off delay between attempts. The crash is the symptom. Your job is to find why the container is exiting.

This playbook walks the five causes behind the large majority of CrashLoopBackOffs, the exact commands to confirm each, and how an autonomous operations platform collapses the whole loop to minutes.

What CrashLoopBackOff actually means

When a pod enters CrashLoopBackOff, the lifecycle looks like this: the container starts, and the process inside exits (cleanly, with an error, or by being killed). The kubelet restarts it per the pod's restart policy, and the container exits again. The kubelet then backs off, waiting longer between each restart (10s, 20s, 40s, and so on, up to a cap of 5 minutes). The status you see is the back-off, not the failure itself.
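To make the back-off schedule concrete, here is a minimal sketch that prints the wait before each restart, assuming the doubling-from-10s, capped-at-5-minutes behavior described above. The function name backoff_schedule is made up for this illustration; it is not a kubectl feature.

```shell
# Print the delay (in seconds) before each of the first N restarts,
# assuming: start at 10s, double after every attempt, cap at 300s.
backoff_schedule() {
  local restarts="$1"
  local delay=10   # initial back-off in seconds
  local cap=300    # 5-minute ceiling
  local i=1
  while [ "$i" -le "$restarts" ]; do
    echo "$delay"
    delay=$((delay * 2))
    if [ "$delay" -gt "$cap" ]; then
      delay="$cap"
    fi
    i=$((i + 1))
  done
}

# Delays before restarts 1 through 7: 10 20 40 80 160 300 300
backoff_schedule 7
```

By the sixth restart the pod is waiting the full five minutes between attempts, which is why a crash-looping pod can look idle for long stretches.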

So the question is never "how do I fix CrashLoopBackOff." It is "why is this container exiting." There are five usual answers.

[Figure: CrashLoopBackOff diagnosis flow]

Step 1: get the evidence

Before guessing, gather the three artifacts that answer almost every CrashLoopBackOff: the pod's events, the previous container's logs, and the exit code.

# what state is the pod in, and why
kubectl describe pod <pod> -n <namespace>

# logs from the container that already crashed
kubectl logs <pod> -n <namespace> --previous

# the exit code is the single biggest clue ([0] reads the first container; adjust the index for multi-container pods)
kubectl get pod <pod> -n <namespace> -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'

Step 2: read the exit code

The exit code points you straight at the category:

Exit code	Usual meaning	First move
`137`	OOMKilled (SIGKILL, out of memory)	Raise memory limit or find the leak
`1` / `2`	Application error on startup	Read `logs --previous`; check config and migrations
`143`	SIGTERM (often a failing liveness probe)	Check probe thresholds and startup time
`127`	Command or binary not found	Check the image, entrypoint, and PATH
`0`	Container exits cleanly then restarts	Check whether the process is meant to finish: if so, run it as a Job; if not, give it a long-running foreground process
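
The table can be turned into a small lookup. The sketch below is one way to do it, with a hypothetical function name (classify_exit_code). It also uses the general convention that an exit code above 128 means the process was killed by signal number (code - 128), which is why 137 is SIGKILL (128 + 9) and 143 is SIGTERM (128 + 15).

```shell
# Map a container exit code to the categories in the table above.
classify_exit_code() {
  local code="$1"
  case "$code" in
    0)   echo "clean exit: nothing keeps the container running" ;;
    1|2) echo "application error on startup" ;;
    127) echo "command or binary not found" ;;
    137) echo "SIGKILL (128 + 9), usually OOMKilled" ;;
    143) echo "SIGTERM (128 + 15), often a failing liveness probe" ;;
    *)
      # Codes above 128 conventionally mean "killed by signal (code - 128)".
      if [ "$code" -gt 128 ]; then
        echo "killed by signal $((code - 128))"
      else
        echo "unclassified: read the previous container's logs"
      fi
      ;;
  esac
}

classify_exit_code 139   # prints "killed by signal 11" (a segmentation fault)
```

Treat the result as a starting hypothesis, not a verdict: the events and the previous container's logs still have to confirm it.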

The five usual root causes

1. OOMKilled (exit 137)

The container exceeded its memory limit and the kernel killed it. `kubectl describe pod` shows `Reason: OOMKilled` under the container's last state. Either the limit is too low for legitimate usage, or the application is leaking. Raising the limit unblocks production; finding the leak prevents the next page.
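
A quick way to confirm this case is to read the termination reason and the configured limit side by side. The sketch below assumes a single-container pod (it reads index 0 in both places). The function name oom_check, the deployment placeholder, and the 512Mi figure are illustrative only.

```shell
# Print why the first container last terminated, and its memory limit.
# Assumes a single-container pod; adjust the [0] indexes otherwise.
oom_check() {
  local pod="$1"
  local ns="$2"
  local reason
  local limit
  reason=$(kubectl get pod "$pod" -n "$ns" \
    -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}')
  limit=$(kubectl get pod "$pod" -n "$ns" \
    -o jsonpath='{.spec.containers[0].resources.limits.memory}')
  echo "last termination reason: ${reason:-unknown}"
  echo "memory limit: ${limit:-none set}"
}

# Usage:
#   oom_check <pod> <namespace>
# If the reason is OOMKilled and the limit is too low for real traffic,
# one way to raise it (512Mi is a placeholder, not a recommendation):
#   kubectl set resources deployment <deployment> -n <namespace> --limits=memory=512Mi
```

If the limit was already generous, raising it again only delays the next kill. That points toward a leak and is worth a look at memory usage over time.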

2. Configuration and secrets

A missing environment variable, a secret that is not mounted, or an RBAC permission the service account does not have. The app reads a required value, finds nothing, and exits immediately. Logs usually say exactly what was missing.
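
When the logs are not explicit, two checks cover most of this category: does the secret exist, and is the service account allowed to do what the app attempts. The sketch below wraps both. The function name config_check and every argument value are hypothetical. Note that the --as flag only works if your own user is permitted to impersonate service accounts.

```shell
# Check that a secret exists and that a service account may perform an action.
# Arguments: namespace, secret name, service account, verb, resource.
config_check() {
  local ns="$1"
  local secret="$2"
  local sa="$3"
  local verb="$4"
  local resource="$5"
  local allowed

  if kubectl get secret "$secret" -n "$ns" >/dev/null 2>&1; then
    echo "secret $secret: present"
  else
    echo "secret $secret: MISSING"
  fi

  # "kubectl auth can-i" prints yes or no; --as impersonates the service account.
  # It exits non-zero on "no", so the failure status is deliberately ignored.
  allowed=$(kubectl auth can-i "$verb" "$resource" -n "$ns" \
    --as="system:serviceaccount:${ns}:${sa}" 2>/dev/null) || true
  echo "$sa can $verb $resource: ${allowed:-unknown}"
}

# Usage:
#   config_check <namespace> <secret> <service-account> get configmaps
```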

3. Liveness probe killing a slow starter

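If the liveness probe starts checking before the app is ready, Kubernetes concludes the container is unhealthy and restarts it. Every restart hits the same slow startup, so the loop never ends. The fix is a correct `startupProbe` (liveness checks are held off until it succeeds) or a more generous `initialDelaySeconds`, not more restarts.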
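
As an illustration, the sketch below writes a patch that adds a startup probe to one container of a Deployment. The helper name write_startup_probe_patch, the /healthz path, port 8080, and the timing values are all assumptions to adapt to your service. With these numbers the container gets periodSeconds x failureThreshold = 10 x 30 = 300 seconds to start before it is restarted.

```shell
# Write a strategic-merge patch that adds a startupProbe to one container.
# Path, port, and timings are placeholders, not recommendations.
write_startup_probe_patch() {
  local file="$1"
  local container="$2"
  cat > "$file" <<EOF
spec:
  template:
    spec:
      containers:
      - name: ${container}
        startupProbe:
          httpGet:
            path: /healthz
            port: 8080
          periodSeconds: 10
          failureThreshold: 30
EOF
}

# Usage:
#   write_startup_probe_patch startup-probe-patch.yaml <container-name>
#   kubectl patch deployment <deployment> -n <namespace> --patch-file startup-probe-patch.yaml
```

The container name in the patch must match the name in the Deployment spec, because the patch merges container entries by name.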

4. Image or dependency problems

A bad image tag, a missing binary in the container, or a hard dependency (a database, a config service) that is itself down. The container cannot complete startup through no fault of its own code.

5. Application bug on startup

An unhandled exception, a failed schema migration, or a panic during boot. This is the only category where the fix is the code, not the platform, and the previous-container logs almost always show the stack trace.

Notice the pattern: four of the five causes are operational, not code bugs. They are exactly the kind of known-shape failures a system can diagnose and remediate without waking a human.

Where this breaks down at scale

This playbook is straightforward for one pod. Now run it across 60 microservices at 3am, where the CrashLoop on payment-service is actually downstream of an OOM on a shared cache, and the engineer on call has never seen this service before. The commands are the same. The cognitive load, the correlation across services, and the tribal knowledge required are what make it a 45-minute incident instead of a 5-minute one.
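
Even by hand, the first correlation step can be scripted. The sketch below lists every crash-looping pod in the cluster and prints events in time order, which helps show whether something like a shared cache failed before the service you were paged for. Both function names are hypothetical. The filter relies on STATUS being the fourth column of `kubectl get pods -A --no-headers` output, and it will miss pods whose status is prefixed (for example, failing init containers).

```shell
# List every pod currently reporting CrashLoopBackOff, across all namespaces.
# With -A and --no-headers the columns are NAMESPACE NAME READY STATUS ...
list_crashlooping() {
  kubectl get pods -A --no-headers |
    awk '$4 == "CrashLoopBackOff" { print $1 "/" $2 }'
}

# Cluster events ordered by creation time, oldest first,
# to see which failure showed up first.
recent_events() {
  kubectl get events -A --sort-by=.metadata.creationTimestamp
}

# Usage:
#   list_crashlooping
#   recent_events | tail -n 50
```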

How autonomous ops resolves it

This is precisely the loop Ops Singularity automates. When a pod enters CrashLoopBackOff, InfraOps and Sentinel pull the events, the previous logs, and the exit code automatically, correlate it with recent deploys and the health of dependencies, and identify the root cause, for example OOMKilled on payment-service with a memory limit set too low after a traffic increase. Where policy allows, ProcBot applies the validated remediation (raise the limit, re-run the rollout), and Sherlock confirms the pod is healthy and stays healthy before the incident is closed. A human is paged only if judgment is genuinely required.

The playbook above is the manual version of what the system does in seconds. Knowing it makes you a better operator. Automating it means you do not have to run it at 3am.

See how InfraOps handles Kubernetes incidents on the products page, or read the glossary for terms like MOP, RCA, and blast radius.

Written by Shiv Chandra Pathak, Solution Architect, Ops Singularity (shpathak.com)