
Why Kubernetes Pods Fail: CrashLoopBackOff, OOMKilled, and ImagePullBackOff Explained

Most Kubernetes pod failures are not mysterious once you separate symptoms from causes. This guide explains CrashLoopBackOff, OOMKilled, ImagePullBackOff, logs, events, and probes through a practical diagnosis workflow.

Kubernetes rarely tells you the root cause directly in the pod status. CrashLoopBackOff, OOMKilled, and ImagePullBackOff are symptoms of different failure points in the pod lifecycle. Treating them as final answers is one of the most common debugging mistakes in production clusters.

The practical goal is not to memorize every status string. The goal is to identify where the failure happens: before the container starts, while the process is running, during health checks, or under resource pressure. Once you know that, the right kubectl commands become obvious.

Pod status is a signal, not a diagnosis

A pod can fail for many reasons, but the investigation usually starts with three questions:

  1. Did Kubernetes manage to pull the image?

  2. Did the container process start and stay alive?

  3. Did the kubelet restart the container after failed health checks, or was the process killed for exceeding its resource limits?

The same application can move through several states during one incident. For example, a bad deployment may first show ImagePullBackOff because the image tag is wrong. After the tag is fixed, the pod may enter CrashLoopBackOff because the app cannot read a required secret. After that, it may be OOMKilled under real traffic because memory limits are too low.

That is why the first step is always to inspect the pod, not to assume the reason from the status column.

kubectl get pods -n production

kubectl describe pod api-7f9d8c6d5b-k2xpl -n production

kubectl get events -n production \
  --sort-by=.lastTimestamp

kubectl get pods gives you the symptom. kubectl describe pod gives you container state, restart count, probe failures, scheduling issues, image pull errors, and recent events. kubectl get events shows what Kubernetes tried to do and what failed.
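The container state that kubectl describe pod prints can also be pulled out as one line per container. The following is a minimal sketch, assuming a POSIX shell; the function name container_states is hypothetical, while the JSONPath fields are part of the standard pod status.

```shell
# container_states NAMESPACE POD
# Prints one tab-separated line per container:
#   name, restart count, waiting reason, last terminated reason, last exit code
# Fields that are not set (for example, no previous termination) print as empty.
container_states() {
  kubectl get pod "$2" -n "$1" -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.restartCount}{"\t"}{.state.waiting.reason}{"\t"}{.lastState.terminated.reason}{"\t"}{.lastState.terminated.exitCode}{"\n"}{end}'
}
```

Calling container_states production api-7f9d8c6d5b-k2xpl puts the waiting reason (the symptom) next to the last termination reason and exit code, which are usually closer to the cause.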

A decision table for common pod failures

| Symptom | Failure point | What to inspect first | Common causes | Production risk |
| --- | --- | --- | --- | --- |
| ImagePullBackOff | Before container start | Events, image name, registry auth | Wrong tag, missing image, expired credentials, private registry access | New rollout cannot start |
| CrashLoopBackOff | After container start | Previous logs, exit code, env, config, probes | App exits, failed boot, missing config, bad migration, failing liveness probe | Repeated restarts, unstable service |
| OOMKilled | Runtime under memory pressure | Last state, memory limits, metrics, traffic pattern | Limit too low, memory leak, large request payloads, unbounded cache | Process killed, data loss risk for in-memory work |
| CreateContainerConfigError | Container configuration | Secrets, ConfigMaps, env references | Missing key, invalid volume mount, bad envFrom source | Pod never starts |
| Pending | Scheduling | Node capacity, taints, affinity, PVCs | Insufficient CPU or memory, unsatisfied node selector, unbound volume | Capacity or placement issue |

The table is useful because it separates lifecycle phases. Debugging an image pull problem with application logs wastes time because the application has not started. Debugging an OOM kill only through readiness probes misses the memory pressure that caused the restart.

CrashLoopBackOff: the app starts, then exits or gets restarted

CrashLoopBackOff means Kubernetes has repeatedly restarted a container and is backing off before trying again. It does not tell you why the container exited. The cause is usually inside the process, its configuration, or the health check behavior.

Start with the current and previous logs:

kubectl logs api-7f9d8c6d5b-k2xpl -n production

kubectl logs api-7f9d8c6d5b-k2xpl -n production --previous

kubectl describe pod api-7f9d8c6d5b-k2xpl -n production

--previous matters because the current container instance may have only just started. The useful error often belongs to the container that already crashed.

Look for:

  • exit codes in Last State

  • missing environment variables

  • failed database or message broker connections

  • application boot errors

  • migration failures at startup

  • permission errors on mounted volumes

  • liveness probe failures

  • command or entrypoint mistakes
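The exit code in Last State is easier to act on once decoded. By the usual Unix convention, a code above 128 means the process was terminated by signal number (code - 128). The helper below is an illustrative sketch; the name explain_exit_code is hypothetical and the wording of its output is not something Kubernetes prints.

```shell
# explain_exit_code CODE
# Interprets a container exit code using the 128 + signal convention.
explain_exit_code() (
  code=$1
  if [ "$code" -eq 0 ]; then
    echo "exited cleanly"
  elif [ "$code" -gt 128 ]; then
    # 137 -> signal 9 (SIGKILL), 143 -> signal 15 (SIGTERM)
    echo "killed by signal $((code - 128))"
  else
    echo "process exited on its own with error code $code"
  fi
)
```

A code of 1 or 2 usually points to the application itself (bad config, failed boot), while 137 or 143 suggests something outside the process stopped it, such as an OOM kill or a restart after failed liveness probes.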

A typical example is an app that starts slowly while the liveness probe expects it to be ready almost immediately.

livenessProbe:
  httpGet:
    path: /health/live
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 10
  failureThreshold: 3

readinessProbe:
  httpGet:
    path: /health/ready
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 5

This may be too aggressive for an application that loads configuration, warms caches, or opens database connections before serving traffic. A failing liveness probe restarts the container. A failing readiness probe only removes the pod from service endpoints.

Readiness protects traffic routing. Liveness kills the process. Mixing them up can turn a slow startup into a restart loop.
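To see how aggressive a liveness probe is, it helps to estimate when the first restart can happen. The sketch below assumes the first probe runs after initialDelaySeconds and that every probe fails immediately; it ignores timeoutSeconds and scheduling jitter, so treat the result as a rough lower bound. The function name is hypothetical.

```shell
# liveness_restart_after INITIAL_DELAY PERIOD FAILURE_THRESHOLD
# Approximate seconds after container start until the Nth consecutive
# failed liveness probe, which is when the kubelet restarts the container.
liveness_restart_after() (
  initial_delay=$1
  period=$2
  failure_threshold=$3
  echo $(( initial_delay + (failure_threshold - 1) * period ))
)

# Values from the liveness probe above: 5s delay, 10s period, 3 failures
liveness_restart_after 5 10 3   # prints 25
```

Under these assumptions, an application that needs 40 seconds to boot would be restarted at roughly 25 seconds, every time, and would never finish starting.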

For applications with expensive startup, add a startupProbe and keep liveness focused on detecting a truly stuck process:

startupProbe:
  httpGet:
    path: /health/startup
    port: 8080
  periodSeconds: 10
  failureThreshold: 18

livenessProbe:
  httpGet:
    path: /health/live
    port: 8080
  periodSeconds: 10
  failureThreshold: 3

readinessProbe:
  httpGet:
    path: /health/ready
    port: 8080
  periodSeconds: 5
  failureThreshold: 2

The exact values depend on the workload. The important design rule is clear: startup tolerance should not weaken runtime failure detection, and runtime liveness should not punish normal initialization.
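The startup allowance of a startupProbe is approximately periodSeconds multiplied by failureThreshold. As a small sketch with hypothetical helper names, the check below compares that allowance with a boot time you have measured yourself; the measured value is an input you supply, not something the code discovers.

```shell
# startup_budget PERIOD FAILURE_THRESHOLD
# Approximate seconds a container is allowed to spend starting.
startup_budget() (
  echo $(( $1 * $2 ))
)

# fits_startup_budget BOOT_SECONDS PERIOD FAILURE_THRESHOLD
# Succeeds (exit status 0) if the measured boot time fits the budget.
fits_startup_budget() (
  budget=$(startup_budget "$2" "$3")
  [ "$1" -le "$budget" ]
)
```

With the startupProbe values above, startup_budget 10 18 gives 180 seconds, so fits_startup_budget 120 10 18 succeeds while a 200 second boot would not fit.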

OOMKilled: the container exceeded its memory boundary

OOMKilled means the container process was killed by the kernel's out-of-memory killer, in most cases because it exceeded its memory limit. In practical terms, the app used more memory than Kubernetes allowed for that container. The container may restart, but the original process and everything it held in memory are gone.

Confirm it from pod state:

kubectl describe pod api-7f9d8c6d5b-k2xpl -n production

kubectl get pod api-7f9d8c6d5b-k2xpl -n production \
  -o jsonpath='{.status.containerStatuses[*].lastState.terminated.reason}'

You are looking for Reason: OOMKilled, usually with exit code 137 (128 + 9, meaning the process received SIGKILL). Do not stop at increasing the limit. That may be the right fix, but it may also hide a memory leak or an unbounded workload.

Check these areas:

  • request and limit values for the container

  • memory usage before the kill

  • request payload size and batching behavior

  • in-memory caches without eviction

  • worker concurrency

  • queue consumer prefetch size

  • file processing and buffering

  • language runtime memory settings

A minimal resource configuration is better than no resource model, but it must match the application behavior:

resources:
  requests:
    cpu: "250m"
    memory: "256Mi"
  limits:
    cpu: "1"
    memory: "512Mi"

requests influence scheduling. limits define runtime boundaries: CPU usage above the limit is throttled, while memory usage above the limit gets the container killed. If the memory limit is lower than the application’s normal peak, restarts are expected behavior, not a cluster anomaly.
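One way to check whether a limit matches the application is to express the gap between peak usage and the limit as a percentage. The sketch below uses hypothetical helper names and only understands the Mi and Gi suffixes; the peak value is something you measure, for example with kubectl top pod --containers, which assumes metrics-server is installed in the cluster.

```shell
# to_mib QUANTITY
# Converts a memory quantity such as 512Mi or 1Gi to whole MiB.
to_mib() (
  case "$1" in
    *Gi) echo $(( ${1%Gi} * 1024 )) ;;
    *Mi) echo "${1%Mi}" ;;
    *)   echo "unsupported quantity: $1" >&2; exit 1 ;;
  esac
)

# headroom_pct PEAK LIMIT
# Prints the free share of the limit at peak usage, rounded down.
headroom_pct() (
  peak=$(to_mib "$1") || exit 1
  limit=$(to_mib "$2") || exit 1
  echo $(( (limit - peak) * 100 / limit ))
)
```

For instance, headroom_pct 460Mi 512Mi prints 10, which leaves little room for a traffic spike, while headroom_pct 256Mi 512Mi prints 50.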

For backend services, memory peaks often come from traffic shape rather than average load. A service that is stable at low concurrency may fail when several large requests arrive together. A queue worker may look healthy until it processes a batch with larger payloads. A frontend server-side rendering process may have different memory behavior from a static asset server.

The production fix is usually one of these:

  • reduce concurrency per pod

  • stream large payloads instead of buffering them

  • add cache bounds and eviction

  • tune runtime memory settings

  • increase memory limits after measuring actual peaks

  • split heavy background work from request-serving containers

ImagePullBackOff: Kubernetes cannot fetch the image

ImagePullBackOff happens before your application runs. There are no useful app logs because the container does not exist yet. The source of truth is events.

kubectl describe pod api-7f9d8c6d5b-k2xpl -n production

kubectl get events -n production \
  --field-selector involvedObject.name=api-7f9d8c6d5b-k2xpl \
  --sort-by=.lastTimestamp

Common causes include:

  • typo in the image name

  • tag not pushed to the registry

  • private registry credentials missing or expired

  • imagePullSecrets not set on the pod spec or attached to the service account

  • registry rate limits or network restrictions

  • deployment points to a tag that exists in one environment but not another

A deployment fragment may look correct but still fail if the tag was never pushed or credentials are not available in the namespace:

spec:
  template:
    spec:
      imagePullSecrets:
        - name: registry-credentials
      containers:
        - name: api
          image: registry.example.com/platform/api:2026-04-22
          imagePullPolicy: IfNotPresent

For production delivery, immutable image references are easier to reason about than mutable tags such as latest. The core issue is traceability. During rollback or incident review, you need to know exactly which build was deployed.
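A quick way to audit image references is to classify them by how they are pinned. The function below is an illustrative sketch with a hypothetical name and hypothetical output labels. Note that only a digest reference (name@sha256:...) is truly immutable; a dated tag such as 2026-04-22 is traceable by convention but can still be re-pushed.

```shell
# image_ref_kind IMAGE
# Prints: digest, tag, mutable-latest, or implicit-latest
image_ref_kind() (
  ref=$1
  case "$ref" in
    *@sha256:*) echo "digest"; exit 0 ;;
  esac
  # Drop registry host and path so a port like :5000 is not read as a tag
  last=${ref##*/}
  case "$last" in
    *:latest) echo "mutable-latest" ;;
    *:*)      echo "tag" ;;
    *)        echo "implicit-latest" ;;   # no tag means :latest is assumed
  esac
)
```

For the deployment fragment above, image_ref_kind registry.example.com/platform/api:2026-04-22 prints tag. Anything that reports mutable-latest or implicit-latest is worth fixing before the next rollout.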

Probes can create failures, not only detect them

Health probes are often treated as observability configuration. They are more than that. They actively control pod lifecycle and traffic routing.

| Probe | Runtime effect | Failure behavior | Use for | Avoid using for |
| --- | --- | --- | --- | --- |
| startupProbe | Delays liveness and readiness checks until startup succeeds | Container can keep starting until the threshold is exceeded, then it is restarted | Slow initialization, warm-up, migrations that must finish before serving | Normal runtime dependency checks |
| livenessProbe | Decides whether the container should be restarted | Failed probe restarts the container | Deadlocks, stuck event loops, broken process state | Temporary database outage or slow downstream dependency |
| readinessProbe | Decides whether the pod receives traffic | Failed probe removes pod from endpoints | Dependency readiness, draining, temporary overload | Killing unhealthy processes |

A common mistake is putting database connectivity into liveness. If the database has a short outage, every pod may restart even though the application process is otherwise healthy. That can make recovery slower and amplify the incident.

A better pattern is:

  • liveness checks the local process

  • readiness checks whether the pod can safely receive traffic

  • startup handles slow boot separately

This separation reduces false restarts and makes failure behavior easier to predict.

A practical diagnosis workflow

When a pod is failing, use a fixed sequence. It prevents random debugging and keeps the team focused.

# 1. See the symptom and restart count
kubectl get pods -n production -o wide

# 2. Inspect lifecycle state and recent events
kubectl describe pod api-7f9d8c6d5b-k2xpl -n production

# 3. Read current logs
kubectl logs api-7f9d8c6d5b-k2xpl -n production

# 4. Read logs from the previous crashed container
kubectl logs api-7f9d8c6d5b-k2xpl -n production --previous

# 5. Check namespace events in time order
kubectl get events -n production --sort-by=.lastTimestamp

# 6. Inspect the owning workload
kubectl get deployment api -n production -o yaml

For multi-container pods, always specify the container name:

kubectl logs api-7f9d8c6d5b-k2xpl -n production -c api --previous

For deployments, remember that the pod is disposable. The durable configuration usually lives in the owning Deployment, StatefulSet, DaemonSet, or Job. Fixing a pod directly is rarely the right long-term move because the controller can recreate it from the original template.
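To find which controller owns a pod, you can read its ownerReferences. This is a minimal sketch with a hypothetical function name; it prints only the first owner and assumes the pod has one.

```shell
# pod_owner NAMESPACE POD
# Prints the first owner of a pod as Kind/name.
pod_owner() {
  kubectl get pod "$2" -n "$1" \
    -o jsonpath='{.metadata.ownerReferences[0].kind}/{.metadata.ownerReferences[0].name}'
}
```

For a pod created by a Deployment, the direct owner is typically a ReplicaSet, and that ReplicaSet is in turn owned by the Deployment. The Deployment's pod template is the place to make the fix.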

What to fix first in production

The immediate fix depends on the failure class.

For ImagePullBackOff, verify the image reference and registry access first. There is no point changing application configuration until Kubernetes can pull the image.

For CrashLoopBackOff, inspect previous logs and exit reason before changing probes. If the app exits because of missing configuration, relaxing liveness settings only delays the failure.

For OOMKilled, compare normal and peak memory behavior with configured limits. Increasing memory may be necessary, but also review concurrency, buffering, cache size, and workload shape.

For probe-related restarts, separate startup, liveness, and readiness. This often reduces unnecessary restarts without hiding real failures.

Conclusion

Pod failures become manageable when you stop reading the status column as the root cause. ImagePullBackOff points to image retrieval. CrashLoopBackOff points to repeated process failure or restart triggers. OOMKilled points to memory pressure against container limits.

In real projects, the difference matters. The wrong diagnosis leads to noisy restarts, unsafe probe settings, over-provisioned workloads, or slow rollbacks. The reliable approach is consistent: inspect events, read previous logs, check container state, verify probes, and connect the symptom to the exact lifecycle phase where the pod failed.