Liveness, readiness, startup: Kubernetes probes

You deploy your application to Kubernetes. In the dashboard everything is green — the pod has the Running status. And yet some users get errors. How is that possible if it “works”? Because Running says only one thing: that the process started and did not crash. It does not say whether the application inside is ready to answer — or whether it has frozen halfway through. This gap is closed by three mechanisms with confusingly similar names: liveness, readiness and startup. You will not find definitions to memorize here — you will break the application in different ways yourself and see how these three probes differ and why each one exists.

Before you start breaking anything, let’s settle what you will actually be looking at in the scenes that follow.

[Interactive diagram: a Pod containing a Container, which contains a Process. Caption: Let's start with how to read these animations. This is a Pod, the outermost box you'll keep seeing here.]

“Running” is not the same as “working”

Running means only that the process came up. Whether the application inside is working, still warming up, or hanging dead — Kubernetes does not see the difference. See how little this really tells it.

[Interactive diagram: Service, Pod, Container, Node.js process, Database. Caption: Kubernetes sees only one thing: whether the process is alive. Traffic keeps flowing to the pod and is served.]

The way to find out is a probe — a short question that Kubernetes asks the application regularly. Most often it is a short HTTP request to an agreed address (e.g. /healthz), but it can just as well check a port, run a command in the container, or ask over gRPC. The mechanism itself is secondary here; what matters is that the application can finally answer “how am I doing”. And the first, most important question is: can I send you traffic right now?
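As a rough sketch, the four mechanisms map onto four handler fields in a container spec. A probe takes exactly one handler, so the fragment below uses four hypothetical containers, one per mechanism. The container names, port numbers and the file path are assumptions made up for illustration, and fields such as the image are left out.

```yaml
containers:
  - name: http-check            # hypothetical container names throughout
    readinessProbe:
      httpGet: { path: /healthz, port: 8080 }   # success on any status from 200 to 399
  - name: tcp-check
    readinessProbe:
      tcpSocket: { port: 5432 }                 # success if the TCP connection opens
  - name: command-check
    readinessProbe:
      exec:
        command: ["cat", "/tmp/ready"]          # success if the command exits with code 0
  - name: grpc-check
    readinessProbe:
      grpc: { port: 9090 }                      # app must implement the gRPC health checking protocol
```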

Readiness: “can I take traffic now?”

Your application depends on a database. When the database disappears for a moment, the application is still alive, but it has no way to handle an order. This is not a failure that calls for a restart; it is a temporary “give me a second”. Readiness lets the application say exactly that.

[Interactive diagram: Service, Pod, Container, Node.js process, Database. Caption: Readiness answers one question: can the application serve a request right now? The pod now has a readiness probe.]
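In a manifest, a readiness probe for this scenario might look like the fragment below. It is a minimal sketch that assumes the application exposes a /readyz endpoint which returns an error while the database is unreachable; the container name is hypothetical.

```yaml
containers:
  - name: app                   # hypothetical name
    readinessProbe:
      httpGet: { path: /readyz, port: 8080 }   # assumed to fail while the database is down
      periodSeconds: 5          # ask every 5 s
      failureThreshold: 3       # 3 misses in a row: stop sending traffic, do not restart
```

While the probe fails, the pod is taken out of the Service's endpoints; once it passes again, traffic returns without any restart.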

With a single replica, readiness still leaves the user with an error. It shows its real strength when there are more replicas: traffic simply moves to a healthy one.

[Interactive diagram: Service, Pod A and Pod B (each a Container with a Node.js process), Database. Caption: Readiness answers one question: can a replica serve a request right now? The Service spreads traffic across two ready replicas.]

Readiness handles temporary trouble. But what if the application does not “take a second” and instead gets stuck for good?

Liveness: “are you still alive?”

Sometimes the application deadlocks: the process is alive, but inside it is stuck and will not free itself. Readiness alone would just cut off traffic from it — and that is all; the pod would stay dead inside forever, taking up space. That is what liveness is for.

[Interactive diagram: Service, Pod, Container, Node.js process. Caption: This time the application deadlocks: the process is alive, but stuck, and will not free itself.]
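A liveness probe for the deadlock case could be sketched as follows. The assumption is that /livez is answered by the process itself without touching any dependency, so a stuck process simply stops answering; the container name is hypothetical.

```yaml
containers:
  - name: app                   # hypothetical name
    livenessProbe:
      httpGet: { path: /livez, port: 8080 }    # assumed to check only the process itself
      periodSeconds: 10         # ask every 10 s
      timeoutSeconds: 1         # a stuck process does not answer in time
      failureThreshold: 3       # 3 failed checks in a row: the container is restarted
```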

This is the part that is hardest for a beginner to feel: the same failure, two different reactions. Readiness says “not now”, liveness says “let’s start over”. Mixing them up costs you: an unnecessary restart kills a healthy application, and a permanent cut-off leaves a dead one in place. There is one more moment where this mistake is especially painful: the first seconds of the application’s life.

Startup: “let me wake up first”

At startup your application loads a lot of data into memory — it comes up slowly. And liveness only sees “not responding”; it does not tell a deadlocked application apart from one that is just waking up. If it starts checking too early, it will treat a slow start as a failure.

[Interactive diagram: Pod, Container, Node.js process. Caption: The application loads a lot of data at start, so it boots slowly. Start it; begin without a startup probe.]
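One way to give a slow starter room is a startup probe placed next to the liveness probe. While the startup probe has not yet succeeded, Kubernetes holds back the liveness and readiness checks. The values below are illustrative assumptions, not recommendations.

```yaml
containers:
  - name: app                   # hypothetical name
    startupProbe:
      httpGet: { path: /livez, port: 8080 }
      periodSeconds: 5
      failureThreshold: 30      # 30 x 5 s = up to 150 s to finish starting
    livenessProbe:
      httpGet: { path: /livez, port: 8080 }
      periodSeconds: 10         # only begins once the startup probe has succeeded
```

If the application still has not answered when the 150 s budget runs out, the container is killed and restarted, so the budget should cover the slowest realistic start.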

You now have three probes and you know what each one is for. But where do you actually set them — and what happens when there is more than one container in a single pod?

Where probes live: container or pod?

Probes do not belong to the pod as a whole — you set them separately for each container inside. And a single pod sometimes holds several containers (e.g. the application and a helper sidecar next to it). This raises a question that surprises beginners: if each container has its own probes, then what does the failure of one of them actually act on?

[Interactive diagram: Service and one Pod with two containers, App (Node.js process) and Sidecar (log sidecar). Caption: Probes are set per container in a pod, and a pod can hold several (an app plus a helper sidecar).]
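The short answer to that question: a failed liveness probe restarts only the container it belongs to, while readiness is counted for the pod as a whole, so one container that is not ready takes the entire pod out of the Service. The fragment below sketches a pod with two containers, each with its own probes; the sidecar name, port and path are hypothetical.

```yaml
containers:
  - name: app
    readinessProbe:
      httpGet: { path: /readyz, port: 8080 }
    livenessProbe:
      httpGet: { path: /livez, port: 8080 }    # failure restarts only "app"
  - name: log-sidecar           # hypothetical sidecar
    livenessProbe:
      httpGet: { path: /livez, port: 9090 }    # hypothetical port; failure restarts only the sidecar
```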

You already know what each probe does and what its decision concerns. Let’s look inside a single probe — because when and how often it asks decides whether it works sensibly.

The rhythm of questions: when and how often

A probe does not ask non-stop. It has its own rhythm, set with two knobs: how long to wait before the first question (initialDelaySeconds) and how often to ask afterwards (periodSeconds). These two values decide how quickly Kubernetes even notices that something is wrong.

[Interactive diagram: Service, Pod, Container, Node.js process; control: Start delay (1 s). Caption: First set the probe's rhythm. The pod is still waiting (Pending) and nobody is asking it anything.]
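In a manifest the two knobs sit directly under the probe. A minimal sketch with illustrative values:

```yaml
containers:
  - name: app                   # hypothetical name
    readinessProbe:
      httpGet: { path: /readyz, port: 8080 }
      initialDelaySeconds: 1    # wait 1 s after the container starts before the first check
      periodSeconds: 5          # then ask every 5 s
```

With these values, a problem that appears just after a check can go unseen for up to about 5 s, until the next question is asked.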

But rhythm is not everything. Because what if the application answers badly once in a while — is that immediately a failure?

How many failures make a failure: thresholds

First, what counts as a single failure: the question does not have to get a bad answer; it is enough that the answer comes too late. timeoutSeconds is the window for an answer; if the application stays silent longer, that check counts as a failure. And one failure is not yet a verdict. failureThreshold says how many failures in a row must happen before Kubernetes acts (restarts the container for liveness, cuts off traffic for readiness). successThreshold works the other way: how many good answers in a row are needed to consider the application healthy again. In practice this knob matters only for readiness, because liveness and startup probes require the value 1. Together they protect against a single hiccup and against flickering “once good, once bad”.

[Interactive diagram: Service, Pod, Container, Node.js process; control: Failures in a row (3×). Caption: A single bad answer isn't a verdict. First set the thresholds. The pod is waiting (Pending) and nobody is asking it yet.]
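Put together, the three fields might be sketched like this for a readiness probe. The values are illustrative assumptions.

```yaml
containers:
  - name: app                   # hypothetical name
    readinessProbe:
      httpGet: { path: /readyz, port: 8080 }
      periodSeconds: 5
      timeoutSeconds: 1         # no answer within 1 s counts as a failed check
      failureThreshold: 3       # 3 failed checks in a row: mark not ready
      successThreshold: 2       # 2 good answers in a row: mark ready again
```

As a rough estimate, Kubernetes reacts about failureThreshold × periodSeconds after the trouble starts, here roughly 10 to 15 s, and the pod gets traffic back only after two consecutive good answers.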

You now know all the knobs: rhythm and thresholds. Now see how bad values can turn against you.

Tuning: how not to hurt yourself

These are not “set and forget” settings. The same knobs turned the wrong way do more harm than no probes at all. Let’s look at four typical ways to hurt yourself.

The most dangerous one is a liveness that is too sensitive under load.

[Interactive diagram: Service, Pod (CPU gauge, 0%), Container, Node.js process; control: Traffic slider (0%). Caption: One pod with a process and a liveness probe checking it. The line between the probe and the process is the response time. Turn the traffic up with the slider and watch what happens.]

On a single pod this is a restart over and over. With many replicas it turns into a cascade — they fall one after another until the service is down.

[Interactive diagram: nine replicas; control: Traffic slider (0%). Caption: Nine replicas: looks robust. Turn the traffic up and watch what the same over-eager liveness does to them.]
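To make “too sensitive” concrete, the fragment below contrasts two liveness settings on two hypothetical containers. The numbers are assumptions chosen for illustration; suitable values depend on how slowly the application answers under its real peak load.

```yaml
containers:
  - name: too-sensitive         # hypothetical: one slow answer under load triggers a restart
    livenessProbe:
      httpGet: { path: /livez, port: 8080 }
      periodSeconds: 2
      timeoutSeconds: 1
      failureThreshold: 1
  - name: more-tolerant         # hypothetical: needs 3 slow or failed answers in a row
    livenessProbe:
      httpGet: { path: /livez, port: 8080 }
      periodSeconds: 10
      timeoutSeconds: 3
      failureThreshold: 3
```

The second variant reacts more slowly to a real deadlock, which is the price of not restarting a container that is merely busy.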

The other extreme looks innocent: a probe that is too lazy.

[Interactive diagram: Service, Pod A and Pod B (each a Container with a Node.js process). Caption: A lazy probe looks cheap, but asking too rarely has a price. Set the interval and see for yourself.]
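A lazy readiness probe might look like the sketch below; the interval is an exaggerated, hypothetical value.

```yaml
containers:
  - name: app                   # hypothetical name
    readinessProbe:
      httpGet: { path: /readyz, port: 8080 }
      periodSeconds: 60         # one question per minute
      failureThreshold: 3       # 3 misses in a row before traffic is cut off
```

With these numbers a broken replica can keep receiving traffic for roughly two to three minutes before Kubernetes reacts.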

What matters is not only how often you ask, but also what you ask about. A liveness probe that knocks on a shared dependency can turn a single glitch into a failure of the whole service.

[Interactive diagram: Service, Pod A and Pod B (each a Container with a Node.js process), Database. Caption: Two replicas share one database, and each pod's liveness pokes that database. Interestingly, the app itself keeps working correctly regardless of the database's health. Break the database and see.]

The most common mistake, though, is more subtle: confusing the roles and using the same check for liveness and readiness.

[Interactive diagram: Service, Pod, Container, Node.js process, Database. Caption: The app depends on a database, and liveness asks the same question as readiness. Break the database and see.]
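One way to avoid both of the last two mistakes is to give each probe its own endpoint with its own meaning. The sketch below assumes the application implements the two endpoints as described in the comments; that behavior lives in the application code, not in the manifest.

```yaml
containers:
  - name: app                   # hypothetical name
    readinessProbe:
      httpGet: { path: /readyz, port: 8080 }   # assumed to check the database connection
    livenessProbe:
      httpGet: { path: /livez, port: 8080 }    # assumed to check only the process, never the database
```

With this split, a database outage makes the pod not ready but does not restart it, since a restart would not bring the database back.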

Hence the practical rule: set liveness carefully, and only for what truly cannot be saved any other way than a restart; leave temporary trouble to readiness; cover a slow start with a startup probe.

Summary

Below is a manifest with the settings we talked about.

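containers:
  - name: container
    # Readiness: "can I take traffic now?" Failing it only cuts off traffic.
    readinessProbe:
      httpGet: { path: /readyz, port: 8080 }
      initialDelaySeconds: 0   # start asking right away
      periodSeconds: 5         # ask every 5 s
      timeoutSeconds: 1        # no answer within 1 s counts as a failure
      failureThreshold: 3      # 3 failures in a row: stop sending traffic
      successThreshold: 1      # 1 good answer: send traffic again
    # Liveness: "are you still alive?" Failing it restarts the container.
    livenessProbe:
      httpGet: { path: /livez, port: 8080 }
      initialDelaySeconds: 0
      periodSeconds: 10
      timeoutSeconds: 1
      failureThreshold: 3      # 3 failures in a row: restart
    # Startup: "let me wake up first". Liveness and readiness wait until it passes.
    startupProbe:
      httpGet: { path: /livez, port: 8080 }
      initialDelaySeconds: 0
      periodSeconds: 5
      timeoutSeconds: 1
      failureThreshold: 30     # 30 x 5 s = up to 150 s to start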

Well-chosen probes catch failures that the Running status does not show, and badly chosen ones can do more harm than no probes at all. So tune them carefully and for the specific application, never blindly. But the hardest part is already behind you: what matters is not memorizing the settings, but knowing which question each probe asks. You will tune the rest once you know what to look for.

This article builds the intuition, not the full reference. For the complete list of fields, defaults and probe types (HTTP, TCP, command, gRPC), check the official Kubernetes documentation: Configure Liveness, Readiness and Startup Probes.
