Graceful shutdown in Go becomes important the first time a deployment causes dropped HTTP requests, duplicated queue messages, or half-finished background jobs. The problem is rarely Go itself. The problem is that many services treat shutdown as an afterthought: catch SIGTERM, call Close(), and hope the process exits cleanly.
In Kubernetes, that is not enough. During a rollout, a pod may still receive traffic for a short period, active requests may still be running, workers may still be processing jobs, and Kafka or RabbitMQ consumers may still hold messages that are not safe to acknowledge. A correct shutdown path must coordinate all of these moving parts within a bounded termination window.
## Graceful shutdown is a lifecycle, not a signal handler
A production Go service usually has several concurrent responsibilities:
- An HTTP API accepting external requests
- Background workers running scheduled or internal jobs
- Kafka, RabbitMQ, or other queue consumers
- Database connections and transactions
- Metrics, tracing, and log flushing
- Kubernetes readiness and termination behavior
The mistake is to wire all of this directly to os.Signal and let every component decide what to do. That creates races. The HTTP server may still accept requests while the consumer is closing. A worker may start a new job after shutdown has already begun. A message may be acknowledged before its side effects are safely committed.
A better model is explicit application lifecycle management:
1. Receive the termination signal.
2. Mark the process as not ready.
3. Stop accepting new work.
4. Let in-flight work finish within a deadline.
5. Cancel remaining work.
6. Flush telemetry and close resources.
7. Exit with a predictable status.
A graceful shutdown is not successful because the process exits. It is successful because the process exits without creating ambiguous work.
## What Kubernetes changes during rollout
In a typical Kubernetes rollout, old pods are terminated while new pods are started. The old pod receives SIGTERM, and Kubernetes waits for the configured termination grace period before forcing the container to stop.
From the application point of view, the risky part is the gap between “the pod is terminating” and “no traffic or work can reach it.” Service endpoint updates, load balancer behavior, client retries, and connection reuse can overlap. This means the application should not assume that receiving SIGTERM instantly removes it from all traffic paths.
The service should become unready early, then stop accepting new work, then drain existing work. Readiness is not a replacement for shutdown logic, but it is an important part of the contract.
## Naive shutdown versus production shutdown
| Area | Naive approach | Production-oriented approach | Runtime behavior |
|---|---|---|---|
| HTTP API | Exit on SIGTERM | Stop accepting new requests and drain active ones | Lower risk of dropped requests |
| Readiness | Always returns 200 until exit | Returns failure once shutdown starts | Pod leaves traffic rotation earlier |
| Workers | Loop until process dies | Stop polling, finish current job, respect deadline | Fewer partial jobs |
| Kafka consumers | Close immediately | Stop fetching, finish processing, commit completed offsets | Lower duplicate or lost processing risk |
| RabbitMQ consumers | Ack early or close channel | Ack only after successful processing, nack or requeue unfinished work | Clearer message ownership |
| Shutdown deadline | No explicit timeout | Context with bounded grace period | Predictable exit behavior |
| Observability | Logs disappear on exit | Final logs and metrics are flushed where possible | Easier rollout debugging |
The goal is not to make shutdown infinitely patient. It is to make shutdown bounded and understandable.
## A practical Go shutdown skeleton
The core pattern is a root context controlled by signals, plus a separate shutdown context with a deadline. The root context tells components to stop starting new work. The shutdown context limits how long the service will wait.
```go
package main

import (
	"context"
	"errors"
	"log/slog"
	"net/http"
	"os"
	"os/signal"
	"sync/atomic"
	"syscall"
	"time"
)

func main() {
	var shuttingDown atomic.Bool

	rootCtx, stop := signal.NotifyContext(
		context.Background(),
		syscall.SIGINT,
		syscall.SIGTERM,
	)
	defer stop()

	mux := http.NewServeMux()
	mux.HandleFunc("/readyz", func(w http.ResponseWriter, r *http.Request) {
		if shuttingDown.Load() {
			http.Error(w, "shutting down", http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK)
	})
	mux.HandleFunc("/api/orders", func(w http.ResponseWriter, r *http.Request) {
		select {
		case <-r.Context().Done():
			return
		default:
			// Handle request using r.Context().
			w.WriteHeader(http.StatusAccepted)
		}
	})

	server := &http.Server{
		Addr:              ":8080",
		Handler:           mux,
		ReadHeaderTimeout: 5 * time.Second,
	}

	errCh := make(chan error, 1)
	go func() {
		slog.Info("http server started", "addr", server.Addr)
		errCh <- server.ListenAndServe()
	}()

	select {
	case <-rootCtx.Done():
		slog.Info("shutdown signal received")
	case err := <-errCh:
		if !errors.Is(err, http.ErrServerClosed) {
			slog.Error("http server failed", "error", err)
			os.Exit(1)
		}
	}

	shuttingDown.Store(true)

	shutdownCtx, cancel := context.WithTimeout(context.Background(), 25*time.Second)
	defer cancel()

	if err := server.Shutdown(shutdownCtx); err != nil {
		slog.Error("http shutdown deadline exceeded", "error", err)
		_ = server.Close()
	}

	slog.Info("service stopped")
}
```

This does three important things:
1. Readiness fails after shutdown starts.
2. `http.Server.Shutdown` stops accepting new connections and waits for active handlers.
3. The shutdown wait is bounded.
The timeout should be lower than the Kubernetes termination grace period, leaving time for final logs and cleanup. For example, if the pod has a 30 second grace period, the application should not spend all 30 seconds inside HTTP shutdown.
## Kubernetes configuration should match the app lifecycle
The Kubernetes side should reflect the same assumptions. The termination grace period must be long enough for realistic request and worker completion, but not so long that rollouts stall when a pod is unhealthy.
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: orders-api
spec:
  replicas: 3
  selector:
    matchLabels:
      app: orders-api
  template:
    metadata:
      labels:
        app: orders-api
    spec:
      terminationGracePeriodSeconds: 35
      containers:
        - name: app
          image: orders-api:latest
          ports:
            - containerPort: 8080
          readinessProbe:
            httpGet:
              path: /readyz
              port: 8080
            periodSeconds: 5
            failureThreshold: 1
```

A preStop hook can sometimes help when external load balancers need a short delay before traffic fully drains, but it should not be the main shutdown mechanism. The application still needs to handle SIGTERM correctly. Sleeping in preStop without application-level draining only moves the race somewhere else.
## Workers: stop polling before stopping execution
Workers need a different shutdown strategy from HTTP handlers. A worker should stop taking new jobs as soon as shutdown begins, but it may continue the job it already owns if there is enough time left.
The worker loop should be driven by context. Avoid loops that ignore cancellation until the next long sleep or blocking operation finishes.
```go
func runWorker(ctx context.Context, jobs <-chan Job, handle func(context.Context, Job) error) error {
	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case job, ok := <-jobs:
			if !ok {
				return nil
			}
			jobCtx, cancel := context.WithTimeout(ctx, 20*time.Second)
			err := handle(jobCtx, job)
			cancel()
			if err != nil {
				// Record failure, retry, or let the queue mechanism redeliver.
				// The exact behavior should match the job's idempotency model.
				return err
			}
		}
	}
}
```

The important rule is simple: shutdown should prevent new work from being claimed. It should not blindly kill work that is already past the point of no return.
For long-running jobs, add checkpoints. A job that takes several minutes should not be treated as one uninterruptible unit unless the platform allows that much termination time. Store progress explicitly, make operations idempotent, and design retries as part of the normal path.
## Kafka consumers: commit only what is complete
Kafka shutdown is mostly about offset ownership. The unsafe pattern is to read a message, commit the offset, and then process the message. If the process dies after the commit but before the side effect, the message may be skipped from the consumer group’s point of view.
A safer pattern is:
Stop polling when shutdown starts.
Finish messages already handed to handlers.
Commit offsets only after successful processing.
Close the consumer before the shutdown deadline expires.
Pseudocode varies by client library, but the lifecycle should look like this:
```go
func consumeKafka(ctx context.Context, consumer Consumer, handle func(context.Context, Message) error) error {
	defer consumer.Close()
	for {
		select {
		case <-ctx.Done():
			return nil
		default:
			msg, err := consumer.Poll(ctx)
			if err != nil {
				if ctx.Err() != nil {
					return nil
				}
				return err
			}
			if err := handle(ctx, msg); err != nil {
				// Do not commit a failed message.
				// Let retry, dead-letter, or redelivery policy handle it.
				return err
			}
			if err := consumer.Commit(ctx, msg); err != nil {
				return err
			}
		}
	}
}
```

This does not eliminate duplicate processing. Kafka consumers should still be idempotent because a process can crash between the side effect and the commit. Graceful shutdown reduces unnecessary duplicates during planned rollouts, but it is not a substitute for idempotency.
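Handler-side idempotency usually means guarding the side effect with a deduplication key. A minimal in-memory sketch follows; a real system would back the `seen` set with a durable store keyed by topic/partition/offset or a business identifier, since an in-memory guard does not survive restarts:

```go
package main

import (
	"fmt"
	"sync"
)

// dedup is an in-memory idempotency guard. Production systems would use
// a durable store so the guard survives process restarts.
type dedup struct {
	mu   sync.Mutex
	seen map[string]bool
}

// Once runs fn only the first time key is observed.
func (d *dedup) Once(key string, fn func() error) error {
	d.mu.Lock()
	if d.seen[key] {
		d.mu.Unlock()
		return nil // duplicate delivery: side effect already applied
	}
	d.mu.Unlock()
	if err := fn(); err != nil {
		return err // failed work is not marked as seen, so it can retry
	}
	d.mu.Lock()
	d.seen[key] = true
	d.mu.Unlock()
	return nil
}

func main() {
	d := &dedup{seen: map[string]bool{}}
	applied := 0
	key := "orders/3/1042" // e.g. topic/partition/offset
	_ = d.Once(key, func() error { applied++; return nil })
	_ = d.Once(key, func() error { applied++; return nil })
	fmt.Println(applied) // the side effect runs once despite redelivery
}
```

Note that this sketch marks the key only after `fn` succeeds, mirroring the commit-after-processing rule above: a crash mid-work leads to a retry, never to silently skipped work.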
## RabbitMQ consumers: acknowledge after durable success
RabbitMQ has a similar but not identical concern. The key decision is when to acknowledge a delivery. If the consumer sends ack before the business operation is durable, a shutdown can lose work. If it never acknowledges successful work, the message may be redelivered and processed again.
A production consumer should usually:
- Use manual acknowledgements.
- Ack only after successful processing.
- Nack or requeue when work cannot finish safely.
- Stop consuming new deliveries when shutdown starts.
- Keep handler concurrency bounded.
```go
func handleDelivery(ctx context.Context, d Delivery, process func(context.Context, []byte) error) {
	err := process(ctx, d.Body)
	if err == nil {
		_ = d.Ack(false)
		return
	}
	if ctx.Err() != nil {
		// Shutdown interrupted processing. Requeue unless the message is known unsafe to retry.
		_ = d.Nack(false, true)
		return
	}
	// Non-shutdown failure. Route according to retry or dead-letter policy.
	_ = d.Nack(false, false)
}
```

The exact retry policy depends on the system. Some messages should be requeued. Some should go to a dead-letter exchange after bounded attempts. The shutdown path should not invent a different reliability model from the normal failure path.
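Keeping that policy in one pure function makes it testable and keeps the shutdown path and the normal failure path consistent. A sketch; the outcomes and `maxAttempts` are policy assumptions, and in a real consumer the attempt count would typically be read from the delivery's `x-death` header on a dead-letter retry cycle:

```go
package main

import "fmt"

// routeFailure decides what to do with a failed delivery. The three
// outcomes map onto the Nack calls in handleDelivery above:
//   "requeue"     -> Nack(false, true)
//   "dead-letter" -> Nack(false, false) with a DLX configured
//   "retry"       -> whatever bounded-retry mechanism the system uses
func routeFailure(attempts, maxAttempts int, interruptedByShutdown bool) string {
	if interruptedByShutdown {
		// Shutdown is not a business failure; give the message back intact.
		return "requeue"
	}
	if attempts >= maxAttempts {
		return "dead-letter"
	}
	return "retry"
}

func main() {
	fmt.Println(routeFailure(1, 3, true))
	fmt.Println(routeFailure(3, 3, false))
	fmt.Println(routeFailure(1, 3, false))
}
```

Because the same function runs during rollouts and during ordinary failures, shutdown cannot accidentally introduce a second, inconsistent reliability model.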
## Coordination: use one shutdown budget
One common production bug is giving every component its own full timeout. The API waits 30 seconds, then the worker waits 30 seconds, then the consumer waits 30 seconds, while Kubernetes only allows 45 seconds. This works in local testing and fails during rollout.
Use one process-level shutdown budget and divide it deliberately:
| Component | Shutdown action | Typical constraint |
|---|---|---|
| Readiness | Fail immediately | Should happen first |
| HTTP server | Drain active requests | Bound by request timeout |
| Workers | Finish current job or checkpoint | Bound by job design |
| Kafka consumer | Stop polling, finish, commit | Bound by broker session and app deadline |
| RabbitMQ consumer | Stop consuming, ack or nack owned messages | Bound by handler deadline |
| Telemetry | Flush logs, traces, metrics | Small remaining budget |
The budget should be based on real request timeouts and job behavior, not wishful thinking. If a handler can take two minutes but the pod gets 30 seconds to terminate, the system is already inconsistent.
## Testing graceful shutdown
Graceful shutdown should be tested as a behavior, not reviewed as code style. Useful tests include:
- Send a long HTTP request, trigger `SIGTERM`, verify the request completes.
- Trigger `SIGTERM`, verify readiness fails before process exit.
- Start a worker job, cancel the context, verify no new job is claimed.
- Process a Kafka message, terminate before commit, verify it can be processed again.
- Process a RabbitMQ delivery, terminate before ack, verify requeue behavior.
- Run a Kubernetes rollout under load and inspect errors, retries, and duplicate work.
Local tests can cover most lifecycle bugs. Cluster tests reveal integration timing issues: readiness propagation, load balancer behavior, connection reuse, and shutdown budget mismatches.
## What to adopt first
A team does not need a framework to improve graceful shutdown. The most useful first steps are practical:
- Add a single application shutdown context.
- Make readiness fail as soon as shutdown begins.
- Use `http.Server.Shutdown` instead of abruptly closing the process.
- Stop workers and consumers from claiming new work.
- Acknowledge or commit messages only after durable success.
- Make handlers and jobs context-aware.
- Align application timeouts with `terminationGracePeriodSeconds`.
## Conclusion
Graceful shutdown in Go is not a cleanup function at the end of main. It is part of the service contract. In Kubernetes, that contract includes readiness, signal handling, HTTP draining, worker cancellation, broker acknowledgements, and a realistic shutdown budget.
The production goal is not to avoid every retry or duplicate. Distributed systems cannot promise that during every failure mode. The goal is to make planned termination boring: no new work after shutdown starts, active work gets a fair deadline, completed work is recorded correctly, and unfinished work has a clear retry path. That is what turns Kubernetes rollouts from a reliability risk into a routine operation.