
Goroutine Leaks, Context Misuse, and Retry Loops in Go Services

Go makes concurrency cheap, but not free. In production services, leaked goroutines, ignored contexts, and unbounded retries often show up as memory growth, hanging requests, noisy shutdowns, and incidents that are hard to reproduce locally.


Go services rarely fail because one goroutine panics in a clean, obvious way. More often, the service keeps running while memory grows, request latency drifts upward, shutdown hangs, and downstream dependencies receive traffic long after the original client has gone away.

The common thread is lifecycle ownership. A production-grade Go service should make it clear who starts work, who can cancel it, how long it may run, how failures are retried, and what happens during shutdown. When those rules are implicit, goroutine leaks, context misuse, and infinite retry loops become operational bugs rather than code style issues.

The incident pattern: the service is alive, but not healthy

Many Go incidents around concurrency have the same shape:

  • Memory usage grows slowly after deployments or traffic spikes.

  • Goroutine count increases and never returns to baseline.

  • Requests hang even when clients have already timed out.

  • Shutdown takes too long or is force-killed by the orchestrator.

  • Retry loops amplify dependency failures instead of absorbing them.

  • Logs show repeated work, but no single stack trace explains the outage.

These failures are difficult because each individual code path may look harmless. A goroutine is small. A retry looks defensive. A background worker seems simple. The problem appears when many requests, timeouts, cancellations, and deploys interact.

In Go services, concurrency bugs are often ownership bugs: work starts somewhere, but nothing has clear authority to stop it.

Goroutine leaks: cheap concurrency still needs an owner

A goroutine leak happens when a goroutine remains blocked or running after the work that needed it is gone. The leak may hold memory, a database connection, a file descriptor, a channel receive, a timer, or simply a stack that keeps growing under pressure.

A common production mistake is starting asynchronous work inside a request path without tying it to request cancellation or service shutdown.

// Bad: this goroutine can outlive the request indefinitely.
func handle(w http.ResponseWriter, r *http.Request) {
    jobs := make(chan Job)

    go func() {
        for job := range jobs {
            process(job)
        }
    }()

    jobs <- Job{ID: r.URL.Query().Get("id")}
    w.WriteHeader(http.StatusAccepted)
}

This code has several lifecycle problems:

  • The worker is created per request.

  • The channel is never closed.

  • The goroutine has no cancellation path.

  • process has no deadline.

  • Shutdown cannot wait for or stop the work cleanly.

Under low traffic, this may appear to work. Under load, it can create a growing number of blocked goroutines. During deploys, it can also continue processing after the server has stopped accepting new requests.

A safer design gives the worker a service-level lifecycle and passes context into the unit of work.

type Worker struct {
    jobs chan Job
    wg   sync.WaitGroup
}

func NewWorker(buffer int) *Worker {
    return &Worker{
        jobs: make(chan Job, buffer),
    }
}

func (w *Worker) Start(ctx context.Context) {
    w.wg.Add(1)

    go func() {
        defer w.wg.Done()

        for {
            select {
            case <-ctx.Done():
                return
            case job, ok := <-w.jobs:
                if !ok {
                    return
                }

                jobCtx, cancel := context.WithTimeout(ctx, 10*time.Second)
                process(jobCtx, job)
                cancel()
            }
        }
    }()
}

func (w *Worker) Stop() {
    close(w.jobs)
    w.wg.Wait()
}

func (w *Worker) Enqueue(ctx context.Context, job Job) error {
    select {
    case w.jobs <- job:
        return nil
    case <-ctx.Done():
        return ctx.Err()
    }
}

This does not make the system perfect, but it establishes important production rules:

  • The worker starts once, not once per request.

  • The queue is bounded.

  • Enqueue respects caller cancellation.

  • Processing has a deadline.

  • Shutdown can close the queue and wait for the worker.

Context misuse: cancellation is not optional metadata

context.Context is often treated as a parameter that must be passed around to satisfy APIs. That misses its main purpose. Context carries cancellation, deadlines, and request-scoped values across boundaries.

The most damaging mistakes are usually simple:

Mistake | Runtime behavior | Production consequence
Replacing request context with context.Background() | Work ignores client disconnects | Wasted CPU, DB queries continue after timeout
Storing context in a struct | Context outlives its intended scope | Confusing cancellation and hidden coupling
Not checking ctx.Done() in loops | Long work cannot be interrupted | Slow shutdown, stuck requests
Creating timeouts without calling cancel() | Timers live longer than needed | Higher memory pressure under load
Passing context but ignoring it downstream | Cancellation stops at boundary | Partial timeout handling, hard debugging

This example looks reasonable, but it breaks cancellation propagation.

// Bad: the downstream call ignores the request lifecycle.
func getProfile(r *http.Request, client *http.Client, userID string) (*Profile, error) {
    ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
    defer cancel()

    req, err := http.NewRequestWithContext(ctx, http.MethodGet, "/profiles/"+userID, nil)
    if err != nil {
        return nil, err
    }

    return callProfileService(client, req)
}

The timeout exists, but the parent is wrong. If the client disconnects or the server starts shutting down, this work continues until its own timeout expires. The correct parent is the request context.

func getProfile(r *http.Request, client *http.Client, userID string) (*Profile, error) {
    ctx, cancel := context.WithTimeout(r.Context(), 2*time.Second)
    defer cancel()

    req, err := http.NewRequestWithContext(ctx, http.MethodGet, "/profiles/"+userID, nil)
    if err != nil {
        return nil, err
    }

    return callProfileService(client, req)
}

For service code, a practical rule is simple: accept ctx context.Context as the first parameter of functions that perform I/O, wait, lock, retry, or call external systems. Do not create a new root context unless you are at a true process boundary, such as application startup or a controlled background service.

Infinite retries: resilience can become load amplification

Retries are useful only when they are bounded, delayed, and cancellation-aware. Otherwise they turn one failed request into repeated traffic against a dependency that may already be overloaded.

The dangerous pattern is a retry loop that does not respect context and does not have a budget.

// Bad: infinite retry, no backoff, no cancellation.
func publish(event Event) {
    for {
        err := broker.Publish(event)
        if err == nil {
            return
        }

        time.Sleep(100 * time.Millisecond)
    }
}

This code can run forever during a dependency outage. If many goroutines enter this loop, the service may keep consuming memory and CPU while also hammering the broker.

A safer retry loop has four properties:

  1. It stops when the context is cancelled.

  2. It has a maximum number of attempts or a deadline.

  3. It waits between attempts.

  4. It treats non-retryable errors differently from transient failures.

func publishWithRetry(ctx context.Context, broker Broker, event Event) error {
    const maxAttempts = 4
    delay := 100 * time.Millisecond

    var lastErr error
    for attempt := 1; attempt <= maxAttempts; attempt++ {
        lastErr = broker.Publish(ctx, event)
        if lastErr == nil {
            return nil
        }

        if !isRetryable(lastErr) {
            return lastErr
        }

        if attempt == maxAttempts {
            break // no point sleeping after the final attempt
        }

        timer := time.NewTimer(delay)

        select {
        case <-ctx.Done():
            timer.Stop()
            return ctx.Err()
        case <-timer.C:
        }

        delay *= 2
    }

    return fmt.Errorf("publish failed after %d attempts: %w", maxAttempts, lastErr)
}

The exact retry budget depends on workload and dependency behavior. A user-facing request often needs a small budget because it is bound by latency. A background queue may tolerate more attempts, but should still have visibility, dead-letter handling, or a clear failure state.

Shutdown is where lifecycle bugs become visible

Incorrect shutdown often exposes hidden leaks. The service receives a termination signal, stops accepting new traffic, and then waits. If background goroutines ignore cancellation, active requests never finish, workers keep reading, and the process is eventually killed.

A production shutdown path should coordinate the HTTP server, background workers, queues, and external clients through a shared root context.

func main() {
    root, stop := signal.NotifyContext(context.Background(), os.Interrupt, syscall.SIGTERM)
    defer stop()

    worker := NewWorker(100)
    worker.Start(root)

    srv := &http.Server{
        Addr:    ":8080",
        Handler: routes(worker),
    }

    go func() {
        if err := srv.ListenAndServe(); err != nil && err != http.ErrServerClosed {
            log.Printf("server error: %v", err)
            stop()
        }
    }()

    <-root.Done()

    shutdownCtx, cancel := context.WithTimeout(context.Background(), 15*time.Second)
    defer cancel()

    if err := srv.Shutdown(shutdownCtx); err != nil {
        log.Printf("http shutdown error: %v", err)
    }

    worker.Stop()
}

This pattern separates two concerns:

  • root tells long-running service components that shutdown has started.

  • shutdownCtx limits how long graceful shutdown may take.

The important part is not the exact timeout value. It is that shutdown has a budget and the rest of the service is written to respect cancellation.

Shortcut vs production-grade behavior

Area | Shortcut implementation | Runtime behavior | Production risk | Production-grade direction
Goroutines | Start inside request handlers | Per-request, untracked | Memory growth, blocked sends | Service-owned workers
Queues | Unbounded or implicit | Memory grows with traffic | Backpressure failure | Bounded channels or external queue
Context | Use context.Background() in deep code | Ignores caller cancellation | Hanging work after timeout | Propagate parent context
Retry | for { retry } | Infinite work | Load amplification | Attempt budget, backoff, cancellation
Shutdown | Stop HTTP only | Background work continues | Forced termination | Coordinated lifecycle
Observability | Only error logs | Leaks are indirect | Slow incident diagnosis | Goroutine count, queue depth, latency, cancellation errors

The practical goal is not to avoid goroutines or retries. It is to make concurrency observable and bounded. Every long-running component should have a start path, a stop path, and a way to report whether it is falling behind.

What to adopt first

Teams do not need to redesign every service at once. The highest-value changes are usually small and mechanical.

Start with these:

  • Pass request context into all I/O operations.

  • Remove context.Background() from business logic and client calls.

  • Make every retry loop bounded and cancellation-aware.

  • Avoid starting untracked goroutines inside handlers.

  • Add shutdown tests for workers and HTTP handlers.

  • Track goroutine count, queue depth, request latency, and dependency errors.

  • Review code paths that use time.Sleep, for, channels, and background jobs.

A useful review question is: “What stops this work?” If the answer is unclear, the code likely has a lifecycle bug.

For engineers who work with Go services in production and want to validate practical backend and concurrency skills, the most relevant certification to review is Senior Go Developer.


Conclusion

Go gives teams a compact and efficient concurrency model, but production safety depends on explicit lifecycle design. Goroutines need owners. Contexts need to propagate cancellation. Retries need budgets. Shutdown needs coordination.

Most incidents in this area are not caused by advanced language edge cases. They come from ordinary code paths that forget to answer basic operational questions: who started this work, how long may it run, what happens when the caller leaves, and how does the process stop?

A service that answers those questions in code is easier to debug, safer to deploy, and less likely to turn a small dependency problem into a full production incident.