A slow Python API is often misdiagnosed as a framework problem, an async problem, or a language problem. In real services, the endpoint is usually slow because one part of the request path blocks everything else: a synchronous library inside an async handler, a database pool with too few connections, an ORM query that loads too much data, or a response model that spends too much CPU turning objects into JSON.
The central mistake is treating async as a performance switch. It is not. Async helps when the service spends much of its time waiting on I/O and every I/O operation cooperates with the event loop. If one dependency blocks, the endpoint can still look modern while behaving like a single-file queue.
## Async is a concurrency model, not a speed feature
An async endpoint does not make the database faster, reduce query complexity, compress payloads, or remove CPU cost. It mainly changes what the worker can do while waiting.
That distinction matters in APIs where each request performs several remote operations:
- reads from PostgreSQL or MySQL
- calls another internal service
- queries Redis
- validates and serializes response objects
- writes logs, metrics, or audit events
If those operations are properly asynchronous, the worker can serve other requests while waiting. If any operation blocks the event loop, other requests wait behind it.
A common mistake is writing an async route but calling blocking code inside it:
```python
# Bad: async route, blocking work inside it
import time

import requests
from fastapi import FastAPI

app = FastAPI()

@app.get("/orders/{order_id}")
async def get_order(order_id: int):
    time.sleep(0.2)  # blocks the event loop
    response = requests.get(f"https://billing.internal/orders/{order_id}")
    return response.json()
```

This endpoint is syntactically async but operationally blocking. Under low traffic, it may appear fine. Under concurrent traffic, requests queue behind `time.sleep()` and `requests.get()` because neither yields control back to the event loop.
A better version uses non-blocking libraries and avoids artificial blocking:
```python
# Better: async route with cooperative I/O
import httpx
from fastapi import FastAPI

app = FastAPI()

@app.get("/orders/{order_id}")
async def get_order(order_id: int):
    async with httpx.AsyncClient(timeout=2.0) as client:
        response = await client.get(f"https://billing.internal/orders/{order_id}")
        response.raise_for_status()
        return response.json()
```

This still does not guarantee low latency. The billing service can be slow. DNS can be slow. JSON parsing can be costly. But the worker is no longer blocked while waiting for the network.
Async code improves concurrency only when the slow part of the request cooperates with the concurrency model.
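When a dependency is sync-only and cannot be replaced, the blocking call can at least be moved off the event loop. A minimal sketch using `asyncio.to_thread` (Python 3.9+), where `fetch_invoice` is a hypothetical stand-in for a blocking library call:

```python
import asyncio
import time

def fetch_invoice(order_id: int) -> dict:
    # Stand-in for a blocking library call (e.g. requests or a sync DB driver).
    time.sleep(0.1)
    return {"order_id": order_id, "status": "paid"}

async def get_invoice(order_id: int) -> dict:
    # The blocking call runs in a worker thread; the event loop stays free.
    return await asyncio.to_thread(fetch_invoice, order_id)

async def demo() -> tuple[list, float]:
    # Two requests overlap instead of queuing behind each other.
    start = time.perf_counter()
    results = await asyncio.gather(get_invoice(1), get_invoice(2))
    return results, time.perf_counter() - start

results, elapsed = asyncio.run(demo())
print(f"two overlapping calls took {elapsed:.2f}s")
```

Threads are a workaround, not a fix: each offloaded call still occupies a thread, so a slow dependency can exhaust the thread pool instead of the event loop.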
## The sync vs async decision in production
Sync Python APIs are not automatically wrong. For many services, a synchronous stack with enough worker processes is easier to debug, easier to operate, and predictable under load. Async becomes useful when the API has many concurrent I/O-bound requests and the team can keep the whole request path non-blocking.
| Runtime model | Concurrency behavior | Failure mode | Operational complexity | Best fit |
|---|---|---|---|---|
| Sync workers | One request occupies one worker | Worker exhaustion under slow I/O | Low to Medium | CPU-light APIs, simple CRUD, predictable traffic |
| Async workers | One worker can interleave many waiting tasks | Event loop blocked by sync calls | Medium | I/O-heavy APIs with async-compatible dependencies |
| Background jobs | Request delegates work to a queue | Queue delay, retry complexity | Medium to High | Slow side effects, emails, reports, webhooks |
| More replicas | Adds process and connection capacity | Higher database pressure | Medium | Horizontal scaling after bottlenecks are understood |
The table shows why “switch to async” is not a complete plan. If the API is slow because each request runs a bad SQL query, async may increase the number of concurrent bad queries. That can make the database slower and push latency up for everyone.
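One mitigation is an explicit ceiling on concurrent database work, so async concurrency does not translate into unbounded query pressure. A sketch with `asyncio.Semaphore`; the limit of 5 and the simulated query are illustrative assumptions, not recommendations:

```python
import asyncio

async def main() -> tuple[list[int], int]:
    sem = asyncio.Semaphore(5)  # illustrative budget: 5 in-flight queries
    active = 0
    peak = 0

    async def run_query(i: int) -> int:
        nonlocal active, peak
        async with sem:
            active += 1
            peak = max(peak, active)
            await asyncio.sleep(0.01)  # stand-in for an awaited database query
            active -= 1
            return i

    # 20 concurrent requests, but at most 5 touch the "database" at a time.
    results = await asyncio.gather(*(run_query(i) for i in range(20)))
    return results, peak

results, peak = asyncio.run(main())
print(f"20 requests completed, at most {peak} concurrent queries")
```

In a real service the same role is usually played by the connection pool size, which is the next topic; a semaphore is useful when the bottleneck is a remote service rather than your own database.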
## Database pools: the hidden queue inside your API
A database connection pool is not just a configuration detail. It is a concurrency limit. Every request that needs a database connection either gets one or waits.
When p95 latency rises while CPU usage stays moderate, the pool is one of the first places to inspect. A small pool can serialize requests even when the application has many workers. A large pool can overload the database and increase lock contention, memory usage, and query scheduling overhead.
The failure pattern often looks like this:
1. API traffic increases.
2. More requests reach the database layer at the same time.
3. The pool reaches its limit.
4. Requests wait for a connection.
5. Request duration grows.
6. Connections are held longer.
7. The pool stays saturated.
This is a feedback loop, not a simple capacity problem.
A simplified SQLAlchemy-style async setup might make the pool explicit:
```python
from sqlalchemy.ext.asyncio import create_async_engine, async_sessionmaker

engine = create_async_engine(
    "postgresql+asyncpg://app:secret@db/app",
    pool_size=10,
    max_overflow=5,
    pool_timeout=2,
)
SessionLocal = async_sessionmaker(engine, expire_on_commit=False)
```

The right values depend on worker count, database capacity, query duration, transaction length, and whether other services share the same database. The important part is not the specific numbers. The important part is that the pool must be treated as a production control point and observed directly.
Track at least:
time spent waiting for a connection
active vs idle connections
query duration by endpoint
transaction duration
database CPU, locks, and slow query logs
request p95 and p99 latency
If pool wait time is high, increasing the pool may help only if the database has spare capacity. If the database is already overloaded, a larger pool can make the incident worse.
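Pool wait time can be measured directly rather than inferred. A minimal sketch of the measurement pattern, using a plain queue-backed pool as a stand-in for a real driver pool (SQLAlchemy exposes the same signal through pool events); `TimedPool` and its names are invented for illustration:

```python
import queue
import time
from contextlib import contextmanager

class TimedPool:
    def __init__(self, size: int) -> None:
        self._conns = queue.Queue()
        for i in range(size):
            self._conns.put(f"conn-{i}")  # stand-in connection objects
        self.wait_seconds: list[float] = []  # the metric worth exporting

    @contextmanager
    def acquire(self, timeout: float = 2.0):
        start = time.perf_counter()
        conn = self._conns.get(timeout=timeout)  # blocks when pool is empty
        self.wait_seconds.append(time.perf_counter() - start)
        try:
            yield conn
        finally:
            self._conns.put(conn)

pool = TimedPool(size=2)
with pool.acquire() as conn:
    print(f"got {conn} after waiting {pool.wait_seconds[-1] * 1000:.2f} ms")
```

Under no contention the recorded wait is near zero; when every connection is checked out, the wait grows toward the timeout, which is exactly the feedback-loop signal described above.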
## ORM queries: readable code can hide expensive behavior
ORMs make data access easier to write, but they can hide the shape of the actual database work. Slow endpoints often contain one of these patterns:
- N+1 queries from lazy-loaded relationships
- loading full rows when only a few columns are needed
- filtering in Python after fetching too much data
- missing pagination on list endpoints
- serializing deeply nested relationship graphs
- long transactions around unrelated work
The application code may look harmless:
```python
# Bad: likely N+1 query pattern
@app.get("/customers")
async def list_customers(session: AsyncSession):
    customers = await session.scalars(select(Customer))
    return [
        {
            "id": customer.id,
            "email": customer.email,
            "orders": [order.id for order in customer.orders],
        }
        for customer in customers
    ]
```

If `customer.orders` is lazy-loaded, this can trigger one query for the customers and then one query per customer. On a small local database, the endpoint looks acceptable. In production, network round trips, locks, and row counts expose the real cost.
A better approach is to make the query shape intentional:
```python
# Better: explicit loading and bounded result size
from sqlalchemy import select
from sqlalchemy.orm import selectinload

@app.get("/customers")
async def list_customers(session: AsyncSession, limit: int = 100):
    stmt = (
        select(Customer)
        .options(selectinload(Customer.orders))
        .order_by(Customer.id.desc())
        .limit(min(limit, 100))
    )
    result = await session.scalars(stmt)
    customers = result.all()
    return [
        {
            "id": customer.id,
            "email": customer.email,
            "order_ids": [order.id for order in customer.orders],
        }
        for customer in customers
    ]
```

This does not mean every query should eager-load everything. It means every endpoint should have a deliberate data access pattern. A detail endpoint, a list endpoint, and an export endpoint have different query shapes and should not reuse the same broad "load the object graph" approach.
## JSON serialization can become the endpoint
Once database queries are optimized, the next bottleneck is often response construction. JSON serialization is CPU work. It can dominate latency when the endpoint returns large lists, nested objects, or models with expensive computed fields.
This is especially common when teams return ORM entities directly and rely on the framework to figure out the response. That can create accidental coupling between persistence models and API contracts.
Prefer explicit response shapes:
```python
# Better: return only the fields the API contract needs
@app.get("/orders")
async def list_orders(session: AsyncSession):
    rows = await session.execute(
        select(Order.id, Order.status, Order.total_cents)
        .order_by(Order.created_at.desc())
        .limit(100)
    )
    return [
        {
            "id": row.id,
            "status": row.status,
            "total_cents": row.total_cents,
        }
        for row in rows
    ]
```

This reduces database work, Python object construction, validation cost, and JSON size. It also makes the API contract more stable. The database model can change without accidentally changing the response payload.
For large responses, also consider:
- pagination or cursor-based navigation
- field selection for expensive nested data
- streaming only when the client can consume it safely
- moving exports to background jobs
- caching stable computed responses
- compressing responses at the edge when appropriate
The point is not to avoid JSON. The point is to stop treating serialization as free.
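The cost is easy to make visible. A throwaway sketch timing `json.dumps` on a narrow versus a wide payload of the same row count; the field names and sizes are invented for illustration:

```python
import json
import time

# Two response shapes for the same 5,000 rows.
narrow = [{"id": i, "status": "paid"} for i in range(5000)]
wide = [{"id": i, "status": "paid", "notes": "x" * 500} for i in range(5000)]

def time_dumps(payload) -> float:
    # Wall-clock time to serialize the payload once.
    start = time.perf_counter()
    json.dumps(payload)
    return time.perf_counter() - start

narrow_s = time_dumps(narrow)
wide_s = time_dumps(wide)
narrow_bytes = len(json.dumps(narrow))
wide_bytes = len(json.dumps(wide))
print(f"narrow: {narrow_bytes} bytes in {narrow_s * 1000:.1f} ms")
print(f"wide:   {wide_bytes} bytes in {wide_s * 1000:.1f} ms")
```

The absolute numbers depend on the machine and the JSON library, but the ratio between the two shapes is the point: the extra fields are paid for in CPU time and bytes on the wire on every single request.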
## A practical debugging sequence
When an endpoint is slow, start with the request path rather than the framework brand. Measure each segment before changing architecture.
A useful sequence is:
1. Confirm whether latency is network, application, database, or serialization heavy.
2. Check whether async endpoints call sync libraries.
3. Measure database pool wait time, not just query time.
4. Inspect the exact SQL generated by ORM code.
5. Count queries per request.
6. Measure response payload size and serialization time.
7. Separate slow side effects from the request path.
8. Re-test under realistic concurrency, not only single-request local tests.
This order prevents expensive rewrites. It also keeps the team focused on the bottleneck with the highest operational impact.
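The query-count step can be enforced rather than eyeballed. A sketch of the pattern with a plain wrapper standing in for a real driver hook (SQLAlchemy exposes the equivalent signal through engine events); `CountingSession` and `assert_max_queries` are invented names:

```python
from contextlib import contextmanager

class CountingSession:
    def __init__(self) -> None:
        self.query_count = 0

    def execute(self, sql: str) -> None:
        self.query_count += 1  # a real hook would also record SQL and duration

@contextmanager
def assert_max_queries(session: CountingSession, limit: int):
    # Fails the test if the wrapped code exceeds its query budget.
    before = session.query_count
    yield
    used = session.query_count - before
    assert used <= limit, f"{used} queries issued, budget was {limit}"

session = CountingSession()
with assert_max_queries(session, limit=2):
    session.execute("SELECT * FROM customers LIMIT 100")
    session.execute("SELECT * FROM orders WHERE customer_id = ANY(...)")
```

A budget like this in a performance-sensitive test catches N+1 regressions at review time instead of in production dashboards.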
For example, a 900 ms endpoint might be spending time like this:
| Segment | What to inspect | Typical fix direction |
|---|---|---|
| Pool wait | Connection acquisition time | Tune worker and pool counts, shorten transactions |
| SQL execution | Slow query log, query plan | Add index, change query shape, reduce joins |
| ORM hydration | Rows loaded, relationships loaded | Select fewer columns, avoid broad object graphs |
| Remote calls | Timeout, retries, dependency latency | Use async client, set budgets, move side effects |
| JSON serialization | Payload size, nested models | Reduce response shape, paginate, cache |
The exact numbers will vary by workload, but the diagnostic structure stays useful.
## What to adopt first
For most teams, the highest return changes are not dramatic:
1. Put timeouts on every outbound call.
2. Make blocking calls visible in code review.
3. Track database pool wait time.
4. Log slow queries with endpoint context.
5. Add query count checks to performance-sensitive tests.
6. Limit list endpoint payloads by default.
7. Keep API response models separate from ORM entities.
8. Move slow side effects to queues when the client does not need the result immediately.
Async migration should come after this analysis, not before it. Otherwise, the team risks making the runtime more complex while preserving the original bottlenecks.
If you work with Python APIs in production and want a structured way to validate senior-level backend judgment, the Senior Python Developer certification is the most relevant DevCerts track to review.
## Conclusion
A slow Python API is usually slow at the boundaries. The endpoint waits for connections, runs inefficient queries, blocks the event loop, creates too many Python objects, or serializes more JSON than the client needs.
The practical answer is not “use async” or “avoid async.” The answer is to understand where time is spent, keep the concurrency model consistent, make database access explicit, control pool behavior, and treat serialization as real work. Once those pieces are visible, performance work becomes engineering rather than guessing.