LLM features fail in different ways from normal backend code. They can return malformed JSON, ignore part of the prompt, hallucinate missing context, exceed latency budgets, hit provider limits, or produce a technically valid answer that is unusable for the product. The mistake is not using AI in a Python project. The mistake is treating the model call as if it were a deterministic function.
A stable AI service is built around explicit boundaries. Retrieval is separated from generation. Prompts are versioned. Responses are validated. Calls have timeouts, retry policy, fallback behavior, rate limits, and observability. The model remains useful, but it stops being magical infrastructure hidden behind one generate_answer() function.
## The real problem: not AI, but implicit behavior
A common first implementation looks compact:
```python
async def answer_question(question: str) -> str:
    docs = await vector_store.search(question)
    prompt = f"Use these documents:\n{docs}\n\nQuestion: {question}"
    return await llm.generate(prompt)
```

This works in a demo because the happy path is overrepresented. In production, it creates several problems at once:
- Retrieval quality is invisible.
- Prompt changes are hard to trace.
- The model can return any format.
- There is no clear retry boundary.
- Timeouts and provider errors leak into the user experience.
- Logs do not explain whether the issue was search, model latency, malformed output, or missing context.
- Testing becomes snapshot-based guesswork.
The production question is not “Can the LLM answer?” It is “Can the system behave predictably when the LLM does not?”
The model should be inside a controlled boundary, not at the center of your application architecture.
## Direct LLM call vs production AI boundary
| Aspect | Direct LLM call | Production AI boundary |
|---|---|---|
| Runtime behavior | Non-deterministic, loosely controlled | Non-deterministic, constrained by contracts |
| Request isolation | Depends on caller | Explicit per-request context |
| Failure isolation | Low | Medium to high |
| Retry behavior | Usually absent or ad hoc | Centralized and limited |
| Response shape | Free-form text | Validated schema |
| Logging | Prompt and response dumps, often unsafe | Structured metadata with redaction |
| Cost control | Weak | Token, concurrency, and route-level limits |
| Testing model | Manual checks | Retrieval tests, schema tests, fallback tests |
| Operational overhead | Low at first | Higher, but predictable |
| Best for | Prototype, internal demo | User-facing feature, workflow automation, support tooling |
The second approach does not remove uncertainty. It contains it.
## Start with RAG as an explicit pipeline
Retrieval-Augmented Generation is often described as “put documents into the prompt.” That is too vague for production. A useful RAG pipeline has at least four explicit stages:
1. Normalize the user input.
2. Retrieve candidate chunks with metadata.
3. Build a bounded prompt from selected chunks.
4. Validate the answer against an output contract.
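The normalization stage is easy to overlook. As a minimal sketch using only the standard library, the first stage might look like this; the exact cleanup rules and the length cap are hypothetical and product-specific:

```python
import re


def normalize_query(raw: str, max_length: int = 500) -> str:
    """Collapse whitespace and bound query length before retrieval.

    A hypothetical first stage; real systems may also strip markup,
    detect language, or reject abusive input here.
    """
    cleaned = re.sub(r"\s+", " ", raw).strip()
    if not cleaned:
        raise ValueError("Query is empty after normalization")
    return cleaned[:max_length]
```

Rejecting empty or oversized input here, rather than after an embedding call, keeps the failure cheap and the log entry unambiguous.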
A practical retrieval result should carry identifiers, source metadata, score information, and content. Do not pass raw database rows or arbitrary blob text directly into the prompt.
```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RetrievedChunk:
    chunk_id: str
    document_id: str
    title: str
    text: str
    score: float


def build_rag_prompt(question: str, chunks: list[RetrievedChunk]) -> str:
    context_blocks = []
    for index, chunk in enumerate(chunks, start=1):
        context_blocks.append(
            f"[{index}] chunk_id={chunk.chunk_id}, title={chunk.title}\n{chunk.text}"
        )
    context = "\n\n".join(context_blocks)
    return f"""
You answer using only the provided context.
If the context is insufficient, say that the answer is not available in the provided material.
Return valid JSON only.

Context:
{context}

Question:
{question}
""".strip()
```

This prompt is not perfect, but it has useful properties. It makes source boundaries visible. It discourages unsupported answers. It gives the validator a predictable output format. It also gives logs something concrete to record, such as chunk IDs and retrieval scores, without storing entire sensitive documents.
## Validate model output like external input
An LLM response should be treated like input from an untrusted external service. It may be malformed, incomplete, too verbose, or structurally valid but semantically unacceptable.
Use a schema for the response. Keep it small. Validate it before returning anything to the caller.
```python
from pydantic import BaseModel, Field, ValidationError


class AnswerResponse(BaseModel):
    answer: str = Field(min_length=1, max_length=2000)
    confidence: str
    cited_chunk_ids: list[str]


def parse_answer(raw_response: str) -> AnswerResponse:
    try:
        parsed = AnswerResponse.model_validate_json(raw_response)
    except ValidationError as exc:
        raise ValueError("LLM response did not match the expected schema") from exc

    allowed_confidence = {"low", "medium", "high"}
    if parsed.confidence not in allowed_confidence:
        raise ValueError("LLM response used an unsupported confidence value")
    return parsed
```

Validation is not only about JSON correctness. It is also where product rules belong. For example:
- The answer must cite at least one retrieved chunk.
- The cited chunk IDs must exist in the retrieval result.
- The answer must not exceed a display limit.
- A low-confidence answer may require a different UI state.
- Certain workflows may require human review before execution.
This is where many teams go wrong. They try to solve application constraints with prompt wording. Prompts help, but they are not enforcement.
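Enforcement belongs in code, not in prompt wording. A minimal sketch of such rule checks, with hypothetical limits and a plain-argument signature so it stays independent of any schema library:

```python
def enforce_answer_rules(
    answer_text: str,
    cited_chunk_ids: list[str],
    retrieved_ids: set[str],
    display_limit: int = 1200,
) -> None:
    """Apply product rules the prompt cannot guarantee.

    The rule set and the display limit are illustrative; raise on
    violation so the caller routes to a fallback state instead of
    surfacing an unverified answer.
    """
    if not cited_chunk_ids:
        raise ValueError("Answer must cite at least one retrieved chunk")
    unknown = set(cited_chunk_ids) - retrieved_ids
    if unknown:
        raise ValueError(f"Answer cites unknown chunk IDs: {sorted(unknown)}")
    if len(answer_text) > display_limit:
        raise ValueError("Answer exceeds the display limit")
```

Because the check raises a normal exception, the caller decides what happens next: retry with a repair prompt, degrade to a safe state, or queue for review.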
## Retries should be narrow, bounded, and classified
Retries are useful for transport failures, temporary provider errors, and rate-limit responses where retrying is allowed. They are not a fix for bad retrieval, weak prompts, or invalid business logic.
A retry policy should answer four questions:
1. Which errors are retryable?
2. How many attempts are allowed?
3. What is the maximum request deadline?
4. What happens after the final failure?
```python
import asyncio
import random
from collections.abc import Awaitable, Callable


class RetryableLLMError(Exception):
    pass


class NonRetryableLLMError(Exception):
    pass


async def call_with_retries(
    operation: Callable[[], Awaitable[str]],
    *,
    attempts: int = 3,
    base_delay_seconds: float = 0.4,
) -> str:
    last_error: Exception | None = None
    for attempt in range(1, attempts + 1):
        try:
            return await operation()
        except RetryableLLMError as exc:
            last_error = exc
            if attempt == attempts:
                break
            jitter = random.uniform(0, 0.2)
            delay = base_delay_seconds * attempt + jitter
            await asyncio.sleep(delay)
    raise RuntimeError("LLM call failed after retry limit") from last_error
```

Keep retries close to the integration layer, not scattered across controllers, jobs, and UI handlers. Also avoid retrying full workflows blindly. If retrieval succeeded, validation failed, and then the whole request is repeated, you may increase cost while producing the same invalid output.
## Fallback is a product decision, not an exception handler
A fallback is not just “try another model.” Sometimes that is appropriate, but it is only one option. In many systems, fallback should be tied to product behavior.
Common fallback strategies include:
| Failure case | Fallback behavior | Operational consequence |
|---|---|---|
| Retrieval returns no useful chunks | Return “not enough context” state | Lower hallucination risk |
| Model timeout | Return cached answer if available | Requires cache invalidation policy |
| Invalid JSON response | Retry once with stricter repair prompt or return safe failure | Higher latency if repaired |
| Provider rate limit | Queue request or degrade feature | More predictable cost control |
| Low confidence | Require human review | Higher workflow latency |
| Safety or policy violation | Block response and log event type | Requires review process |
A fallback should preserve trust. Returning a generic answer when context is missing is often worse than returning no answer. For support tools, internal search, and compliance-sensitive workflows, an explicit “not enough information” result is a valid successful outcome.
## Add limits before traffic arrives
LLM calls are expensive compared with normal application code. They are also easy to multiply accidentally. One user action can trigger retrieval, summarization, answer generation, validation repair, and logging enrichment.
Useful limits include:
- Maximum input length before retrieval.
- Maximum retrieved chunks.
- Maximum prompt size.
- Maximum generated output length.
- Per-user or per-tenant request limits.
- Concurrency limits per worker.
- Route-level timeouts.
- Queue limits for asynchronous workflows.
Concurrency control is often more useful than optimistic retries. A simple semaphore can protect the application from flooding an external dependency.
```python
import asyncio

llm_concurrency = asyncio.Semaphore(8)


async def limited_llm_call(prompt: str) -> str:
    async with llm_concurrency:
        return await llm.generate(
            prompt=prompt,
            timeout_seconds=20,
            max_output_tokens=700,
        )
```

The exact values must be tuned for the workload, provider limits, latency budget, and infrastructure. The important part is that limits are explicit and owned by the application, not discovered after the service degrades.
## Log decisions, not just text
Logging full prompts and responses may help during early development, but it creates privacy, cost, and noise problems. In production, structured logs should explain the path a request took.
A useful log event for an AI request might include:
```python
logger.info(
    "rag_answer_completed",
    extra={
        "request_id": request_id,
        "tenant_id": tenant_id,
        "prompt_version": "support_rag_v4",
        "retrieved_chunk_ids": [chunk.chunk_id for chunk in chunks],
        "retrieval_count": len(chunks),
        "llm_provider": provider_name,
        "model_alias": model_alias,
        "attempts": attempts_used,
        "fallback_used": fallback_name,
        "response_valid": True,
        "confidence": answer.confidence,
    },
)
```

Avoid logging raw user input, full documents, credentials, secrets, or unredacted model output unless there is a deliberate retention and access policy. For most teams, metadata is more valuable than raw text when debugging production behavior.
The logs should help answer operational questions:
- Did retrieval find relevant context?
- Did the model call time out?
- Was fallback used?
- Which prompt version produced the answer?
- Did validation fail?
- Did one tenant or route consume unusual volume?
- Did latency increase at retrieval or generation?
Without this data, teams tend to “debug the prompt” even when the real issue is retrieval quality, rate limits, queue saturation, or malformed downstream data.
## Testing: isolate what can be deterministic
LLM output is not fully deterministic, but much of the system around it can be tested.
Good test targets include:
- Query normalization.
- Retrieval filters.
- Prompt construction.
- Chunk selection and ordering.
- Schema validation.
- Retry classification.
- Fallback routing.
- Logging metadata.
- Authorization around document access.
- UI states for low confidence or missing context.
Do not make every test depend on a live model call. Use fixtures for model responses, including malformed JSON, missing fields, unsupported confidence values, empty answers, and citations to unknown chunk IDs.
The most useful tests are often not “the model answered correctly.” They are “the service did not return an unsupported answer as if it were verified.”
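A fixture set for malformed responses can be checked without any model call. The `is_acceptable` helper below is a hypothetical standard-library stand-in for a schema validator, so the fixtures themselves stay portable:

```python
import json

# Fixtures covering the failure modes above: broken JSON, empty answers,
# unsupported confidence values, and missing citation fields.
MALFORMED_FIXTURES = [
    "not json at all",
    '{"answer": ""}',
    '{"answer": "ok", "confidence": "certain", "cited_chunk_ids": []}',
    '{"answer": "ok", "confidence": "high"}',
]


def is_acceptable(raw: str) -> bool:
    """Minimal stand-in for the schema validator: JSON object with a
    non-empty answer, a supported confidence value, and a citation list."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict):
        return False
    answer = data.get("answer")
    if not isinstance(answer, str) or not answer:
        return False
    if data.get("confidence") not in {"low", "medium", "high"}:
        return False
    return isinstance(data.get("cited_chunk_ids"), list)


# Every malformed fixture must be rejected, never surfaced to a user.
assert not any(is_acceptable(raw) for raw in MALFORMED_FIXTURES)
```

The same fixtures can later be replayed against the real pydantic validator; the point is that rejection behavior is tested deterministically, with no live model in the loop.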
## A practical adoption sequence
For an existing Python project, avoid rebuilding the entire AI layer in one pass. Start by introducing boundaries around the riskiest behavior.
A sensible order is:
1. Wrap all LLM calls in one integration module.
2. Add timeouts and bounded retries.
3. Define response schemas for user-facing outputs.
4. Store prompt versions as named templates.
5. Add structured logs with request IDs and retrieval metadata.
6. Add explicit fallback states.
7. Add per-route and per-tenant limits.
8. Build tests for validation, fallback, and retrieval edge cases.
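Storing prompt versions as named templates can be as simple as a dictionary keyed by version string. The registry below is a sketch; `support_rag_v3` is a hypothetical older version, while `support_rag_v4` matches the version string used in the logging example:

```python
# Hypothetical registry: prompts are named and versioned, so logs can
# record exactly which template produced a given answer.
PROMPT_TEMPLATES: dict[str, str] = {
    "support_rag_v3": "Answer using only the context below.\n{context}\n\nQ: {question}",
    "support_rag_v4": (
        "You answer using only the provided context.\n"
        "If the context is insufficient, say so.\n"
        "Return valid JSON only.\n\nContext:\n{context}\n\nQuestion:\n{question}"
    ),
}


def render_prompt(version: str, *, context: str, question: str) -> str:
    try:
        template = PROMPT_TEMPLATES[version]
    except KeyError:
        raise ValueError(f"Unknown prompt version: {version}") from None
    return template.format(context=context, question=question)
```

An unknown version fails loudly instead of silently falling back to whatever string happens to be inlined at the call site.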
This sequence improves reliability without blocking feature work. It also gives the team better data for later decisions, such as whether to tune retrieval, change the prompt, add caching, use asynchronous jobs, or split model usage by task type.
For engineers who work with Python services in production and want to validate senior-level backend judgment beyond syntax, the Senior Python Developer certification is the most relevant DevCerts path to review.
## Conclusion
AI features become unstable when the LLM is allowed to blur architecture boundaries. A Python service should not depend on the model being consistently obedient, fast, cheap, and well-formatted. It should assume the opposite and still behave predictably.
Treat the LLM as an external dependency with uncertain output. Keep RAG explicit. Validate responses. Retry only classified failures. Design fallback as product behavior. Add limits before scale exposes the problem. Log decisions, not just text.
That is the difference between a demo that happens to work and a service a team can operate, debug, and improve over time.