Module 10: Production Hardening
The difference between a demo and a system you rely on every day is how it behaves when things go wrong.
This is Module 10 of a 12-part curriculum: Build Software Products with AI — From First Principles to Production Pipeline.
Every AI agent demo works. They’re designed to. You see the happy path: the right tools called in the right order, the right output returned, the right action taken.
Production is different. In production, the API returns a 429. The file the agent expected isn’t there. The model output is malformed JSON that breaks downstream parsing. The context window fills up on a longer-than-expected task. The agent calls a destructive tool when it should have asked first.
Production hardening is the engineering discipline of anticipating these failures and building systems that handle them gracefully. This module covers the most important failure modes and how to address them.
Failure Handling
The first rule: never let failures fail silently.
When an agent encounters an error, it should:
- Catch the specific error type
- Log the error with context (what was attempted, what failed, what the error was)
- Decide whether to retry, escalate, or degrade gracefully
- Communicate the outcome — either to the user or to a log
What not to do: swallow errors and continue as if nothing happened. Silent failures cause downstream confusion, incorrect state, and hard-to-debug problems.
Build explicit error handling into every tool:
```typescript
import fs from 'node:fs/promises'

interface ToolResult {
  success: boolean
  output?: string
  error?: string
}

async function readFile(path: string): Promise<ToolResult> {
  try {
    const contents = await fs.readFile(path, 'utf-8')
    return { success: true, output: contents }
  } catch (err: any) {
    // Distinguish the common, recoverable case from generic failures
    if (err.code === 'ENOENT') {
      return {
        success: false,
        error: `File not found: ${path}. Verify the path exists before retrying.`
      }
    }
    return {
      success: false,
      error: `Failed to read ${path}: ${err.message}`
    }
  }
}
```
The error message includes enough context for the model to understand what went wrong and decide what to do next. “Error” alone is useless. “File not found: /path/to/file.ts. Verify the path exists before retrying.” is actionable.
Retry Logic
Some failures are transient: API rate limits, network timeouts, temporary service unavailability. These warrant retries. Others are permanent: a malformed request, invalid credentials, a file that genuinely doesn't exist. These should fail fast.
A simple exponential backoff pattern:
```typescript
const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms))

async function withRetry<T>(
  fn: () => Promise<T>,
  maxAttempts = 3,
  baseDelayMs = 1000
): Promise<T> {
  let lastError: unknown
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await fn()
    } catch (err: any) {
      lastError = err
      // Don't retry permanent failures
      if (err.status === 400 || err.status === 401 || err.status === 403) {
        throw err
      }
      // Back off exponentially: 1s, 2s, 4s, ...
      if (attempt < maxAttempts) {
        const delay = baseDelayMs * Math.pow(2, attempt - 1)
        await sleep(delay)
      }
    }
  }
  throw lastError
}
```
Retry logic should respect Retry-After headers from APIs that return them. If an API says “retry in 60 seconds,” use that value — not your own backoff calculation.
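A sketch of that header handling, which could replace the fixed backoff calculation in the loop above. It assumes the caught error exposes the HTTP response headers as `err.headers` (adjust for your client); the header can be either delta-seconds or an HTTP date:

```typescript
// Prefer the server's Retry-After value when present; otherwise fall back
// to our own exponential backoff. Assumes `err.headers` holds response headers.
function retryDelayMs(err: any, attempt: number, baseDelayMs: number): number {
  const retryAfter = err?.headers?.['retry-after']
  if (retryAfter) {
    const seconds = Number(retryAfter)
    if (!Number.isNaN(seconds)) return seconds * 1000      // delta-seconds form
    const date = Date.parse(retryAfter)                    // HTTP-date form
    if (!Number.isNaN(date)) return Math.max(0, date - Date.now())
  }
  return baseDelayMs * Math.pow(2, attempt - 1)
}
```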
Fallback Models
Model APIs occasionally have outages, elevated latency, or rate limiting. Production systems should have fallback models.
A common pattern:
- Primary: Claude Sonnet (fast, capable, main workload)
- Fallback: Claude Haiku (faster, cheaper, for lighter tasks when Sonnet is unavailable)
- Local fallback: Qwen3:30b via Ollama (runs locally, available even when all cloud APIs are down)
Implement this as model selection logic that degrades gracefully:
```typescript
type ModelTier = 'primary' | 'fallback' | 'local'

async function callModel(prompt: string, tier: ModelTier = 'primary') {
  const models: Record<ModelTier, string> = {
    primary: 'claude-sonnet-4',
    fallback: 'claude-haiku-4',
    local: 'ollama/qwen3:30b'
  }
  try {
    // `llm` is your model client wrapper
    return await llm.call(models[tier], prompt)
  } catch (err) {
    // Degrade one tier at a time; only give up when local fails too
    if (tier === 'primary') {
      console.warn('Primary model unavailable, falling back')
      return callModel(prompt, 'fallback')
    }
    if (tier === 'fallback') {
      console.warn('Fallback model unavailable, using local')
      return callModel(prompt, 'local')
    }
    throw err
  }
}
```
For sensitive tasks (personal data, financial data, private project logic), local models aren’t just a fallback — they’re the primary. Don’t route sensitive context through cloud APIs.
Approval Gates
Not every agent action should be automatic. Some actions are irreversible or high-stakes and should require human confirmation before execution.
Common candidates for approval gates:
- Deleting or overwriting files
- Sending emails or public messages
- Making payments or API calls with financial consequences
- Merging code to main
- Executing destructive database operations
The pattern: before executing a gated action, the agent presents what it’s about to do and waits for explicit approval. If approval is granted, proceed. If rejected, stop and report.
```
Agent: I'm about to send the following email to the product team:
[email preview]
Approve? (yes/no)
```
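A minimal sketch of that gate in code, assuming a terminal session and Node's readline for the prompt (the `GatedAction` shape is illustrative):

```typescript
import * as readline from 'node:readline/promises'

interface GatedAction {
  description: string          // human-readable preview of what will happen
  execute: () => Promise<void> // the actual side effect
}

// Present the action, wait for explicit approval, and only then execute.
// Anything other than an exact "yes" is treated as a rejection.
async function withApproval(action: GatedAction): Promise<void> {
  const rl = readline.createInterface({ input: process.stdin, output: process.stdout })
  try {
    const answer = await rl.question(`About to: ${action.description}\nApprove? (yes/no) `)
    if (answer.trim().toLowerCase() === 'yes') {
      await action.execute()
    } else {
      console.log('Action rejected; nothing was executed.')
    }
  } finally {
    rl.close()
  }
}
```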
For automated pipelines where human-in-the-loop isn’t practical, consider:
- Dry-run mode: run the full logic but simulate the side effects, returning what would have happened (see the sketch after this list)
- Rollback capability: every action that writes state has a corresponding undo
- Audit logging: every action taken is logged with enough detail to reconstruct what happened
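Dry-run and audit logging can share one execution path. A sketch, reusing the illustrative `GatedAction` shape from above (the audit file path and record fields are assumptions, not a fixed format):

```typescript
import { appendFile } from 'node:fs/promises'

interface ActionRecord {
  timestamp: string
  description: string
  dryRun: boolean
  outcome: 'simulated' | 'executed' | 'failed'
}

// In dry-run mode, record what would have happened without executing.
// Either way, append an audit line so the run can be reconstructed.
async function runAction(action: GatedAction, dryRun: boolean): Promise<void> {
  let outcome: ActionRecord['outcome'] = 'simulated'
  let failure: unknown
  if (!dryRun) {
    try {
      await action.execute()
      outcome = 'executed'
    } catch (err) {
      outcome = 'failed'
      failure = err
    }
  }
  const record: ActionRecord = {
    timestamp: new Date().toISOString(),
    description: action.description,
    dryRun,
    outcome
  }
  await appendFile('logs/audit.jsonl', JSON.stringify(record) + '\n')
  if (failure) throw failure // log first, then surface the error: never fail silently
}
```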
Context Window Management
Long agent runs fill up context windows. When the context window is full, one of two things happens: the model starts losing older context (a sliding window), or the API returns an error.
Strategies for managing long-running context:
Summarise and compact. Periodically summarise what’s been done so far into a compact representation. Discard the raw interaction history. Continue with the summary as context.
Strategic compaction. At natural milestone boundaries (finished a phase, completed a major sub-task), compact the context before proceeding to the next phase. Don’t wait until you’re forced to — compact at a time of your choosing, while you still have working memory.
Session segmentation. Break long tasks into multiple sessions with file-based handoffs. Session A does research and writes a summary file. Session B reads the summary file and does analysis. Cleaner than one session trying to do everything.
Context budgeting. Before injecting large documents or long histories, estimate their token cost and decide if the benefit justifies the context spend.
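A rough sketch of that budgeting check, using the common four-characters-per-token heuristic (real tokenizers vary, and the reserve figure is illustrative):

```typescript
// Crude token estimate: ~4 characters per token for English text.
// Good enough for budgeting decisions, not for billing.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4)
}

// Decide whether a document is worth its context spend before injecting it,
// keeping a reserve so the model still has room to work.
function fitsBudget(doc: string, remainingBudgetTokens: number, reserveTokens = 8000): boolean {
  return estimateTokens(doc) <= remainingBudgetTokens - reserveTokens
}
```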
Sensitive Data Policies
Some data should never leave your machine. Financial records, health data, private business logic, credentials — these belong in a local-only pipeline.
A practical policy:
Local-only (Ollama, never cloud APIs):
- Any file containing real credentials
- Financial data (bank statements, spending data, net worth)
- Health data (medical records, health metrics)
- Private business strategy before it's disclosed
Cloud APIs acceptable:
- General reasoning, writing, analysis
- Public codebase work
- Non-sensitive content creation
Implement this as a routing rule in your orchestrator: if a task involves sensitive data categories, route to local models. Document the policy so you don’t accidentally route something sensitive to a cloud API in a hurry.
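A sketch of that routing rule, assuming the tiers from callModel above and that tasks arrive already tagged with data categories (real classification is the hard part):

```typescript
type DataCategory = 'credentials' | 'financial' | 'health' | 'private-strategy' | 'general'

const LOCAL_ONLY: DataCategory[] = ['credentials', 'financial', 'health', 'private-strategy']

// Route to the local tier whenever any sensitive category is involved.
function selectTier(categories: DataCategory[]): 'primary' | 'local' {
  return categories.some((c) => LOCAL_ONLY.includes(c)) ? 'local' : 'primary'
}

// Usage: callModel(prompt, selectTier(['financial']))  // always routes local
```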
Monitoring and Observability
You can’t improve what you can’t observe. Build in logging from the start.
What to log:
- Every model call: timestamp, model, token count, latency, error if any
- Every tool call: timestamp, tool name, parameters, result, duration
- Every background job run: timestamp, job name, success/failure, output summary
- Every agent decision that has side effects
Where to store logs:
- For personal systems: plain log files with date rotation (logs/2026-05-09.log)
- For product systems: structured logging to a service like Datadog, Logflare, or Axiom
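A minimal sketch of the model-call log, written as JSONL so it is both greppable and parseable later (field names are illustrative):

```typescript
import { appendFile } from 'node:fs/promises'

// One JSON line per model call; the date in the filename gives free rotation.
async function logModelCall(entry: {
  model: string
  promptTokens: number
  completionTokens: number
  latencyMs: number
  error?: string
}): Promise<void> {
  const date = new Date().toISOString().slice(0, 10)
  const line = JSON.stringify({ timestamp: new Date().toISOString(), ...entry })
  await appendFile(`logs/${date}.log`, line + '\n')
}
```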
What to alert on:
- Repeated failures in the same tool or job
- Token usage spikes (could indicate a runaway loop)
- Latency degradation
- Unexpected tool calls (a coding agent calling a payment API shouldn’t happen)
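None of these alerts needs infrastructure. With JSONL logs like the sketch above, a repeated-failure check is a short script run on a schedule (field names and threshold are illustrative):

```typescript
import { readFile as readLog } from 'node:fs/promises'

// Flag any model or tool that failed `threshold` or more times in today's log.
async function checkRepeatedFailures(threshold = 3): Promise<string[]> {
  const date = new Date().toISOString().slice(0, 10)
  const raw = await readLog(`logs/${date}.log`, 'utf-8')
  const counts = new Map<string, number>()
  for (const line of raw.split('\n').filter(Boolean)) {
    const entry = JSON.parse(line)
    if (entry.error) {
      const key = entry.model ?? entry.tool ?? 'unknown'
      counts.set(key, (counts.get(key) ?? 0) + 1)
    }
  }
  return [...counts.entries()].filter(([, n]) => n >= threshold).map(([name]) => name)
}
```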
You don’t need a sophisticated observability stack for a personal agent. A daily log file you can grep is sufficient. What you do need is the habit of looking at it when something seems off.
Testing Your Agents
Agent testing is genuinely hard. The outputs are non-deterministic, the failure modes are complex, and “it worked in the demo” is not a test suite.
Practical approaches:
Unit tests for tools. Tools are functions. Test them like functions. Given this input, does the tool return the expected output? Does it handle error cases correctly?
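For example, the readFile tool from earlier can be tested with Node's built-in test runner, no framework required. The error case is the interesting one:

```typescript
import { test } from 'node:test'
import assert from 'node:assert'

// A missing file should produce a structured, actionable error,
// not a thrown exception.
test('readFile reports missing files as a structured error', async () => {
  const result = await readFile('/path/that/does/not/exist.ts')
  assert.strictEqual(result.success, false)
  assert.match(result.error!, /File not found/)
})
```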
Prompt regression tests. Keep a library of test prompts with expected behavior profiles. Run them periodically to catch regressions when you update system prompts. Not exact output matching — behavioral matching. “Does it still route correctly? Does it still use the right tool? Does the output still have the right structure?”
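A sketch of one such behavioural check, assuming a hypothetical runAgent harness that returns the tool calls made during a run (the harness and its shape are assumptions):

```typescript
import { test } from 'node:test'
import assert from 'node:assert'

// Behavioural, not exact-match: assert which tool was used,
// not the wording of the response. `runAgent` is a hypothetical harness.
test('file questions still route to the read_file tool', async () => {
  const run = await runAgent('What does src/index.ts export?')
  assert.ok(run.toolCalls.some((call: { name: string }) => call.name === 'read_file'))
})
```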
Dry-run modes. For background jobs, implement a --dry-run flag that runs the full logic but doesn’t write anything. Review the planned actions before letting the job run live.
Chaos testing. Deliberately introduce failures: missing files, API errors, malformed input. Verify the agent fails gracefully rather than catastrophically.
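The simplest chaos harness is an injected flaky function. This one exercises the withRetry helper from earlier: two simulated transient failures, then success:

```typescript
import { test } from 'node:test'
import assert from 'node:assert'

// Two injected timeouts, then success: withRetry should absorb both.
test('withRetry recovers from transient failures', async () => {
  let calls = 0
  const flaky = async () => {
    calls++
    if (calls < 3) throw Object.assign(new Error('timeout'), { status: 503 })
    return 'ok'
  }
  assert.strictEqual(await withRetry(flaky, 3, 1), 'ok') // 1ms base delay keeps the test fast
  assert.strictEqual(calls, 3)
})
```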
The Reliability Mindset
Production hardening isn’t a checklist — it’s a mindset.
Every time you build a new agent capability, ask:
- What happens if this fails?
- What happens if this runs twice?
- What happens if the model output is malformed?
- What’s the worst action this could take, and do I have a gate on it?
- If I looked at the logs, could I tell what happened?
Agents that handle failure well feel trustworthy. Agents that fail opaquely erode trust quickly. And an agent you don’t trust, you don’t use.
What’s Next
Your agent is hardened: it handles failures, retries intelligently, routes sensitive data correctly, has approval gates on dangerous actions, and maintains good logs. In Module 11, we put it all together: a practical, end-to-end walkthrough of setting up your own system from scratch.
Further Reading
- [Pattern] Exponential Backoff — Google Cloud Documentation — The canonical implementation reference for exponential backoff with jitter. Language-agnostic.
- [Blog] The Prompt Injection Problem — Simon Willison — The most important security concern for production agents. Read this before deploying anything public-facing.
- [Docs] Anthropic Safety and Responsible Use — Anthropic's own guidance on building safe agents. Covers prompt injection, jailbreaking, and output validation.
- [Paper] OWASP Top 10 for LLM Applications — The security community's authoritative list of LLM application vulnerabilities. Essential for production hardening.
- [Docs] Anthropic Rate Limits — Current rate limits for the Anthropic API. Know these before you design concurrent agent workloads.
- [Course] MIT 6.5940 - TinyML and Efficient Deep Learning — Covers running models efficiently at the edge and locally. Relevant when you're optimising inference costs or adding local model fallbacks to a production system.
Subscribe to get notified when new modules and courses drop. No drip — just updates when there's something worth reading.