Module 10: Production Hardening
The difference between a demo and a system you rely on every day is how it behaves when things go wrong.
This is Module 10 of a 12-part curriculum: Build Software Products with AI — From First Principles to Production Pipeline.
Every AI agent demo works. They’re designed to. You see the happy path: the right tools called in the right order, the right output returned, the right action taken.
Production is different. In production, the API returns a 429. The file the agent expected isn’t there. The model output is malformed JSON that breaks downstream parsing. The context window fills up on a longer-than-expected task. The agent calls a destructive tool when it should have asked first.
Production hardening is the engineering discipline of anticipating these failures and building systems that handle them gracefully. This module covers the most important failure modes and how to address them.
Failure Handling
The first rule: never let failures fail silently.
When an agent encounters an error, it should:
- Catch the specific error type
- Log the error with context (what was attempted, what failed, what the error was)
- Decide whether to retry, escalate, or degrade gracefully
- Communicate the outcome — either to the user or to a log
What not to do: swallow errors and continue as if nothing happened. Silent failures cause downstream confusion, incorrect state, and hard-to-debug problems.
Build explicit error handling into every tool:
```typescript
import fs from 'node:fs/promises'

interface ToolResult {
  success: boolean
  output?: string
  error?: string
}

async function readFile(path: string): Promise<ToolResult> {
  try {
    const contents = await fs.readFile(path, 'utf-8')
    return { success: true, output: contents }
  } catch (err: any) {
    // Distinguish the common, recoverable case from generic failures
    if (err.code === 'ENOENT') {
      return {
        success: false,
        error: `File not found: ${path}. Verify the path exists before retrying.`
      }
    }
    return {
      success: false,
      error: `Failed to read ${path}: ${err.message}`
    }
  }
}
```
The error message includes enough context for the model to understand what went wrong and decide what to do next. “Error” alone is useless. “File not found: /path/to/file.ts. Verify the path exists before retrying.” is actionable.
Retry Logic
Some failures are transient: API rate limits, network timeouts, temporary service unavailability. These warrant retries. Others are permanent: a malformed request, invalid credentials, a file that genuinely doesn't exist. These should fail fast.
A simple exponential backoff pattern:
```typescript
const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms))

async function withRetry<T>(
  fn: () => Promise<T>,
  maxAttempts = 3,
  baseDelayMs = 1000
): Promise<T> {
  let lastError: unknown
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await fn()
    } catch (err: any) {
      lastError = err
      // Don't retry permanent failures
      if (err.status === 400 || err.status === 401 || err.status === 403) {
        throw err
      }
      // Back off exponentially: 1s, 2s, 4s, ...
      if (attempt < maxAttempts) {
        const delay = baseDelayMs * Math.pow(2, attempt - 1)
        await sleep(delay)
      }
    }
  }
  throw lastError
}
```
Retry logic should respect Retry-After headers from APIs that return them. If an API says “retry in 60 seconds,” use that value — not your own backoff calculation.
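A sketch of that header handling, which could replace the fixed backoff calculation in the loop above. It assumes the caught error exposes the HTTP response headers as `err.headers` (adjust for your client); the header can be either delta-seconds or an HTTP date:

```typescript
// Prefer the server's Retry-After value when present; otherwise fall back
// to our own exponential backoff. Assumes `err.headers` holds response headers.
function retryDelayMs(err: any, attempt: number, baseDelayMs: number): number {
  const retryAfter = err?.headers?.['retry-after']
  if (retryAfter) {
    const seconds = Number(retryAfter)
    if (!Number.isNaN(seconds)) return seconds * 1000      // delta-seconds form
    const date = Date.parse(retryAfter)                    // HTTP-date form
    if (!Number.isNaN(date)) return Math.max(0, date - Date.now())
  }
  return baseDelayMs * Math.pow(2, attempt - 1)
}
```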
Fallback Models
Model APIs occasionally have outages, elevated latency, or rate limiting. Production systems should have fallback models.
A common pattern:
- Primary: Claude Sonnet (fast, capable, main workload)
- Fallback: Claude Haiku (faster, cheaper, for lighter tasks when Sonnet is unavailable)
- Local fallback: Qwen3:30b via Ollama (runs locally, available even when all cloud APIs are down)
Implement this as model selection logic that degrades gracefully:
```typescript
type ModelTier = 'primary' | 'fallback' | 'local'

async function callModel(prompt: string, tier: ModelTier = 'primary') {
  const models: Record<ModelTier, string> = {
    primary: 'claude-sonnet-4',
    fallback: 'claude-haiku-4',
    local: 'ollama/qwen3:30b'
  }
  try {
    // `llm` is your model client wrapper
    return await llm.call(models[tier], prompt)
  } catch (err) {
    // Degrade one tier at a time; only give up when local fails too
    if (tier === 'primary') {
      console.warn('Primary model unavailable, falling back')
      return callModel(prompt, 'fallback')
    }
    if (tier === 'fallback') {
      console.warn('Fallback model unavailable, using local')
      return callModel(prompt, 'local')
    }
    throw err
  }
}
```
For sensitive tasks (personal data, financial data, private project logic), local models aren’t just a fallback — they’re the primary. Don’t route sensitive context through cloud APIs.
Approval Gates
Not every agent action should be automatic. Some actions are irreversible or high-stakes and should require human confirmation before execution.
Common candidates for approval gates:
- Deleting or overwriting files
- Sending emails or public messages
- Making payments or API calls with financial consequences
- Merging code to main
- Executing destructive database operations
The pattern: before executing a gated action, the agent presents what it’s about to do and waits for explicit approval. If approval is granted, proceed. If rejected, stop and report.
```
Agent: I'm about to send the following email to the product team:
[email preview]
Approve? (yes/no)
```
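A minimal sketch of that gate in code, assuming a terminal session and Node's readline for the prompt (the `GatedAction` shape is illustrative):

```typescript
import * as readline from 'node:readline/promises'

interface GatedAction {
  description: string          // human-readable preview of what will happen
  execute: () => Promise<void> // the actual side effect
}

// Present the action, wait for explicit approval, and only then execute.
// Anything other than an exact "yes" is treated as a rejection.
async function withApproval(action: GatedAction): Promise<void> {
  const rl = readline.createInterface({ input: process.stdin, output: process.stdout })
  try {
    const answer = await rl.question(`About to: ${action.description}\nApprove? (yes/no) `)
    if (answer.trim().toLowerCase() === 'yes') {
      await action.execute()
    } else {
      console.log('Action rejected; nothing was executed.')
    }
  } finally {
    rl.close()
  }
}
```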
For automated pipelines where human-in-the-loop isn’t practical, consider:
- Dry-run mode: run the full logic but simulate the side effects, returning what would have happened (see the sketch after this list)
- Rollback capability: every action that writes state has a corresponding undo
- Audit logging: every action taken is logged with enough detail to reconstruct what happened
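Dry-run and audit logging can share one execution path. A sketch, reusing the illustrative `GatedAction` shape from above (the audit file path and record fields are assumptions, not a fixed format):

```typescript
import { appendFile } from 'node:fs/promises'

interface ActionRecord {
  timestamp: string
  description: string
  dryRun: boolean
  outcome: 'simulated' | 'executed' | 'failed'
}

// In dry-run mode, record what would have happened without executing.
// Either way, append an audit line so the run can be reconstructed.
async function runAction(action: GatedAction, dryRun: boolean): Promise<void> {
  let outcome: ActionRecord['outcome'] = 'simulated'
  let failure: unknown
  if (!dryRun) {
    try {
      await action.execute()
      outcome = 'executed'
    } catch (err) {
      outcome = 'failed'
      failure = err
    }
  }
  const record: ActionRecord = {
    timestamp: new Date().toISOString(),
    description: action.description,
    dryRun,
    outcome
  }
  await appendFile('logs/audit.jsonl', JSON.stringify(record) + '\n')
  if (failure) throw failure // log first, then surface the error: never fail silently
}
```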
Context Window Management
Long agent runs fill up context windows. When the context window is full, one of two things happens: the model starts losing older context (a sliding window), or the API returns an error.
Strategies for managing long-running context:
Summarise and compact. Periodically summarise what’s been done so far into a compact representation. Discard the raw interaction history. Continue with the summary as context.
Strategic compaction. At natural milestone boundaries (finished a phase, completed a major sub-task), compact the context before proceeding to the next phase. Don’t wait until you’re forced to — compact at a time of your choosing, while you still have working memory.
Session segmentation. Break long tasks into multiple sessions with file-based handoffs. Session A does research and writes a summary file. Session B reads the summary file and does analysis. Cleaner than one session trying to do everything.
Context budgeting. Before injecting large documents or long histories, estimate their token cost and decide if the benefit justifies the context spend.
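A rough sketch of that budgeting check, using the common four-characters-per-token heuristic (real tokenizers vary, and the reserve figure is illustrative):

```typescript
// Crude token estimate: ~4 characters per token for English text.
// Good enough for budgeting decisions, not for billing.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4)
}

// Decide whether a document is worth its context spend before injecting it,
// keeping a reserve so the model still has room to work.
function fitsBudget(doc: string, remainingBudgetTokens: number, reserveTokens = 8000): boolean {
  return estimateTokens(doc) <= remainingBudgetTokens - reserveTokens
}
```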
Sensitive Data Policies
Some data should never leave your machine. Financial records, health data, private business logic, credentials — these belong in a local-only pipeline.
A practical policy:
Local-only (Ollama, never cloud APIs):
- Any file containing real credentials
- Financial data (bank statements, spending data, net worth)
- Health data (medical records, health metrics)
- Private business strategy before it's disclosed
Cloud APIs acceptable:
- General reasoning, writing, analysis
- Public codebase work
- Non-sensitive content creation
Implement this as a routing rule in your orchestrator: if a task involves sensitive data categories, route to local models. Document the policy so you don’t accidentally route something sensitive to a cloud API in a hurry.
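A sketch of that routing rule, assuming the tiers from callModel above and that tasks arrive already tagged with data categories (real classification is the hard part):

```typescript
type DataCategory = 'credentials' | 'financial' | 'health' | 'private-strategy' | 'general'

const LOCAL_ONLY: DataCategory[] = ['credentials', 'financial', 'health', 'private-strategy']

// Route to the local tier whenever any sensitive category is involved.
function selectTier(categories: DataCategory[]): 'primary' | 'local' {
  return categories.some((c) => LOCAL_ONLY.includes(c)) ? 'local' : 'primary'
}

// Usage: callModel(prompt, selectTier(['financial']))  // always routes local
```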
Monitoring and Observability
You can’t improve what you can’t observe. Build in logging from the start.
What to log:
- Every model call: timestamp, model, token count, latency, error if any
- Every tool call: timestamp, tool name, parameters, result, duration
- Every background job run: timestamp, job name, success/failure, output summary
- Every agent decision that has side effects
Where to store logs:
- For personal systems: plain log files with date rotation (logs/2026-05-09.log)
- For product systems: structured logging to a service like Datadog, Logflare, or Axiom
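A minimal sketch of the model-call log, written as JSONL so it is both greppable and parseable later (field names are illustrative):

```typescript
import { appendFile } from 'node:fs/promises'

// One JSON line per model call; the date in the filename gives free rotation.
async function logModelCall(entry: {
  model: string
  promptTokens: number
  completionTokens: number
  latencyMs: number
  error?: string
}): Promise<void> {
  const date = new Date().toISOString().slice(0, 10)
  const line = JSON.stringify({ timestamp: new Date().toISOString(), ...entry })
  await appendFile(`logs/${date}.log`, line + '\n')
}
```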
What to alert on:
- Repeated failures in the same tool or job
- Token usage spikes (could indicate a runaway loop)
- Latency degradation
- Unexpected tool calls (a coding agent calling a payment API shouldn’t happen)
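None of these alerts needs infrastructure. With JSONL logs like the sketch above, a repeated-failure check is a short script run on a schedule (field names and threshold are illustrative):

```typescript
import { readFile as readLog } from 'node:fs/promises'

// Flag any model or tool that failed `threshold` or more times in today's log.
async function checkRepeatedFailures(threshold = 3): Promise<string[]> {
  const date = new Date().toISOString().slice(0, 10)
  const raw = await readLog(`logs/${date}.log`, 'utf-8')
  const counts = new Map<string, number>()
  for (const line of raw.split('\n').filter(Boolean)) {
    const entry = JSON.parse(line)
    if (entry.error) {
      const key = entry.model ?? entry.tool ?? 'unknown'
      counts.set(key, (counts.get(key) ?? 0) + 1)
    }
  }
  return [...counts.entries()].filter(([, n]) => n >= threshold).map(([name]) => name)
}
```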
You don’t need a sophisticated observability stack for a personal agent. A daily log file you can grep is sufficient. What you do need is the habit of looking at it when something seems off.
Testing Your Agents
Agent testing is genuinely hard. The outputs are non-deterministic, the failure modes are complex, and “it worked in the demo” is not a test suite.
Practical approaches:
Unit tests for tools. Tools are functions. Test them like functions. Given this input, does the tool return the expected output? Does it handle error cases correctly?
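For example, the readFile tool from earlier can be tested with Node's built-in test runner, no framework required. The error case is the interesting one:

```typescript
import { test } from 'node:test'
import assert from 'node:assert'

// A missing file should produce a structured, actionable error,
// not a thrown exception.
test('readFile reports missing files as a structured error', async () => {
  const result = await readFile('/path/that/does/not/exist.ts')
  assert.strictEqual(result.success, false)
  assert.match(result.error!, /File not found/)
})
```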
Prompt regression tests. Keep a library of test prompts with expected behavior profiles. Run them periodically to catch regressions when you update system prompts. Not exact output matching — behavioral matching. “Does it still route correctly? Does it still use the right tool? Does the output still have the right structure?”
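A sketch of one such behavioural check, assuming a hypothetical runAgent harness that returns the tool calls made during a run (the harness and its shape are assumptions):

```typescript
import { test } from 'node:test'
import assert from 'node:assert'

// Behavioural, not exact-match: assert which tool was used,
// not the wording of the response. `runAgent` is a hypothetical harness.
test('file questions still route to the read_file tool', async () => {
  const run = await runAgent('What does src/index.ts export?')
  assert.ok(run.toolCalls.some((call: { name: string }) => call.name === 'read_file'))
})
```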
Dry-run modes. For background jobs, implement a --dry-run flag that runs the full logic but doesn’t write anything. Review the planned actions before letting the job run live.
Chaos testing. Deliberately introduce failures: missing files, API errors, malformed input. Verify the agent fails gracefully rather than catastrophically.
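The simplest chaos harness is an injected flaky function. This one exercises the withRetry helper from earlier: two simulated transient failures, then success:

```typescript
import { test } from 'node:test'
import assert from 'node:assert'

// Two injected timeouts, then success: withRetry should absorb both.
test('withRetry recovers from transient failures', async () => {
  let calls = 0
  const flaky = async () => {
    calls++
    if (calls < 3) throw Object.assign(new Error('timeout'), { status: 503 })
    return 'ok'
  }
  assert.strictEqual(await withRetry(flaky, 3, 1), 'ok') // 1ms base delay keeps the test fast
  assert.strictEqual(calls, 3)
})
```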
The Reliability Mindset
Production hardening isn’t a checklist — it’s a mindset.
Every time you build a new agent capability, ask:
- What happens if this fails?
- What happens if this runs twice?
- What happens if the model output is malformed?
- What’s the worst action this could take, and do I have a gate on it?
- If I looked at the logs, could I tell what happened?
Agents that handle failure well feel trustworthy. Agents that fail opaquely erode trust quickly. And an agent you don’t trust, you don’t use.
What’s Next
Your agent is hardened: it handles failures, retries intelligently, routes sensitive data correctly, has approval gates on dangerous actions, and maintains good logs. In Module 11, we put it all together: a practical, end-to-end walkthrough of setting up your own system from scratch.
Further Reading
- [Pattern] Exponential Backoff — Google Cloud Documentation — The canonical implementation reference for exponential backoff with jitter. Language-agnostic.
- [Blog] The Prompt Injection Problem — Simon Willison — The most important security concern for production agents. Read this before deploying anything public-facing.
- [Docs] Anthropic Safety and Responsible Use — Anthropic's own guidance on building safe agents. Covers prompt injection, jailbreaking, and output validation.
- [Paper] OWASP Top 10 for LLM Applications — The security community's authoritative list of LLM application vulnerabilities. Essential for production hardening.
- [Docs] Anthropic Rate Limits — Current rate limits for the Anthropic API. Know these before you design concurrent agent workloads.
- [Course] MIT 6.5940 - TinyML and Efficient Deep Learning — Covers running models efficiently at the edge and locally. Relevant when you're optimising inference costs or adding local model fallbacks to a production system.
Subscribe to get notified when new modules and courses drop. No drip — just updates when there's something worth reading.