Phase 03 — Production
Module 12 of 12 · 12 min read · Free

Module 12: Case Study — Ultron in Production

A complete retrospective of a real multi-agent system built and run daily. What worked, what didn't, and what I'd do differently.

This is Module 12 of a 12-part curriculum: Build Software Products with AI — From First Principles to Production Pipeline.


Everything in this curriculum has been abstracted from a real system. This final module removes the abstraction.

What follows is a complete, honest account of how my agent setup — a multi-agent system I call Ultron — was built, how it works today, what problems I’ve solved, what problems remain, and what I’d do differently if I started over.

This is the module most courses can’t write. They teach from theory and examples they’ve constructed for the course. This is built from a system I use every single day.


What the System Does

Ultron is my personal AI operating system. It’s not a chatbot I occasionally ask questions to. It’s a persistent infrastructure layer that’s running, watching, and working continuously.

On a typical day, before I’ve done anything:

  • A daily note has been created with yesterday’s summary and today’s structure
  • The Obsidian workspace has been backed up to GitHub
  • The backlog has been synced against Obsidian context files
  • Any urgent emails have been flagged

During the day:

  • I assign coding tasks to agents via Slack channels. Each project has a channel. Each channel routes to a coding agent that works in a git worktree, opens a PR, and waits for review.
  • I can ask Ultron anything about my projects, calendar, backlog, or objectives and get a grounded answer — because the memory system has the relevant context.
  • The R&D Council runs at 9am daily: three specialised agents (growth, product, devil’s advocate) analyse the state of my ventures and produce a synthesised memo to my Slack DM.

After work:

  • Alex (my content agent) handles drafts, threads, and content for a new community site I’m building for product builders
  • Coach (my productivity agent) runs weekly planning on Sundays and daily work cycles on weekdays

The Architecture

Ultron (main)
├── Memory layer
│   ├── SOUL.md (identity)
│   ├── USER.md (Niko profile)
│   ├── MEMORY.md (long-term curated)
│   └── memory/YYYY-MM-DD.md (daily episodic)
├── Skills
│   ├── objectives-tracker (OKR progress)
│   └── [project-specific skills]
├── Specialist agents
│   ├── Alex (#community-builder Slack)
│   ├── Coach (#coach Slack)
│   └── Mailbox (#mailbox Slack)
├── Coding agents (auto-spawned per channel)
│   ├── #whorang-dev → ~/Desktop/whorang/ repo
│   ├── #aim-dev → ~/Desktop/useaim/ repo
│   └── #nikovijay-dev → ~/Desktop/nikovijay.com/ repo
└── Background jobs
    ├── 23:00 — action items cleanup
    ├── 23:30 — GitHub workspace backup
    ├── 01:00 — daily note + priorities
    ├── 08:00 Mon-Fri — morning briefing (Coach)
    ├── 09:00 daily — R&D Council
    └── Heartbeat (30-60 min) — inbox triage, backlog sync
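
To make the background layer concrete, here is a minimal sketch of how those jobs could be wired up in a single process using the Python `schedule` library. The `run_agent` helper and job wiring are illustrative stand-ins for the real agent runtime:

```python
# A minimal sketch of Ultron's background layer, collapsed into one process.
# run_agent is a stand-in for waking the named agent with a task.
import time

import schedule  # pip install schedule


def run_agent(agent: str, task: str) -> None:
    print(f"[{time.strftime('%H:%M')}] {agent}: {task}")  # placeholder


schedule.every().day.at("23:00").do(run_agent, "ultron", "action items cleanup")
schedule.every().day.at("23:30").do(run_agent, "ultron", "GitHub workspace backup")
schedule.every().day.at("01:00").do(run_agent, "ultron", "daily note + priorities")
schedule.every().day.at("09:00").do(run_agent, "council", "R&D Council memo")
for day in (schedule.every().monday, schedule.every().tuesday,
            schedule.every().wednesday, schedule.every().thursday,
            schedule.every().friday):
    day.at("08:00").do(run_agent, "coach", "morning briefing")
schedule.every(45).minutes.do(run_agent, "ultron", "heartbeat: inbox triage, backlog sync")

while True:
    schedule.run_pending()
    time.sleep(60)
```

In production these are cron jobs plus a recurring heartbeat rather than one long-lived loop; the sketch just collapses them for illustration.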

How It Was Built

Not all at once. The system I run today looks nothing like what I had four months ago.

Phase 1: The main agent. I started with just Ultron — one agent, basic memory, a handful of tools. The goal was simple: a persistent assistant that remembered who I am and what I’m working on.

The biggest lesson from Phase 1: the memory system is the hard part. Getting the agent to reliably write to memory files, load them correctly, and actually use that context required more iteration than I expected.

Phase 2: Specialist agents. Once the main agent was stable, I started adding specialist agents for specific workloads. Alex for content. Coach for productivity. Each followed the same pattern: dedicated system prompt, dedicated workspace, dedicated channel.

The lesson from Phase 2: specialisation pays off more than I expected. Alex produces much better content than Ultron ever did when I asked Ultron directly. The focused context matters.

Phase 3: The coding pipeline. The coding agents were the biggest step up in capability. Moving from “ask the agent to write some code and paste it” to a full branch-per-task, worktree, PR pipeline was a significant engineering investment — but the payoff has been enormous.

The lesson from Phase 3: the discipline matters as much as the technology. Branch per task, PR per task, human review on every merge — these aren’t suggestions. They’re what makes the output trustworthy.

Phase 4: Background automation. Once the synchronous capabilities were stable, I added the background layer. Cron jobs and heartbeats.

The lesson from Phase 4: start smaller than you think you need. I over-automated too early and ended up with a noisy system that pinged me for everything. Pruning back to only the genuinely useful background work took several weeks.


What Actually Works

The memory architecture. SOUL.md + USER.md + MEMORY.md + daily notes is a durable pattern. It solves the stateless problem without requiring a vector database or complex retrieval. The agent wakes up knowing who it is and who it’s serving. This compounds: after months of daily notes, the agent has rich context about patterns, decisions, and history.
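
A minimal sketch of what that wake-up step can look like, assuming the file layout from the architecture above (the directory location and prompt assembly are illustrative):

```python
# Sketch: assemble standing context from flat memory files, no vector DB.
from datetime import date, timedelta
from pathlib import Path

MEMORY_ROOT = Path.home() / "ultron"  # illustrative location


def load_context(recent_days: int = 3) -> str:
    parts = []
    # Identity, user profile, and curated long-term memory, in that order.
    for name in ("SOUL.md", "USER.md", "MEMORY.md"):
        path = MEMORY_ROOT / name
        if path.exists():
            parts.append(f"## {name}\n{path.read_text()}")
    # The last few daily episodic notes, oldest first.
    for offset in range(recent_days, 0, -1):
        day = (date.today() - timedelta(days=offset)).isoformat()
        path = MEMORY_ROOT / "memory" / f"{day}.md"
        if path.exists():
            parts.append(f"## memory/{day}.md\n{path.read_text()}")
    return "\n\n".join(parts)
```

The returned string is prepended to the system prompt at session start, which is what lets the agent wake up knowing who it is and who it’s serving.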

Branch-per-task coding. The single rule that made the coding pipeline trustworthy. Every task isolated. Every change reviewable. Main always clean. This is not negotiable — I’ve never regretted it.
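
The mechanics are a handful of git commands per task. A sketch, assuming task IDs map to branch names (the repo paths and naming convention are illustrative):

```python
# Sketch: one branch + one worktree per coding task, main stays clean.
import subprocess
from pathlib import Path


def start_task(repo: Path, task_id: str) -> Path:
    branch = f"task/{task_id}"                      # illustrative naming
    worktree = repo.parent / f"{repo.name}-{task_id}"
    subprocess.run(["git", "-C", str(repo), "fetch", "origin", "main"], check=True)
    # A fresh branch off origin/main, checked out in its own directory,
    # so the agent never touches the primary checkout.
    subprocess.run(["git", "-C", str(repo), "worktree", "add",
                    "-b", branch, str(worktree), "origin/main"], check=True)
    return worktree
```

From there the agent commits, pushes, and opens a PR, and every merge goes through human review.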

The R&D Council. Three agents with different thinking styles (growth, product, devil’s advocate) reviewing my ventures daily produces more useful challenge than any individual review. The devil’s advocate role especially — it pushes back on comfortable assumptions in ways a growth-optimistic agent never would.
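
Structurally, the council is a fan-out and a merge: the same briefing goes to three differently-prompted personas, and a final call synthesises their reviews. A sketch, where `complete()` and the persona prompts are stand-ins for the real client and system prompts:

```python
# Sketch: fan one briefing out to three personas, then synthesise a memo.
PERSONAS = {
    "growth": "You hunt for distribution and growth opportunities.",
    "product": "You judge product quality, scope, and sequencing.",
    "devil's advocate": "You attack comfortable assumptions and argue the downside.",
}


def complete(system: str, prompt: str) -> str:
    """Stand-in for your LLM client of choice."""
    raise NotImplementedError


def run_council(briefing: str) -> str:
    reviews = {name: complete(system_prompt, briefing)
               for name, system_prompt in PERSONAS.items()}
    merged = "\n\n".join(f"### {name}\n{text}" for name, text in reviews.items())
    return complete("Synthesise these three reviews into one short, "
                    "decision-ready memo. Keep the disagreements visible.",
                    merged)
```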


What Doesn’t Work (Yet)

Context window management on long tasks. Long-running coding tasks occasionally hit context limits mid-session. The agent loses earlier context and starts making decisions that contradict work it has already done. I’ve partially solved this with more aggressive compaction instructions, but it isn’t fully solved.
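
The mitigation looks roughly like this: watch the token budget and summarise the oldest turns before the window overflows. A sketch; the budget, threshold, and helper functions are illustrative:

```python
# Sketch: compact the oldest turns before the context window overflows.
CONTEXT_BUDGET = 200_000   # illustrative token limit
COMPACT_AT = 0.8           # start compacting at 80% of budget


def count_tokens(text: str) -> int:
    return len(text) // 4  # rough heuristic; use your model's tokenizer


def summarise(messages: list[dict]) -> str:
    raise NotImplementedError  # an LLM call in practice


def maybe_compact(messages: list[dict]) -> list[dict]:
    used = sum(count_tokens(m["content"]) for m in messages)
    if used < COMPACT_AT * CONTEXT_BUDGET:
        return messages
    # Summarise the oldest half so early decisions survive at least in
    # summary form; keep the recent half verbatim.
    half = len(messages) // 2
    summary = summarise(messages[:half])
    return ([{"role": "system", "content": f"Summary of earlier work:\n{summary}"}]
            + messages[half:])
```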

Proactive calibration. The heartbeat system is reliable, but the agent isn’t good at deciding when something is genuinely worth surfacing vs. when it should stay silent. I get occasional notifications that don’t warrant interruption. This is a prompt refinement problem I haven’t fully solved.

Multi-project orchestration. When tasks across projects have dependencies (changes to a shared library affect multiple products), the current architecture doesn’t model those dependencies well. Each coding agent operates independently. Cross-project awareness requires explicit human coordination.

Memory pruning. MEMORY.md is getting long. The agent doesn’t reliably prune outdated context on its own. I do manual reviews periodically, but I’d like the agent to maintain better hygiene autonomously.


What I’d Do Differently

Start with the memory system. I spent weeks building tools and prompts before properly solving memory. I should have built SOUL.md, USER.md, and the daily notes system first. Everything else depends on continuity.

Define the task lifecycle earlier. The backlog → ready → in-progress → review → done lifecycle came late. Before it, tasks were ad-hoc and tracking was mental overhead. The lifecycle should be in from day one.
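
The lifecycle itself is tiny to encode; the value is in making the states and legal transitions explicit. A sketch:

```python
# Sketch: the task lifecycle as an explicit state machine.
from enum import Enum


class TaskState(Enum):
    BACKLOG = "backlog"
    READY = "ready"
    IN_PROGRESS = "in-progress"
    REVIEW = "review"
    DONE = "done"


# Legal transitions only; anything else is a bug, not a judgment call.
TRANSITIONS = {
    TaskState.BACKLOG: {TaskState.READY},
    TaskState.READY: {TaskState.IN_PROGRESS},
    TaskState.IN_PROGRESS: {TaskState.REVIEW},
    TaskState.REVIEW: {TaskState.DONE, TaskState.IN_PROGRESS},  # review can bounce back
    TaskState.DONE: set(),
}


def advance(current: TaskState, target: TaskState) -> TaskState:
    if target not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition: {current.value} -> {target.value}")
    return target
```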

Tighter agent scopes from the start. My early agents tried to do too much. The content agent handled writing, coding guidance, research, and strategy advice. It was mediocre at all of them. When I narrowed each agent to a specific domain, quality improved dramatically.

Better logging from the start. I added structured logging much later than I should have. When something goes wrong in a background job, the ability to grep a log file is invaluable. Build logging in from the beginning.
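
A sketch of the kind of logging that makes background jobs debuggable: one structured line per event, greppable by job name (the log path and job names are illustrative):

```python
# Sketch: one structured, greppable line per background-job event.
import json
import logging
from pathlib import Path

LOG_FILE = Path.home() / "ultron" / "jobs.log"  # illustrative path
LOG_FILE.parent.mkdir(parents=True, exist_ok=True)

logging.basicConfig(filename=LOG_FILE, level=logging.INFO,
                    format="%(asctime)s %(levelname)s %(message)s")


def log_event(job: str, event: str, **fields) -> None:
    # JSON payload on one line: `grep github-backup jobs.log` finds it all.
    logging.info(json.dumps({"job": job, "event": event, **fields}))


log_event("github-backup", "start")
log_event("github-backup", "done", commits_pushed=1)
```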

Write the system documentation as you build. AGENTS.md, the project docs, the data policy — I wrote much of this retrospectively. Writing it as you build forces clarity of thought and creates a better system.


The Honest Assessment

This system has materially changed what I can do. I ship code across multiple projects. I produce content consistently. I process the output of a busy professional life (meetings, emails, decisions, strategy work) without drowning in it.

It is not magic. It requires real engineering, real calibration, and ongoing maintenance. The system I have today is the result of months of iteration — not a setup that worked immediately.

The curve is steeper than the YouTube videos suggest. The ceiling is higher than most people expect.

If you’ve read all twelve modules: you have the mental model, the architecture, and the practical patterns. The only thing that produces a working system now is building it. Start simple. Add one capability at a time. Iterate from feedback. The compounding happens with time.


The Full Curriculum

  1. What is an LLM?
  2. Thinking in Prompts
  3. What is an Agent?
  4. Memory Systems
  5. Skills and Tools
  6. Orchestrators
  7. Multi-Agent Architecture
  8. Agentic Coding Pipelines
  9. Scheduling and Background Work
  10. Production Hardening
  11. Build Your Own Setup
  12. Case Study: Ultron in Production ← you are here

