There’s a Bourdain episode (the Lyon one, Parts Unknown) where he walks into this tiny bouchon and you can see the kitchen through a pass window. Two cooks, maybe three, working in a space the size of a bathroom closet. No yelling. No chaos. Just this quiet, brutal efficiency where every movement has a purpose and nothing is wasted. Bourdain sums it up: “This is what it looks like when people know exactly what they’re doing.”
I think about that scene sometimes. Not because I know anything about running a kitchen (I don’t), but because it might be the best visual metaphor I’ve come across for what good engineering should feel like. It almost never does.
Most software development teams don’t operate like that bouchon. They operate like the restaurant down the street that’s always “temporarily closed for renovations.” Requirements arrive half-baked. Somebody starts coding before anyone’s agreed on what “done” looks like. The tests get written after the code, if they get written at all. Three developers are working on the same feature in ways that will conflict when they try to merge. The project manager is updating a Jira board that stopped reflecting reality sometime around sprint two. And everyone’s pretending that “we’ll clean it up later” isn’t the most expensive lie in technology.
We don’t build software like that. The way we do build it is interesting enough to write about, not because it’s perfect, but because it’s working, and we’re figuring it out in public.
The Spec: 1,000 Lines Before a Single Line of Code
This is the part where most engineers check out, because it sounds like everything they hate about enterprise development. Specifications. Requirements documents. Formal planning.
This isn’t waterfall.
Mise en place, the religion of the professional kitchen, is the discipline of having every ingredient prepped, every sauce reduced, every garnish cut before service starts. The idea is simple: you do the thinking and the preparation when it’s calm so you can execute when it’s not. A cook who shows up to a Friday rush and starts dicing onions is a cook who’s about to have a very bad night.
Our spec process is the software equivalent. It’s built on Speckit, GitHub’s open-source specification framework, which we’ve adapted and extended to fit our workflow. The result is a six-step pipeline that turns plain English into a bulletproof blueprint before anyone writes a line of code.
Step 1: Specify. Someone describes what they need in normal human language. No code, no schemas, no architectural diagrams. Just: “We need a way for brokers to build multi-layer insurance programs, market those layers to carriers, and visualize the whole thing as a tower diagram.” The AI transforms this into a structured specification: user stories, functional requirements, edge cases, success criteria. All in plain English. A business person could read it and know exactly what’s being built.
Step 2: Clarify. This is where the AI plays devil’s advocate. It reads through the spec and asks pointed questions about things that are vague, ambiguous, or could be interpreted multiple ways. “When a program transitions to BOUND status, does the tower service create the policies itself, or does it fire an event and let the policy module handle it?” These aren’t academic questions. Each one could mean the difference between building the right thing and building something that needs to be torn down.
Step 3: Plan. Now we go technical. Architecture, data models, API contracts, technology decisions with explanations of why. There’s also a constitution check. Our project has non-negotiable principles (modular architecture, test-first development, enterprise security, specific performance targets), and the plan gets validated against them. You can’t just say “we’ll skip security on this one.” The constitution doesn’t allow it.
Step 4: Tasks. The plan gets decomposed into individual tasks, sometimes hundreds of them. Each one is specific enough that an AI agent can execute it autonomously: “Create TowerProgram Zod validation schema in modules/tower-service/backend/src/schemas/program.schema.ts.” That’s not a user story. That’s a work order with a file path.
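To make that work order concrete, here’s a hedged sketch of what a schema task like that might produce. The real task calls for Zod; this dependency-free TypeScript version (all names, fields, and invariants here are my illustration, not the actual spec) shows the same idea: validate shape and invariants before anything touches the database.

```typescript
// Hypothetical sketch of a TowerProgram validation schema.
// The real implementation uses Zod; this plain-TypeScript version
// illustrates the kind of invariants such a schema would enforce.
type ProgramStatus = "DRAFT" | "MARKETING" | "BOUND";

interface TowerLayer {
  attachment: number; // where the layer starts, in dollars
  limit: number;      // layer size, in dollars
}

interface TowerProgram {
  name: string;
  status: ProgramStatus;
  totalLimit: number;
  layers: TowerLayer[];
}

function validateTowerProgram(p: TowerProgram): string[] {
  const errors: string[] = [];
  if (p.name.trim().length === 0) errors.push("name must not be empty");
  if (p.layers.length === 0) errors.push("at least one layer is required");
  // Layers must stack contiguously from zero up to the total limit.
  let expectedAttachment = 0;
  for (const layer of p.layers) {
    if (layer.attachment !== expectedAttachment) {
      errors.push(`layer must attach at ${expectedAttachment}`);
    }
    if (layer.limit <= 0) errors.push("layer limit must be positive");
    expectedAttachment = layer.attachment + layer.limit;
  }
  if (expectedAttachment !== p.totalLimit) {
    errors.push("layers must sum to the total limit");
  }
  return errors;
}
```

The point of making the task this specific is that an agent can implement it, and a quality gate can test it, without any further conversation.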
Step 5: Analyze. Cross-reference everything. Does the plan mention a feature that isn’t in the spec? Does a task reference a data model that doesn’t exist? Did we say “real-time updates” but forget to plan the WebSocket infrastructure? This step catches contradictions before they become bugs.
Step 6: Checklist. This is my favorite part. Domain-specific quality checklists (platform, security, financial, insurance, AI) that validate the requirements themselves. For our Coverage Tower feature, we had 74 checklist items across 6 domains. Every single one had to be addressed before we could start building. Not “acknowledged,” but addressed, with a detailed response explaining exactly where in the spec the requirement is satisfied.
Example:
CHK029: “Is the bind operation’s atomicity requirement specified?” ADDRESSED: spec.md “Bind Operation Atomicity” edge case specifies: single database transaction, outbox pattern for event emission, idempotency keys for retry safety.
We verified that the blueprint specifies what happens if the system crashes mid-operation while binding an insurance program. In insurance, “oops, we lost half the transaction” is not an acceptable answer. Not for regulators, not for carriers, not for the brokers whose clients are depending on that coverage being in force.
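The checklist response names two specific techniques: idempotency keys and the outbox pattern. A minimal sketch of how they fit together, in TypeScript (the function and type names are mine, and the in-memory maps stand in for database tables written inside one transaction):

```typescript
// Hedged sketch of idempotent bind handling: a retried request with the
// same key returns the stored result instead of re-executing the bind.
// In production, completedBinds and outbox would be database tables
// written in the SAME transaction as the bind itself (the outbox pattern),
// so a mid-operation crash leaves either both records or neither.
type BindResult = { programId: string; boundAt: number };

const completedBinds = new Map<string, BindResult>();
const outbox: { event: string; programId: string }[] = [];

function bindProgram(programId: string, idempotencyKey: string): BindResult {
  const prior = completedBinds.get(idempotencyKey);
  if (prior) return prior; // safe retry: no double-bind, no duplicate event

  const result: BindResult = { programId, boundAt: Date.now() };
  completedBinds.set(idempotencyKey, result);
  outbox.push({ event: "program.bound", programId });
  return result;
}
```

A separate relay process would drain the outbox and publish events, which is what lets the event emission survive crashes without ever firing twice for the same bind.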
The whole Speckit process typically takes 2-3 days for a major feature. That might sound like a lot of time spent not coding. It is. That’s the point. In my experience, most software projects fail not because the code is bad, but because nobody agreed on what the code was supposed to do. We’d rather spend three days on a spec than three months building the wrong thing.
The Coverage Tower: Making It Concrete
Our most recent major feature shows what this looks like in practice.
In the E&S insurance world, large risks are split across multiple layers and multiple carriers. A $50 million building might have a primary layer (0-10M) with one carrier, a first excess layer (10-25M) split between two carriers at 60/40, and a second excess (25-50M) split three ways. Stack these layers and you get a tower. Brokers need to build them, market them, visualize them, track the financials across all 50 US states (each with different surplus lines tax rules), and manage quota share arrangements.
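The tower described above can be sketched as a small data structure, and a carrier’s total exposure falls out as limit × share, summed over the layers it participates in. This is my illustrative model, not the platform’s actual schema:

```typescript
// Sketch of the example tower: each layer has a limit and a set of
// carrier quota shares (fractions that sum to 1 per layer).
interface Layer {
  attachment: number;             // dollars
  limit: number;                  // dollars
  shares: Record<string, number>; // carrier -> fraction of the layer
}

const tower: Layer[] = [
  { attachment: 0,          limit: 10_000_000, shares: { A: 1.0 } },
  { attachment: 10_000_000, limit: 15_000_000, shares: { B: 0.6, C: 0.4 } },
  { attachment: 25_000_000, limit: 25_000_000, shares: { C: 1 / 3, D: 1 / 3, E: 1 / 3 } },
];

// A carrier's exposure across the whole tower.
function carrierExposure(tower: Layer[], carrier: string): number {
  return tower.reduce(
    (sum, layer) => sum + layer.limit * (layer.shares[carrier] ?? 0),
    0,
  );
}
```

Carrier A’s exposure is the full $10M primary; carrier B’s is 60% of the $15M first excess, or $9M. The real feature layers state-by-state surplus lines tax rules on top of this, which is where the complexity actually lives.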
The Coverage Tower went through the full Speckit pipeline. Three days from “we need this” to “foundation is built and tested”:
- Day 1: Initial specification. User stories, requirements, edge cases.
- Days 1-2: Clarification. 3 targeted questions resolved, answers folded into spec.
- Day 2: Technical planning. Architecture, data models, API contracts.
- Days 2-3: Domain research. 77,758 lines of insurance domain analysis. Surplus lines tax rules for all 50 states. Quota share treaty mechanics. Binding procedures. The kind of research that would take a human team weeks.
- Day 3: Task generation (170+ tasks), checklist validation (74 items), two waves of spec refinement (36 edge cases added, event contracts created, all checklist items addressed), then foundation orchestration.
The spec alone is thousands of lines. Most of them matter.
The Swarm: Parallel Agents Building in Waves
This is where it gets wild.
Those 170+ tasks? A human developer works through them sequentially. Task 1, then 2, then 3. At that rate, this feature could take months. We don’t do that.
We use orchestration, a system where a conductor AI reads the plan, figures out which tasks can be done simultaneously, and dispatches a fleet of specialist agents to work on them in parallel. Think of it as a general contractor managing a crew of specialists who all work at the same time.
We have 13 types of specialist agents. A Database Architect for schema work. A Backend Engineer for APIs. A Frontend Engineer for React components. A QA Test Engineer. A Security Engineer. An Insurance Domain Analyst. A Rust Systems Engineer for our Pingora-based router. And at the cheap end, a Simple Coder for mechanical tasks. You don’t hire a structural engineer to hang a picture frame.
The conductor groups tasks into waves, batches of work that can run simultaneously without stepping on each other’s toes. For the Coverage Tower foundation:
Wave 1: 12 tasks, 6 agents working simultaneously: Database Architect creates the schema. Backend Engineer sets up event types and API types. Simple Coder creates authorization permissions, error classes, route files. Build Verifier installs dependencies and generates database code.
Wave 2: After Wave 1 completes, 2 tasks: Simple Coder wires up the module entry point (which needs Wave 1’s files to exist). Simple Coder creates security tests.
Tasks that don’t depend on each other run at the same time. Tasks that depend on earlier work wait. A construction crew can frame walls and run electrical simultaneously, but nobody hangs drywall until the framing is done. Same principle.
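The wave-planning logic is, at its core, a topological sort into layers. A minimal sketch (not the conductor’s actual code, just the scheduling principle) in TypeScript:

```typescript
// Minimal sketch of wave scheduling: group tasks into waves where each
// task's dependencies all live in earlier waves. This is a layered
// topological sort; a cycle means the plan itself is broken.
interface Task {
  id: string;
  dependsOn: string[];
}

function planWaves(tasks: Task[]): string[][] {
  const done = new Set<string>();
  const remaining = [...tasks];
  const waves: string[][] = [];
  while (remaining.length > 0) {
    // A task is ready when everything it depends on has already run.
    const ready = remaining.filter(t => t.dependsOn.every(d => done.has(d)));
    if (ready.length === 0) throw new Error("dependency cycle detected");
    waves.push(ready.map(t => t.id));
    for (const t of ready) {
      done.add(t.id);
      remaining.splice(remaining.indexOf(t), 1);
    }
  }
  return waves;
}
```

Feed it the Coverage Tower foundation tasks and you get exactly the shape described above: schema and API types land in wave one, and the module entry point, which needs those files to exist, waits for wave two.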
Here’s a real-world example that shows the power of this. We needed to update the dark theme across the entire platform: shared UI library, application shell, monitoring dashboard (8 pages), several feature modules, two infrastructure dashboards. That’s 21 tasks across 8 packages. A developer might spend a full week.
The orchestrator broke it into 4 waves:
- Wave 1: Core theme definition and main layout (2 tasks)
- Wave 2: Shell components and tests (5 tasks)
- Wave 3: All monitoring dashboard pages (8 tasks in parallel)
- Wave 4: All feature module components (6 tasks in parallel)
21 coordinated changes across 8 packages, tested and verified. All four quality gates passed.
The Gate That Won’t Let You Through
After every wave, an automated quality gate runs. This is structurally enforced, not a suggestion. The gate checks:
- Type checking: are all the types consistent, with zero compile errors?
- Test execution: do all tests pass?
- TDD compliance: were tests written alongside the code? (Constitutional, not optional.)
- Coverage audit: is enough of the code tested? We require 80-90%.
- Pattern checks: any forbidden patterns? Think of it as a building inspector checking for lead pipes.
If the gate says FAIL, work stops. Period. No “well, the important parts work.” No “it’s low risk.” The issue gets fixed, the gate re-runs, and only when it says PASS does the next wave begin.
It’s deliberately rigid, and that’s what makes it work. Catching a problem in Wave 1 means it doesn’t silently cascade through Waves 2, 3, and 4. Bugs compound when they’re ignored. This structure prevents that.
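The fail-closed behavior is the whole design. A sketch of the gate’s control flow (names are mine; the real checks shell out to the type checker, test runner, and coverage tool):

```typescript
// Sketch of a fail-closed quality gate: checks run in order, work stops
// at the first FAIL, and only an overall PASS unlocks the next wave.
type Check = { name: string; run: () => boolean };

function runGate(checks: Check[]): { status: "PASS" | "FAIL"; failed?: string } {
  for (const check of checks) {
    if (!check.run()) return { status: "FAIL", failed: check.name };
  }
  return { status: "PASS" };
}

// Example: a coverage check with the 80% floor described above.
const coverageCheck = (pct: number): Check => ({
  name: "coverage",
  run: () => pct >= 80,
});
```

Note there’s no "warn" state and no override flag. If a check can be skipped, it will be skipped, so the gate simply doesn’t offer the option.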
After all waves complete, there’s a final walkthrough: production readiness scan, full test suite across all affected packages, stricter coverage thresholds, dependency security audit, and a report of everything that was done. Only then does the feature get marked as done.
What This Actually Looks Like at 10 AM
Here’s a real development session, start to finish:
10:00 AM: “We need to add endorsement propagation to the coverage tower.” Fire up Speckit. The AI asks: “Should endorsements propagate proportionally to all layers, or only affected layers?” I answer. The spec updates. Twenty-five minutes later, spec is done, checklist runs, three gaps surface, I address them.
10:45 AM: Generate the plan. 12 tasks across 3 phases. Launch orchestration. Wave 1: database changes, backend logic, test scaffolding, 4 agents in parallel.
11:05 AM: Wave 1 quality gate: PASS. Wave 2 launches: API routes, frontend hooks, integration tests, 3 agents.
11:20 AM: Wave 2 quality gate: FAIL. A test is missing for an edge case. Remediation task auto-created. Agent fixes it. Re-run gate: PASS. Wave 3: frontend components, Storybook documentation. 2 agents.
11:45 AM: Wave 3 quality gate: PASS. Completion protocol runs. All clear. Feature is built, tested, documented, and ready for review. Time for lunch.
One hour and forty-five minutes from “we need this” to “it’s done and tested.” I spent most of that time answering clarifying questions and reviewing the spec. Not writing code.
And it’s not always that clean. When we built the Watchtower dependency graph, we went through five visual iterations: flat cards (boring), 3D cubes (too heavy, performance tanked), stripped-back flat cards with React.memo (fast but generic), hex-shaped nodes with gradients (distinctive), and finally the polished version with glassmorphism and dark theme support. Each pivot was its own orchestration cycle. The difference is that even the messy, iterative parts were structured, quality-gated, and traceable.
Enter Forge
All of the above works. Really well. But it’s held together with slash commands, prompt templates, and a conductor that lives in a Claude Code session. It works the way a brilliant one-person operation works. Everything’s in the operator’s head, and the moment complexity outgrows their attention span, things start slipping.
So we’ve been building something to fix that.
Forge is our internal pipeline orchestrator, the tool we’re building to codify everything I just described into repeatable, automated infrastructure. It launches parallel Claude Code instances as OS processes, each in its own isolated git worktree. There’s a real-time web dashboard where you can watch every agent’s terminal output live, like security camera feeds for your AI development crew. Structured observability captures every decision an agent makes, every insight it discovers, every follow-up it flags. Human-in-the-loop controls let you pause an agent, answer its questions, review diffs before approving a wave merge.
It’s the natural evolution: we built the process manually while building plcy.io, figured out what worked, then started building the tool that runs the process for us. We needed it, so we built it. Same philosophy we try to apply to everything. When something doesn’t exist and you need it, you make it.
The CLI is beautifully simple: forge run tasks.md, and a swarm of specialists spins up, waves execute, quality gates enforce, and you watch the whole thing from a dashboard that looks like mission control for software development.
The Process Isn’t Sacred
There’s a Bourdain line I like: “Your body is not a temple, it’s an amusement park. Enjoy the ride.” The process isn’t sacred either. The architecture isn’t precious. What matters is: does it work? Does it ship? Can you trust it with real money, real policies, real people?
Our process front-loads the thinking so the building is fast. It enforces quality structurally so nobody has to play code cop. It traces every decision from requirement to code to test to deployment so when a regulator asks “why does the system do X?” we can answer with receipts.
It’s all documented, right here, in public. We’re sharing every lesson, every mistake, every pivot, one post at a time. If you’re doing something similar (building with AI agents, trying to figure out the orchestration, wondering whether the quality can possibly be production-grade) we’re a few months ahead of you on that journey.
Subscribe via RSS, follow along at plcy.io, or just come back when curiosity strikes. The door’s open.