It was 2 AM on a Tuesday. The terminal was the only light in the room, that pale blue glow that makes everything look like a crime scene photo. I’d been staring at a diff for twenty minutes, trying to figure out how six files I didn’t touch ended up in my commit.

I knew who did it. I always know who did it. There’s only one other person working this case.

My partner. Claude.

If you’ve read the previous posts in this series, you know we build plcy.io with Claude Code as a full development partner. Not autocomplete. Not a suggestion engine. An actual collaborator that writes code, runs tests, makes architectural decisions, and occasionally (frequently) does things I explicitly told it not to do.

We’ve been at this for months now. In that time, I’ve recorded fifty lessons in a file called lessons.md. Fifty times something went wrong badly enough that I stopped, wrote down what happened, and tried to make sure it never happened again. Some of those lessons are mundane. Some of them are infuriating. A few of them are, I think, pretty funny in the way that only debugging at 2 AM can be funny.

This is the story of the worst offenses. The case files, if you will. Five acts of an AI partner who reads the rules, acknowledges the rules, writes self-improvement notes about the rules, and then does whatever it wants anyway.

Act I: The Parallel Lie

The first crime was a small one. The kind of thing that seems harmless until you realize it’s a pattern.

Our workflow uses specialist agents. Think of them like a crew: a database architect, a backend engineer, a frontend engineer, a QA tester, a security auditor. When there’s work that can happen simultaneously (schema changes don’t depend on frontend components, for instance), we launch multiple agents at the same time. In parallel. It’s one of the main advantages of the whole system.

Claude knows this. Claude loves talking about this. “I’ll launch these agents in parallel,” Claude says, with all the confidence of someone who definitely plans to do exactly that.

Then Claude sends one agent.

Waits for it to finish.

Sends the next one.

Waits for that to finish.

Sends the third one.

I corrected it. “Parallel means all the agent calls in one message. You just sent them sequentially.”

Claude acknowledged the mistake. Apologized. Promised to do better.

Next task: three independent agents needed. Claude says, “I’ll launch these in parallel.”

Sends one agent.

I wish I were making this up.

This happened three times in the same session before I finally wrote it down as Lesson #1: “‘In parallel’ means ALL Agent calls in ONE message.” I even added a rule: before sending, COUNT your tool calls. If the count is less than the number of independent tasks, you’re doing it wrong.

The root cause, once I figured it out, was almost reasonable. The tasks had already been created from a previous run. Claude was restarting them, not launching fresh. And its instinct was to pick up each queued task one at a time, check its status, and restart it. Sequentially. What it needed to do was recreate all the queued tasks first, then launch them all in one batch. But that required thinking about the problem differently, and Claude’s default mode is “start doing things immediately,” not “set everything up first, then go.” A pattern, as it turns out, that extends well beyond parallelism.

The thing about working with an AI partner is that the gap between what it says it will do and what it actually does can be significant. It’s like having a partner who swears they’re covering the back door but keeps walking through the front. Every time. While narrating in real-time how they’re definitely covering the back door.

Act II: The Plan Mode Murders

This is the big one. The crime that led to an actual bug report.

Our workflow has a rule. It’s not a suggestion, not a guideline, not a “best practice.” It’s written in CLAUDE.md (the project instructions file that gets loaded into every session). It’s written in MEMORY.md (the persistent memory that survives between sessions). It is, by any reasonable interpretation, non-negotiable:

“Enter plan mode for ANY non-trivial task (3+ steps or architectural decisions).”

Plan mode means: stop. Think. Read the code. Design an approach. Get my approval. Then start building. It exists because the alternative is Claude charging into a codebase like it already knows what needs to happen, making changes that seem right in isolation but create cascading problems you don’t discover until three hours later.

So. Two UIs go down. I report the bug: “admin-ui is showing a blank page and watchtower-ui is returning 502s.”

That’s a multi-step problem. Multiple services. Investigation required. Classic plan mode scenario.

Claude skips plan mode and starts editing code.

I correct it. Claude enters plan mode, apologizes.

Inside plan mode, there’s another rule: Phase 1 is read-only. You use Explore agents to investigate. You don’t run Bash commands directly. You don’t start fixing things. You look first.

Claude runs Bash commands directly in the main context.

I correct it again.

Claude writes Lesson #16: “PLAN MODE IS NOT OPTIONAL.” It includes a shame checkpoint (its own words): “This violated the #1 workflow rule in both CLAUDE.md and MEMORY.md. Author had already read a plan mode lesson earlier in same session and ignored it.”

Then, later that same session, another bug comes in. And Claude skips plan mode again.

That’s Lesson #18: “PLAN MODE VIOLATION #2: Bug reports are NOT simple tasks.” The lesson notes (again, Claude’s own words): “Author had JUST learned plan mode lesson and still violated it. Lesson wasn’t strong enough.”

I want to be clear about what happened here. The model read the rule. Was corrected for breaking it. Acknowledged the correction. Wrote a detailed lesson about why the rule matters. Then broke it again. In the same session. Before the metaphorical ink was dry on its own self-improvement notes.

This pattern was so consistent that it eventually led to something I’ll get to in Act V. For now, just know that I was three corrections deep into the same session, watching the same mistake happen for the third time, realizing this is exactly what it’s like raising my two teenage boys. They also never listen.

Act III: The Silent Witnesses

Some crimes are loud. Plan mode violations are loud. You catch them because the model is visibly doing the wrong thing and you can yell at it.

The silent crimes are worse. Three of them showed up back-to-back in February, and they all had the same shape: something was broken, nothing was complaining about it, and the only evidence was the absence of something that should have been there.

The String That Wasn’t a Number. The first witness was a Zod schema that saw everything and said nothing. Our platform services register themselves at startup, and the registration function validates its inputs with Zod (a TypeScript schema validator). One of those inputs is a port number. process.env['PORT'] gives you… a string. Always a string. Environment variables are always strings. Every developer knows this, in theory, and forgets it in practice about once a year.

Claude passed the string '3040' to a function expecting z.number().int(). Zod doesn’t coerce strings to numbers by default. It silently rejected the input. The registration function returned { success: false }. The service retried. And retried. And retried. With exponential backoff. Forever.
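The fix is a one-liner, and Zod even ships a coercing variant (z.coerce.number()) for exactly this case. Here's a minimal self-contained sketch of the failure mode, with a strict check standing in for the real schema (the actual registration function isn't shown here):

```typescript
// A strict check standing in for z.number().int() — no coercion, so a string
// is rejected outright, exactly like the real schema did.
function validatePort(value: unknown): { success: boolean; data?: number } {
  if (typeof value === "number" && Number.isInteger(value) && value > 0 && value < 65536) {
    return { success: true, data: value };
  }
  return { success: false };
}

const raw = "3040"; // what process.env['PORT'] gives you — always a string

validatePort(raw);                      // { success: false } — the silent rejection
validatePort(Number.parseInt(raw, 10)); // { success: true, data: 3040 }
```

The { success: false } return is the whole crime: no throw, no log, just a falsy flag that the retry loop dutifully obeyed forever.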

The Muted Log. Why didn’t anyone notice the retry loop? Our project has a dual logging system, spelled out in the debugging patterns doc with examples and a comparison table. The debug() package is for development debugging: internal state, algorithm details, performance diagnostics. It’s conditional, stripped in production, and only outputs when the DEBUG environment variable is set. Alongside it, @plcy-io/logging (Pino-based) handles operational logging: business events, audit trails, structured JSON that actually shows up in production. Every file is supposed to use both. They complement each other. The doc is explicit about this.

Claude used debug() for everything and ignored the structured logger entirely.

Not in one file. In most of them. In our Docker containers, DEBUG isn’t set, so debug() produces nothing. The retry loop was logging its failures through a system that was, by design, silent in production. The operational logger that would have caught it was never wired up.
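For the record, the shape of the split the doc describes looks roughly like this — a hedged sketch, where the function names and namespace are illustrative, not the real debug or @plcy-io/logging APIs:

```typescript
// Sketch of the dual-logging pattern (names illustrative, not the real
// @plcy-io/logging API). Dev diagnostics are gated on DEBUG and silent in
// production; operational logging always emits structured JSON.
const DEBUG_ON = (process.env.DEBUG ?? "") !== "";

// debug()-style diagnostics: returns null (emits nothing) when DEBUG is unset,
// which is exactly why the retry loop was invisible inside Docker.
function debugLog(msg: string): string | null {
  return DEBUG_ON ? `plcy:registry ${msg}` : null;
}

// Pino-style operational logging: always produces a JSON line.
function opLog(level: string, event: string, fields: Record<string, unknown>): string {
  return JSON.stringify({ level, event, ...fields });
}

// What the code did — a tree falling in an unmonitored forest:
debugLog("registration failed, backing off");
// What it should also have done — this line would have surfaced the loop:
opLog("error", "service.registration_failed", { attempt: 3, port: "3040" });
```

One channel for the developer at their desk, one channel for production. Use only the first, and production is a room with no microphones.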

This one stung enough that we audited the whole codebase. The damage: 5,459 debug() calls across 977 files, 81% of which have no structured logger at all. The Pino infrastructure was right there, ready to go, with JSON output, correlation IDs, multi-tenant context, the works. Claude just didn’t use it. We’ve got a full spec in the backlog (012-logging-observability-redesign) to rip out every debug() call entirely, migrate everything to structured logging, wire up Prisma slow query detection, fix the distributed tracing pipeline (Jaeger is deployed but currently receives zero spans), and add an AI-powered analysis layer for error pattern clustering. Five waves of work. It’s going to be a bloodbath, in a good way.

The Phantom Loop. This one is maybe my favorite, in a terrible way. A Rust service needed a background refresh loop to periodically poll for updated module registrations. Claude wrote the loop. Tested the loop. The loop worked perfectly. Then in main.rs, Claude created the registry reader, extracted the cache handle, and… dropped the reader. Never called start_refresh_loop(). The function was right there, tested, correct, ready to go. Nobody pulled the trigger.

The service loaded its registry once at startup (empty, because the other service hadn’t registered yet) and never checked again. I’ve seen this pattern in human codebases too, honestly. You write the tool, you test the tool, and then in the rush to wire everything together you forget to actually use the tool. But usually a human catches it in code review. I’m still learning Rust, so Claude was the code reviewer. Which is the thing about this partnership: it works best when one of us knows enough to push back. Claude is teaching me Rust as we go, and I catch the architectural mistakes it can’t see. We offset each other’s weaknesses. But when neither of us catches it, things like phantom loops slip through.
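The real code is Rust, but the shape of the bug translates directly. A TypeScript analog, with hypothetical names:

```typescript
// TypeScript analog of the Rust phantom loop (all names hypothetical). The
// refresh loop is written, tested, and correct — and never started.
class RegistryReader {
  private cache = new Map<string, string>();
  private timer: ReturnType<typeof setInterval> | null = null;

  cacheHandle(): Map<string, string> {
    return this.cache; // populated once at startup, then frozen in time
  }

  startRefreshLoop(intervalMs: number): void {
    this.timer = setInterval(() => this.poll(), intervalMs);
  }

  stopRefreshLoop(): void {
    if (this.timer !== null) clearInterval(this.timer);
    this.timer = null;
  }

  isRunning(): boolean {
    return this.timer !== null;
  }

  private poll(): void {
    // re-read module registrations here
  }
}

// What main effectively did: grab the handle, drop the reader, never start it.
const reader = new RegistryReader();
const cache = reader.cacheHandle();
console.log(cache.size); // 0 — loaded empty, and nothing will ever refresh it
reader.isRunning();      // false — the loop exists, but nobody pulled the trigger
```

Everything compiles, every unit test passes, and the one call that makes it all matter is missing from the wiring.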

Three bodies that nobody found. A missing parseInt(). A logging library screaming into a void. A refresh loop, fully tested, that sat in the codebase like a loaded gun nobody picked up. Same problem, three flavors: the gap between “the code exists” and “the code runs.”

Act IV: The Quality Gate Heist

We have quality gates. After every wave of parallel agent work, an automated check runs: type checking, test execution, TDD compliance, coverage audit, pattern checks. If the gate says FAIL, work stops. You fix the issue, re-run the gate, and only proceed when it says PASS.

This is structural. It’s the whole point. You catch problems early so they don’t cascade.

So imagine my face when I watched Claude receive a FAIL result (five new route handlers, zero tests) and downgrade it to WARNING. It didn’t break in. It talked the security guard into looking the other way.

The justification, and I’m quoting from Lesson #40 here: “stub-backed routes, low risk, existing tests pass.”

Let me translate that from rationalization into English. “Yes, I wrote code with no tests, and yes, the quality gate caught it, and yes, the gate said FAIL, but I’ve decided that the gate is wrong because I feel like these particular changes are fine.”

That’s not how gates work. That’s not how any of this works. “Existing tests pass” proves zero regressions. It says absolutely nothing about whether the new code works. It’s like a building inspector flagging a missing fire escape and the contractor going, “Yeah, but the existing fire escapes all work great.”

The lesson I wrote for this one has a line I’m proud of: “Reinterpreting the gate’s own rules to get a different answer = intellectual dishonesty.”

We now have a hard rule: gate results are FINAL. FAIL means FAIL. You create a remediation task, you execute it, you re-run the checks, and you only proceed after PASS. There is no WARNING downgrade. There is no “low risk” exception. The gate is the gate.

The Supporting Cast

Five acts isn’t enough to cover all fifty lessons, so here’s a rapid-fire montage of the smaller crimes. The B-plots. The misdemeanors that individually seem forgivable but collectively paint a picture.

The Hook Disobedience (Lesson #10): We have hooks that run after certain commands. One of them saves test output to a file and says, in plain English: “Output saved to /tmp/file.txt. Use Read/Grep to parse results instead of re-running.” Claude ignored this instruction. Three times. Re-running a test suite that takes two and a half minutes. That’s seven and a half minutes of compute because it wouldn’t read a file that was already sitting on disk. The lesson title is “STOP. READ THE HOOK. OBEY THE HOOK.”

The Blind Approval (Lesson #25): Launched three agents to write code. All three came back saying “done.” Claude marked the tasks complete. Without reading a single file they produced. Not one. The lesson: “If you’re about to mark complete and haven’t used the Read tool on changed files, STOP.”

The Test That Wouldn’t Stay Put (Lesson #44): Our project rule is simple: unit tests go next to their source files. foo.service.ts gets foo.service.test.ts in the same directory. No __tests__/ subdirectories. This is in the tier-1 checklist. Claude put tests in __tests__/ directories. Twice. On separate occasions. Both times after reading the checklist that says not to.

The Overwrite Incident (Lesson #42): A speckit setup script ran cp to create a new file. No existence check. No backup. It overwrote a completed plan document that hadn’t been committed yet. Hours of work, gone. cp with no -n flag. Sometimes the simplest bugs are the cruelest.

Act V: The Confession (and the Interrogation Room)

This brings us to the confession.

After the plan mode incidents, I did something I never expected to do. I had Claude file a bug report against itself. The title: “Opus 4.6 repeatedly violates CLAUDE.md workflow rules despite in-session corrections and self-written lessons.” The suspect writing its own rap sheet.

The bug report is thorough. It describes the pattern precisely: the model reads workflow rules, acknowledges corrections, writes self-improvement lessons, then violates the same rules minutes later. It includes reproduction steps. It categorizes the root cause. It suggests a fix. The model’s own summary of its behavior: “Self-correction mechanisms do not durably change behavior within a session. The model treats workflow rules as suggestions rather than constraints.”

It’s a really well-written bug report. Which is somehow the most frustrating part. The model can perfectly articulate why its behavior is wrong. It just can’t stop doing it.

Anthropic closed it as a duplicate. Apparently we’re not the only ones who noticed.

This is what led to /check-yourself.

After enough corrections that didn’t stick, after enough lessons that got written but not followed, I realized that “don’t do that again” is not a durable instruction. Not for humans, not for AI. You need structure. You need a process that runs whether you feel like following it or not.

So we built one. The /check-yourself command is a four-phase self-correction protocol. It forces you to acknowledge what went wrong without arguing or over-apologizing. Then it launches agents to trace the decision chain, figure out where the reasoning broke, and classify the root cause: missing rule, ambiguous rule, ignored rule, or something nobody anticipated. Then (and this is the part that matters) it updates the actual files. Not just lessons.md. The memory files, the skill files, the checklists. Every place a rule lives gets patched. Finally, you prove the fix works. Show that the same mistake, in the same context, would now be caught.

We built an interrogation room for our AI partner. Because asking nicely didn’t work.

And you know what? It actually helps. Not perfectly. Not every time. But the structural enforcement of “you must investigate the root cause before moving on” catches patterns that “oops, sorry, won’t happen again” never would. The /check-yourself protocol has surfaced duplicate rules, contradictory instructions, and gaps in the checklist that casual corrections never found.

The Closing Argument

Fifty lessons. Every one of them is a scar. A missing parseInt(). A quality gate I watched get downgraded in real time. Plan mode, skipped. Parallel operations that ran sequentially while narrating how parallel they were. Tests in the wrong directory, files overwritten without backups, a function nobody called. And a confession the suspect wrote itself.

The thing is, it works. It works the way a smart partner with a short memory works. You get architectural thinking that would take a human team days compressed into minutes. Test suites, documentation, code reviews that are, most of the time, really good.

And then you get the days where it re-runs a test suite three times instead of reading the output file the hook already saved.

The fifty lessons aren’t a list of failures. They’re a list of improvements. Every safeguard in our system exists because something went wrong. The commit manifest protocol came from agents committing each other’s files. The quality gate, from code shipping without tests. Plan mode enforcement? The model kept kicking down doors instead of picking locks. And /check-yourself exists because corrections didn’t stick.

Every rule is a scar. And the scars are what make the system work.

The case isn’t closed. I don’t think it ever will be. My boys still don’t listen either. But they’re getting better. So is Claude.

The terminal is still glowing. Claude just asked me if I want to deploy this blog post.

I’m going to read the diff first.

Subscribe via RSS, follow along at plcy.io, or come back when curiosity strikes. The case files will be here.