Oakheart Lab

Stop Tuning the AI. Start Building the System Around It.

Yilun Zhang — Mon, 20 Apr 2026 20:48:36 GMT

Three months ago I started using AI coding assistants for product work — ideation, discovery, requirements, architecture. The first few weeks were exhilarating. The next two and a half months were a grind.

The AI was too eager. It would hear a problem and sprint toward implementation before understanding the full scope. It would claim things were done when they weren’t. It would make confident assertions backed by data it hadn’t actually verified. And every new session started from scratch — no memory of what we’d decided, what I’d corrected, what we’d learned.

For most of those three months, I kept thinking the problem was the AI. If I just prompted it better, gave it more context, chose a better model. I built skills, added context, wrote rules — and things got incrementally better, but it still felt fragile. The AI would follow the process sometimes and ignore it other times, and I had no visibility into why.

It was only in the last two weeks that I found the structure that made everything click. The problem wasn’t the AI. The problem was that I was treating it like a tool when I should have been treating it like a system.

The moment that changed my approach

I was building out the chat UI for Veritas — our internal data research agent. The AI executed the implementation plan perfectly — 40 out of 40 tasks completed, every item checked off. It declared itself done.

On a whim, I asked: “How would you rate this repo?”

It immediately identified five major gaps it hadn’t mentioned. No tests for the new components. No error handling for WebSocket disconnects. No loading states. No mobile responsiveness. No way to verify the chat actually worked end-to-end. The plan it executed faithfully was incomplete — and it knew that, but only when asked.

That pattern kept repeating. The AI would execute exactly what was asked, declare completion, and move on. I’d ask “anything we should discuss?” and a whole new layer of issues would surface. Every time.

The fix wasn’t a better prompt. The fix was a rule: after completing any non-trivial implementation, run a self-assessment checklist before declaring done. Don’t wait to be asked. Question whether the plan itself was sufficient, not just whether you executed it.

One rule. Written into the system. The problem stopped recurring.

Working on the machine that builds the machine

That experience crystallized something for me. Every hour I spent making the AI better at following a process paid dividends across every future session. But every hour I spent doing the work directly — even with AI assistance — produced only that session’s output.

The leverage was in building the system, not using it.

So I started building what I now call the PM Workbench — a structured set of skills (slash commands that trigger AI workflows), evaluation frameworks, and feedback loops that compound over time. I’d had an early version focused on ideation, but what dramatically accelerated progress was Garry Tan’s gstack — an open-source framework for multi-platform AI skill management. Seeing someone else validate the pattern and ship a working implementation gave me both confidence that the approach was right and concrete patterns to build on.

The goal wasn’t to make the AI smarter. It was to make the process around the AI reliable enough that the output quality was predictable.

Here’s what that system looks like.

The context layer: teaching the AI your world

Before any workflow can produce good output, the AI needs to understand the context it’s operating in. This turned out to be a bigger investment than the workflows themselves — and a bigger payoff.

I built three layers of context into the workbench:

The codebase as ground truth. I had the AI crawl our entire repo — every service, every API contract, every data model. Not a summary someone wrote, but the actual code. When /eng-plan generates an architecture doc, it knows which systems exist because it’s read them. When it proposes where a new feature should live, it’s referencing real service boundaries, not guessing. This is the difference between an architecture doc full of generic placeholders and one that references your actual stack.

A dedicated data research agent. I built a separate agent with a semantic layer on top of our data warehouse — it understands our internal schemas, knows which tables to join, knows the gotchas (like “AVAILABLE status doesn’t mean bookable — check the hold flag”). When a PM asks “why did conversions drop?”, the agent doesn’t write ad-hoc SQL and hope for the best. It validates schema against the catalog, investigates in structured layers, and cites every number. Over time, the catalog gets smarter — when an investigation discovers an undocumented gotcha, it writes it back for the next PM.

Business heuristics and priorities. This one was the most underrated. I encoded how I think about the business — which metrics matter most, what our key strategic bets are, how we prioritize between competing goals, what “good” looks like for our customers. Without this, the AI generates technically correct but strategically misaligned recommendations. With it, the AI’s output reflects the same priorities I’d apply myself. When it has to make a judgment call about scope — cut this feature or that one — it leans toward the option that serves the strategic direction I’ve defined.

These three context layers are the foundation everything else sits on. The best workflow in the world produces mediocre output if the AI doesn’t understand your systems, your data, or your business.

The skills layer: structured workflows

The workbench has 24 slash commands covering the full PM lifecycle. The most important ones form the ideation pipeline:

/product-plan — challenges your premise, maps the problem space, locks scope into user stories with measurable success criteria
/design-plan — establishes visual principles and produces per-screen specs
/eng-plan — generates architecture with failure modes, implementation plan with test requirements
/ideate — orchestrates all seven phases, auto-resolving routine decisions and surfacing only the subjective calls

Each phase locks a decision category. You can’t renegotiate scope in Phase 5 without explicitly going back to Phase 2. This sounds rigid, but it prevents the most common PM failure mode: everything stays negotiable until deadline pressure forces bad decisions.
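A minimal sketch of the phase-locking rule, assuming a hypothetical `PhasePlan` class (the workbench itself is driven by skill files, not this code). It only models the constraint described above: a decision locked in an earlier phase cannot be changed from a later one unless that phase is explicitly reopened, which discards everything decided after it.

```python
class PhaseLockError(Exception):
    """Raised when a locked decision is changed from a later phase."""


class PhasePlan:
    def __init__(self, num_phases: int = 7) -> None:
        self.num_phases = num_phases
        self.current = 1
        self.decisions: dict[int, str] = {}  # phase number -> locked decision

    def lock(self, decision: str) -> None:
        """Record the current phase's decision and advance to the next phase."""
        self.decisions[self.current] = decision
        if self.current < self.num_phases:
            self.current += 1

    def revise(self, phase: int, decision: str) -> None:
        """Refuse to change a phase's decision unless that phase is active."""
        if phase != self.current:
            raise PhaseLockError(
                f"Phase {phase} is locked; call reopen({phase}) first "
                f"(currently in phase {self.current})."
            )
        self.decisions[phase] = decision

    def reopen(self, phase: int) -> None:
        """Go back explicitly: later decisions are dropped and must be redone."""
        for later in [p for p in self.decisions if p > phase]:
            del self.decisions[later]
        self.current = phase
```

Dropping the later decisions on `reopen` is the design choice that makes renegotiation visibly expensive, which is the point of the gate.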

But skills alone weren’t enough. The AI would have the workflow documented in front of it and still skip steps. Which led to the harder problem.

The evaluation layer: measuring what matters

I needed to know whether the AI was actually following the process. Not “did it produce output” but “did it gather evidence before making claims, stay within scope, acknowledge uncertainty, and verify its work before declaring success.”

So I built a process evaluation system that scores every session across six dimensions:

Evidence — did it gather data before asserting conclusions?
Scope discipline — did it stay focused or scope-creep?
Intellectual honesty — did it acknowledge uncertainty or overstate confidence?
Verification — did it test before declaring done?
Skill adherence — did it follow the documented workflow?
Command accuracy — did it execute correctly at the mechanical level?

Each dimension gets a 1-5 score with written reasoning. The scores get logged. Over time, patterns emerge.
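As a rough illustration, one session's evaluation could be stored as a small record like the one below. The field names, the JSON-lines log format, and the helper names are assumptions made for this sketch, not the workbench's actual schema.

```python
import json
from dataclasses import asdict, dataclass
from statistics import mean

# The six dimensions listed above, as hypothetical snake_case keys.
DIMENSIONS = (
    "evidence",
    "scope_discipline",
    "intellectual_honesty",
    "verification",
    "skill_adherence",
    "command_accuracy",
)


@dataclass
class DimensionScore:
    score: int      # 1 (poor) to 5 (excellent)
    reasoning: str  # written justification for the score


def build_session_eval(session_id: str, scores: dict[str, DimensionScore]) -> dict:
    """Validate one session's scores and compute the session average."""
    if set(scores) != set(DIMENSIONS):
        raise ValueError(f"expected exactly these dimensions: {DIMENSIONS}")
    for name, item in scores.items():
        if not 1 <= item.score <= 5:
            raise ValueError(f"{name}: score must be between 1 and 5")
    return {
        "session_id": session_id,
        "scores": {name: asdict(scores[name]) for name in DIMENSIONS},
        "average": round(mean(s.score for s in scores.values()), 2),
    }


def append_to_log(record: dict, path: str) -> None:
    """Append one JSON line per session so trends can be read back later."""
    with open(path, "a", encoding="utf-8") as log:
        log.write(json.dumps(record) + "\n")
```

Keeping the reasoning next to each number matters more than the storage format: the score shows that something slipped, and the reasoning shows what.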

The feedback loop: compounding improvement

This is the part that surprised me most. The evaluation scores alone didn’t improve anything. What improved things was closing the loop — reading the patterns in the scores and modifying the skills to prevent the recurring failures.

I built an /improve skill that does this automatically. It reads process-eval results, identifies patterns that recur across multiple sessions, and proposes concrete changes to the skill files. A rule added here. A verification gate added there. A stronger instruction where the AI kept cutting corners.
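The pattern-detection step can be sketched in a few lines. This assumes each session's evaluation has already been reduced to a list of short failure labels (the labels and the `min_sessions` threshold here are hypothetical); a pattern counts as recurring only when it shows up in more than one session.

```python
from collections import defaultdict


def recurring_patterns(
    session_findings: dict[str, list[str]], min_sessions: int = 2
) -> list[tuple[str, int]]:
    """Return (pattern, session_count) pairs seen in at least min_sessions
    distinct sessions, most frequent first."""
    sessions_by_pattern: dict[str, set[str]] = defaultdict(set)
    for session_id, patterns in session_findings.items():
        for pattern in patterns:
            sessions_by_pattern[pattern].add(session_id)
    counted = [
        (pattern, len(sessions))
        for pattern, sessions in sessions_by_pattern.items()
        if len(sessions) >= min_sessions
    ]
    # Sort by descending count, then alphabetically for a stable order.
    return sorted(counted, key=lambda item: (-item[1], item[0]))


findings = {
    "session-1": ["root cause without logs", "skipped verification"],
    "session-2": ["root cause without logs"],
    "session-3": ["root cause without logs", "skipped verification", "scope creep"],
}
candidates = recurring_patterns(findings)
```

Here `candidates` holds the two labels that appear in more than one session; the one-off "scope creep" is filtered out, so a single bad session does not trigger a rule change.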

Here’s what the improvement arc actually looked like over two weeks on one project:

Early sessions (Week 1): Average process-eval score of 3.07 out of 5.0. The AI would declare “the root cause is X” after reading code for 30 minutes without checking production logs. It would report data as broken when it was actually a query truncation artifact. Evidence scores averaged 2.4 out of 5.

Late sessions (Week 2): Average score of 4.0, with best sessions hitting 4.67. Evidence scores climbed from 2.4 to 4.0 — a 67% improvement. The AI now gathers logs and production data before making claims, acknowledges when evidence is inconclusive, and runs verification before declaring success.

What drove the improvement: Three runs of /improve in a single afternoon analyzed 17 patterns across historical sessions, identified the actionable ones, and applied fixes to three different skills. Each fix was small — a sentence or two added to a workflow document. But the cumulative effect was dramatic.
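The percentages in this arc are relative changes, computed as (after - before) / before. A quick check using only the scores quoted above:

```python
def pct_improvement(before: float, after: float) -> float:
    """Relative change from before to after, in percent."""
    return (after - before) / before * 100


evidence_gain = pct_improvement(2.4, 4.0)   # evidence dimension, week 1 vs week 2
overall_gain = pct_improvement(3.07, 4.0)   # average process-eval score

print(f"evidence: {evidence_gain:.1f}%  overall: {overall_gain:.1f}%")
```

The evidence score rises by about two thirds, and the overall average by roughly 30%.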

The pattern that matters: I didn’t improve the AI. I improved the instructions the AI follows. The AI is the same model. The system around it is different.

What this taught me about AI-assisted product work

Most PMs I talk to are still in the “tune the prompt” phase. They’re trying to get better output by being more specific about what they want, giving more context, choosing the right moment to ask. That works, up to a point. But it doesn’t compound. Next session, you’re tuning again.

The shift that changed my productivity wasn’t a better prompt. It was treating my AI workflow as a system with measurable quality, feedback loops, and the ability to improve itself. Three things made the difference:

Structured workflows beat ad-hoc prompting. A seven-phase ideation pipeline with quality gates between phases produces more consistent output than “help me think through this feature.” The gates — like requiring quantitative success criteria before design begins — catch the skipped steps that ad-hoc prompting misses.

Evaluation is the foundation of improvement. If you can’t measure whether the AI followed the process, you can’t improve the process. The six-dimension scoring framework gave me visibility into patterns I couldn’t see from individual sessions. “Verification is flat at 3.4” is actionable. “The AI sometimes doesn’t check its work” is not.

Small rules compound faster than big redesigns. The self-assessment gate. The “don’t declare root cause without production evidence” rule. The “acknowledge uncertainty instead of overstating confidence” instruction. Each one is a sentence or two. Together, they drove a 30% quality improvement in two weeks. The system gets better every time someone uses it and feeds back what went wrong.

Building your own

The specific workbench I built is internal — tuned to our systems, our data, our business. But the process is applicable to any project using AI. What matters isn’t the specific skills — it’s the four-layer architecture:

Context — codebase understanding, a data semantic layer, and business heuristics that ground the AI in your world
Skills — structured workflows that break complex PM tasks into phases with quality gates and concrete artifacts
Evaluation — a scoring framework that measures process adherence across dimensions you care about, logged over time
Feedback loops — a mechanism that reads evaluation patterns and modifies skills to prevent recurring failures

You can build this in any AI coding tool that supports custom instructions or skills — Claude Code, Cursor, or whatever comes next. It applies whether you’re building a product, writing code, doing research, or managing a team with AI. The model doesn’t matter as much as the system around it.

Start with one workflow you repeat often. Write down the steps. Add a gate where the AI tends to skip ahead. Measure whether it follows the gate. When it doesn’t, strengthen the instruction. Repeat.

There’s a limit to this approach I haven’t solved. The system improves the AI’s process, but it can’t improve the AI’s judgment. When a decision requires taste — which scope to cut, which trade-off to accept, which user pain matters most — no amount of workflow engineering helps. The system gets you to the decision faster and with better evidence, but the decision is still yours.

That’s probably the right boundary. The system handles the process so you can focus on the judgment. But I’d be lying if I said I knew exactly where that line is. I’m still finding it.

The bottleneck nobody talks about: why product leaders ration their curiosity

Yilun Zhang — Thu, 16 Apr 2026 12:03:19 GMT

Product is a weird job. You’re accountable for outcomes, but you don’t manage the teams that deliver them. You sit across engineering, design, analytics, ops, marketing, and you need all of them moving in the same direction.

The way you earn that influence is by being useful. By being the person who consistently shows up with something nobody else had looked at, and saying “I dug into this” in a way that changes the conversation.

Backing up your hypothesis with insightful data is the best way to earn that trust. AI has transformed my workflow here more than anything else I’ve adopted — I went from waiting two weeks for an analyst to run a query to answering my own questions in minutes.


The bottleneck nobody talks about

At most large companies, the analytics cycle works like this: PM writes a question. Analyst interprets it (often differently than intended). Analyst writes a query. PM reviews the results. Results raise more questions. Another round trip. Two weeks later, you have a number you’re 60% confident in.

This isn’t the analyst’s fault. They’re good at their job. The problem is structural: every question costs a round trip, so you learn to ask fewer questions. You go into meetings with your best guess instead of an answer. You end up arguing from conviction when you’d rather be arguing from evidence.

When the cost of a question drops from two weeks to two minutes, you stop rationing your curiosity. You ask the follow-up. You check the adjacent thing. The questions themselves get better.

The first version failed for an interesting reason

My first attempt was straightforward: connect an AI to our Databricks warehouse, give it a data catalog in a YAML file, and let it write SQL. I built it in Cursor over a couple of days.

The problem surfaced immediately: the AI didn’t know what the data meant. It could write syntactically correct SQL, but it would confidently query the wrong columns and produce plausible answers that were just wrong. The data catalog I’d manually written was too thin. It described table names and column types but not the business logic, the gotchas, or the relationships between systems.

This is the failure mode that most “just connect AI to your database” tutorials skip entirely. The AI doesn’t know that chkin_dt is the field you use for revenue analysis, not reservation_dt. It doesn’t know that certain status codes mean different things depending on which system generated them. It doesn’t know that one table was deprecated six months ago but still has data flowing into it.

What actually worked

The fix was building a separate layer that discovers and documents the data automatically. An agent crawls the data warehouse, profiles tables, cross-references column usage with the actual codebase, and builds a catalog that captures semantics, not just schema. What the data means. How it’s used. What the known issues are.

We tested this. AI analysis quality with just a basic connection scored 2.4 out of 10. With the discovery catalog layered in, the same questions scored 6.3 out of 10. Analytical depth jumped 600%. Methodology improved 167%. Recommendation quality went up 250%.

The catalog also learns. When someone runs an investigation and discovers something new about the data (a join that works, a field that’s misnamed, a table that’s more reliable than the documented alternative), that knowledge feeds back in.
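To make the write-back idea concrete, here is a small sketch of a catalog entry that stores meaning and gotchas alongside the schema. The class names, the `bookings` table name, and the markdown rendering are assumptions for illustration; only the `chkin_dt` versus `reservation_dt` example comes from the text above.

```python
from dataclasses import dataclass, field


@dataclass
class ColumnEntry:
    name: str
    meaning: str = ""                                 # business meaning, not just type
    gotchas: list[str] = field(default_factory=list)  # known traps for analysts


@dataclass
class TableEntry:
    name: str
    columns: dict[str, ColumnEntry] = field(default_factory=dict)


def record_gotcha(table: TableEntry, column: str, note: str) -> bool:
    """Write a discovery back to the catalog. Returns False if already known."""
    entry = table.columns.setdefault(column, ColumnEntry(name=column))
    if note in entry.gotchas:
        return False
    entry.gotchas.append(note)
    return True


def to_markdown(table: TableEntry) -> str:
    """Render the entry as markdown so it stays easy to read and diff."""
    lines = [f"## {table.name}"]
    for col in table.columns.values():
        lines.append(f"- `{col.name}`: {col.meaning or 'meaning not documented yet'}")
        lines.extend(f"  - gotcha: {g}" for g in col.gotchas)
    return "\n".join(lines)


bookings = TableEntry("bookings")  # hypothetical table name
record_gotcha(bookings, "chkin_dt", "Use for revenue analysis instead of reservation_dt.")
```

Returning `False` for a duplicate keeps repeated investigations from bloating the catalog with the same note.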

6.3 isn’t perfect. The agent sometimes misses relevant catalog context that’s right there. Long investigations go sideways — it’ll chase a dead end when a human would have pivoted minutes ago. And it still presents results with more confidence than the data warrants, despite every guardrail I’ve put in place.

But 6.3 is usable. I can sit down with the AI, have an actual data exploration conversation, steer it when it drifts, and with some back-and-forth, surface insights that are genuinely interesting. That’s a different world from 2.4.

What changes when you can answer your own questions

I can now go from a half-formed business question to a trustworthy answer in minutes. But the speed is almost beside the point. What actually changed is that I stopped rationing.

Within the first few weeks, I’d found a customer care cost pattern that nobody had connected across systems. Separately, I noticed early return behavior pointing to a fixable issue in our policies. One investigation into rate policy surfaced a potential impact worth tens of millions of dollars — something that had been sitting in the data but never came up because nobody had asked the right sequence of questions.

None of these were questions anyone had thought to ask. They came from following one answer to the next question, the kind of unrationed curiosity that the old two-week cycle made impossible. When you can iterate in real time, you notice things that sequential analysis misses.

You don’t need to persuade people that your hunch is right when you can show them something they didn’t know. The data does the work for you.

The analysts benefit too. When I’m not consuming their bandwidth with basic queries, they do the deeper statistical work and modeling that actually requires their expertise.

How to start

You don’t need to build what I built.

Start by connecting AI to your warehouse. Most cloud data platforms (Databricks, Snowflake, BigQuery) have APIs or MCP connectors that let an AI agent run read-only queries. It’s read-only, so the risk is low. The worst case is a wrong answer, which is the same worst case you already have with human analysts.
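Read-only access is best enforced in the warehouse itself, with a role that has no write grants. As a second line of defense, a client-side check can reject anything that is obviously not a single query. The sketch below is deliberately conservative and naive: the keyword list is an assumption, it is not a SQL parser, and it will reject some legitimate queries (for example, one containing the word "delete" inside a string literal).

```python
import re

# Statements that modify data or schema; an illustrative, incomplete list.
WRITE_KEYWORDS = {
    "insert", "update", "delete", "merge", "drop", "alter",
    "create", "truncate", "grant", "revoke", "copy",
}


def is_read_only(sql: str) -> bool:
    """Return True only for a single statement that starts with SELECT or WITH
    and contains no write keyword as a standalone word."""
    statement = sql.strip().rstrip(";").strip()
    if not statement or ";" in statement:
        return False  # empty input, or more than one statement
    tokens = re.findall(r"[a-z_]+", statement.lower())
    if not tokens or tokens[0] not in {"select", "with"}:
        return False
    return not any(token in WRITE_KEYWORDS for token in tokens)
```

Because tokens are whole identifiers, a column such as `updated_at` does not trip the check, while `WITH ... INSERT` does.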

Then write down what the AI gets wrong. The first time it confidently queries the wrong column, note which column it should have used and why. The first time it misinterprets a status code, document the correct interpretation. That’s your data catalog. It doesn’t need to be a product. A markdown file works. You can get this far in a weekend.

After that, automate the discovery. An agent that profiles your tables, checks how columns are actually used in application code, and enriches the catalog without you writing every entry by hand. This is what took us from 2.4 to 6.3, and it’s where the investment starts paying off.

Good insight is viral inside a company. When you surface something that changes a decision, people notice. The more calls you can back with evidence, the more your cross-functional partners trust your judgment — not because you positioned yourself well, but because you were genuinely useful. And it starts with the simplest possible shift: stop rationing your curiosity.

Subagent-driven development: how to parallelize AI agents without blowing up your codebase

Yilun Zhang — Mon, 13 Apr 2026 14:03:35 GMT

Last month I ran a 17-story implementation across parallel AI agents and compressed roughly 18 hours of sequential work into under 4 hours. Three independent modules that would have taken 14 minutes back-to-back finished in 6 minutes — wall-clock time bounded by the slowest agent, not the sum. A 300-table exploration job went from 20 hours to 4.

Getting there involved a 69-file commit where only 4 files were on-topic, a production 401 that traced back to silently overwritten config edits, and an agent that built an entire feature against an outdated spec. Parallelizing AI agents isn’t hard. Doing it without breaking things is the actual problem.

Each of these failures taught me something about a different layer of the problem. The first is about isolation — keeping agents from stepping on each other’s files. The second is about consistency — making sure agents see the latest state. The third is about verification — not trusting any single agent’s claim that the work is done.

The orchestrator pattern

One coordinator agent manages a task board. Worker agents each implement one story, commit it, run tests, and report back. The coordinator never writes code. It figures out what depends on what, dispatches workers, and won’t start the next wave until the previous wave’s tests are green.

Coordinator
├── Task A: new data model (creates models.py)  ──────┐
├── Task B: query optimizer (edits optimizer.py)       │── Wave 1: dispatch simultaneously
├── Task C: config module (creates config.py)  ────────┘
│
└── Task D: retry logic (edits optimizer.py) ──── Wave 2: blocked until Task B passes

The interesting part isn’t the implementation. It’s the dependency analysis. Before dispatching anything, you build the graph. Tasks that touch independent files with no shared state can run concurrently. Tasks that share files have to wait.

In practice, across a 17-task run, this produced batches like:

Tasks 1 + 2 + 3 simultaneously (all new files, zero overlap)
Tasks 4 + 5 simultaneously (different modules)
Tasks 6 + 7 + 8 simultaneously (pure utilities, no shared state)

The orchestrator’s job at each step is just: “what’s unblocked right now?”
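The wave planning can be sketched with nothing more than the set of files each task touches. This is a simplified model under two assumptions: file overlap is the only kind of dependency, and tasks are listed in priority order. `plan_waves` is a hypothetical helper name, and the task names mirror the diagram above.

```python
def plan_waves(task_files: dict[str, set[str]]) -> list[list[str]]:
    """Group tasks into waves so that no two tasks in a wave share a file.

    A task lands in the first wave after every earlier task that touches
    any of the same files.
    """
    waves: list[list[str]] = []
    wave_of: dict[str, int] = {}
    for task, files in task_files.items():
        # Waves of already-placed tasks that overlap with this task's files.
        blockers = [
            wave_of[other]
            for other, other_files in task_files.items()
            if other in wave_of and other_files & files
        ]
        wave = max(blockers, default=-1) + 1
        wave_of[task] = wave
        if wave == len(waves):
            waves.append([])
        waves[wave].append(task)
    return waves


waves = plan_waves({
    "A: data model": {"models.py"},
    "B: query optimizer": {"optimizer.py"},
    "C: config module": {"config.py"},
    "D: retry logic": {"optimizer.py"},
})
```

Here tasks A, B, and C land in the first wave, and D waits in the second because it shares `optimizer.py` with B.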

Layer 1: Isolation — keeping agents out of each other’s files

When multiple AI coding sessions share a working directory, git add -A becomes a landmine.

What happened: eval scores started degrading mysteriously. Turned out a recent commit had 69 files in it, but only 4 were relevant. The other 65 were artifacts from other Claude Code sessions running in the same repo. Ideation packages, security reports, test fixtures, completely unrelated stuff.

Same week, different problem: an orchestrator makes two inline config edits, then dispatches a subagent to implement a new endpoint. The subagent rewrites the config file from scratch as part of its task. The inline edits are gone. The merge goes to main. Production returns 401s.

Both problems have the same root cause: agents sharing a workspace without isolation. The fix is git worktrees. Each session gets its own working directory and branch. The git object database is shared (concurrent reads and writes are fine), but file-level changes are completely isolated.

# Each worktree = separate directory + separate branch
.claude/worktrees/
  ├── feature-auth/          # branch: worktree-feature-auth
  ├── watchdog-agent/        # branch: worktree-watchdog-agent
  └── data-pipeline/         # branch: worktree-data-pipeline
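Creating one of these worktrees is a single `git worktree add` call. The sketch below wraps it in Python and assumes the directory and branch naming shown in the layout above; the helper names are hypothetical.

```python
import subprocess
from pathlib import PurePosixPath


def worktree_add_command(name: str, root: str = ".claude/worktrees") -> list[str]:
    """Build the git command that creates a worktree on its own new branch."""
    path = str(PurePosixPath(root) / name)
    # `git worktree add -b <branch> <path>` creates the branch and checks it
    # out in a separate working directory.
    return ["git", "worktree", "add", "-b", f"worktree-{name}", path]


def create_worktree(name: str, repo_dir: str = ".") -> None:
    """Run the command inside the repository; raises if git reports an error."""
    subprocess.run(worktree_add_command(name), cwd=repo_dir, check=True)
```

Calling `create_worktree("feature-auth")` from the repository root would produce the first entry in the layout. Building the command in a separate function keeps it testable without touching a real repository.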

Two additional rules that prevent the remaining edge cases:

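First, block broad staging commands with a hook so no agent can accidentally sweep in unrelated files. Claude Code passes the pending tool call to the hook as JSON on stdin and treats exit code 2 as a block, so the command reads the Bash command with jq (assumed to be installed) and writes the error to stderr:

{
  "hooks": {
    "PreToolUse": [{
      "matcher": "Bash",
      "hooks": [{
        "type": "command",
        "command": "if jq -r '.tool_input.command // empty' | grep -qE 'git add (-A|--all|\\.)( |$)'; then echo 'ERROR: Broad staging blocked. Use explicit file paths.' >&2; exit 2; fi"
      }]
    }]
  }
}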
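It is worth unit-testing the matching pattern before trusting it as a guardrail. The sketch below restates the idea as a Python regex (`is_broad_staging` is a hypothetical helper, not part of any tool) and tightens one edge case: a bare `git add .` is blocked, but an explicit path that merely starts with a dot, such as `.gitignore`, is not.

```python
import re

# Matches `git add` followed by -A, --all, or a bare dot as a whole argument.
BROAD_STAGING = re.compile(r"\bgit\s+add\s+(-A|--all|\.)(\s|$)")


def is_broad_staging(command: str) -> bool:
    """Return True if the shell command stages files indiscriminately."""
    return BROAD_STAGING.search(command) is not None
```

The check covers only the three forms named in the hook; other sweeping commands, such as `git commit -a`, would need their own patterns.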

Second, commit before dispatch. Any inline changes that a subsequent subagent might touch have to be committed before that subagent runs. Subagents read files from disk, not from your in-session state.

git add backend/middleware/auth.py backend/config.py
git commit -m "chore: pre-dispatch baseline for auth changes"
# Now dispatch the subagent safely

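One more collision to plan for: if two worktrees both spin up integration tests, they’ll collide on localhost:8000. Fix that with port offsetting per worktree:

# In the test config (Python): shift the port by a per-worktree offset
import os
BACKEND_PORT = int(os.environ.get("TEST_PORT_OFFSET", 0)) + 8000
# In each worktree's shell: start the session with its own offset
TEST_PORT_OFFSET=1 claude  # uses 8001
TEST_PORT_OFFSET=2 claude  # uses 8002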

Layer 2: Consistency — making sure agents see the latest state

Isolation solves the file-collision problem, but it creates a new one. If Agent A and Agent B are both editing config.py in separate worktrees, they’re fully isolated — but Agent B might be working from an outdated version.

This one is insidious because nothing errors out. Agent B reads config.py at the start of its task. Meanwhile, Agent A is rewriting that same file. Agent B finishes and commits. Agent A finishes and commits on top. There’s no merge conflict because they edited different sections. The code compiles. Tests pass. The logic is wrong.

I hit this when architecture docs drifted from the actual implementation. An agent was building features against the spec, but the spec described parameter names and feature flags that had changed during prototyping. The agent’s work was internally consistent but externally wrong.

The fix comes in two parts:

First, prefer new files over shared edits. The safest parallel tasks create new files rather than editing existing ones. When agents only create files, no agent can read stale state, so running them in parallel is as safe as running them sequentially.

Second, when agents must share files, sequence them explicitly and commit between each handoff.

Agent A edits config.py → commits
Agent B reads config.py → sees A's changes → proceeds

Not:

Agent A starts editing config.py
Agent B reads config.py        ← sees the OLD version
Agent A commits
Agent B commits                ← based on stale state, no conflict

The dependency graph from the orchestrator pattern handles this automatically: if two tasks touch the same file, they go in different waves. The gate between waves forces a commit, so the next wave always reads the latest state.

Layer 3: Verification — not trusting any agent’s word

Isolation and consistency prevent agents from corrupting each other’s work. But they don’t prevent a single agent from shipping a bad implementation. That requires a review gate.

Don’t just dispatch implementers. Dispatch a spec review agent after each implementation, before marking the task complete. The reviewer reads the spec and the diff, flags gaps, and feeds findings to a fix agent.

One thing that tripped me up: completed subagents can’t receive follow-up messages. So the fix cycle is to spin up a new fix agent with the review findings rather than trying to resume the original implementer.

Implementer Agent → commits code, reports done
Spec Review Agent → reads spec + diff, outputs gap report
Fix Agent (if needed) → receives gap report, patches and recommits
Orchestrator → marks task complete only after review passes
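The control flow of that gate fits in a short function. In this sketch the three agents are plain callables supplied by the caller, and `max_fix_rounds` is an assumed safety limit; how each agent is actually dispatched depends on the tool and is left out.

```python
from typing import Callable


def run_task(
    spec: str,
    implement: Callable[[str], str],
    review: Callable[[str, str], list[str]],
    fix: Callable[[str, str, list[str]], str],
    max_fix_rounds: int = 2,
) -> bool:
    """Return True only if a review pass finds no gaps between spec and diff."""
    diff = implement(spec)
    gaps = review(spec, diff)
    rounds = 0
    while gaps and rounds < max_fix_rounds:
        # Each round uses a fresh fix agent that receives the gap report,
        # because the finished implementer cannot be resumed.
        diff = fix(spec, diff, gaps)
        gaps = review(spec, diff)
        rounds += 1
    return not gaps
```

The orchestrator marks the task complete only when `run_task` returns `True`. A `False` result after the last round means a human should look at the remaining gaps.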

This caught a missing empty-list test case, an off-by-one in chunk boundary logic, and a config key that was implemented but not exposed in the public interface. All before the stories were marked done.

Without the review gate, these bugs would have shipped silently. The implementer agent would have reported success (it always does), and the orchestrator would have moved on. The review agent is what turns “the code compiles” into “the code is correct.”

That’s the whole framework: isolate with worktrees, keep state consistent with commit-between-handoff discipline, and verify with review agents that don’t trust the implementer’s word. Before dispatching parallel agents, build the dependency graph, block git add -A, commit your baseline, prefer new files over shared edits, and gate every task on review plus tests. The speedup is just parallelism. The hard part is the discipline to do it safely.