We Explored Why AI Agents Fail on Real Codebases. The Biggest Problem Happens Before Execution.

We started by trying to figure out why AI agents feel unreliable in production.

The obvious problems are infrastructure: auth, permissions, observability, reliability. At 85% per-action accuracy, a 10-step workflow succeeds roughly 20% of the time. Those are real problems and they're being worked on by good teams.

But what we kept seeing, across different agents and different codebases, was something simpler: agents fail before they even start.

Point an AI coding agent at a 10,000-file monorepo and tell it to fix a bug. It will read files. Lots of files. It will trace imports, scan directory trees, and build a mental model from raw source. Then it will produce a patch that technically compiles but misses the project's i18n conventions, edits a generated file that gets overwritten on build, and ignores the co-change coupling between two files that always need to move together.

The agent had access to every file. It still didn't know how the project works.

That led us to a different question: what if the biggest lever isn't execution reliability, but orientation? What if the reason agents feel dumb on real codebases is that they start with the wrong understanding of the project?

We built sourcebook to test that hypothesis — and the results were interesting enough to write up properly.

The orientation tax

AI coding agents spend most of their token budget on orientation. Reading files, understanding structure, figuring out where things live. By the time they start reasoning about the actual task, the context window is already crowded with file contents that don't help.

This isn't a model problem. It's a knowledge problem. There's a difference between code (what exists in the repo) and project knowledge (how the repo actually works). Conventions. Dominant patterns. Which files are fragile. Where changes tend to go. What was tried and reverted. Which files always change together even though they live in different directories.

THE KNOWLEDGE STACK

Project Knowledge conventions, patterns, constraints, workflow rules

The missing layer

Runtime Tools MCP, function calling, dynamic file access

Code repo dumps, file concatenation, raw source

^ The missing layer

"More context" is not "better context"

The intuitive solution is to give the agent more information. Dump the whole repo. Concatenate everything. Let the model sort it out.

This actively makes things worse. The ETH Zurich AGENTS.md study (February 2025) tested auto-generated context files on real agent benchmarks and found that files restating obvious information degraded agent performance by 2-3% compared to no context at all. The mechanism is straightforward: obvious context competes for attention with task-relevant information. Every token spent confirming that the project uses TypeScript is a token not available for understanding the i18n conventions the agent actually needs.

The strongest context is not the biggest context.

The experiment

We wanted to know what actually helps. Not in theory. On real issues, with real patches, measured against real outcomes.

Setup

We pulled closed GitHub issues with merged PRs from three repos:

● cal.com — TypeScript monorepo, 10,453 files. Enterprise scheduling platform.
● pydantic — Python library. Core data validation.
● hono — TypeScript web framework. Minimal, fast.

We tested four context conditions:

None — no context file at all. Just the agent, the issue description, and the codebase.
Handwritten — a CLAUDE.md written by a human who understands the project.
Repomix — the repo concatenated into a single file. The current market default at 22K GitHub stars.
sourcebook — our tool, tested at v0.3 and v0.5.

Same model (Claude Sonnet) for every run. Same prompt template. Same maximum turn count. We measured wall-clock time, files touched, and lines changed — then manually reviewed each patch for correctness and completeness.

What happened

BENCHMARK OUTCOME MATRIX 4 TASKS / 5 CONDITIONS

Task	None	Handwritten	Repomix	sb v0.3	sb v0.5
cal.com #27298 OAuth i18n strings	241s / 5f / 381L	120s / 3f / 321L	176s / 3f / 283L	140s / 3f / 350L	136s / 6f / 469L
cal.com #27907 PayPal i18n strings	94s / 5f / 314L	115s / 5f / 362L	—	103s / 5f / 390L	113s / 6f / 458L
hono #4806 Request body caching	253s / 1f / 31L	323s / 2f / 86L	—	—	267s / 2f / 101L
pydantic #12715 JSON schema fix	312s / 1f / 99L	180s / 2f / 83L	—	—	308s / 2f / 117L

s = secondsf = files touchedL = lines changedgreen = best speed

Handwritten was fastest on cal.com #27298 by a wide margin. The human brief encoded the exact i18n workflow: use useLocale(), call t("key"), add translation keys to the common locale file. The agent didn't have to discover any of this — it was handed the recipe.

sourcebook v0.5 was 3% slower than v0.3 on time but touched 6 files and produced 469 lines — a 36% broader patch than the handwritten condition. It found instances the handwritten brief didn't flag.

On cal.com #27907, no context was actually fastest. The task was straightforward enough that orientation overhead was low. But more notably: sourcebook v0.5 (113s) was faster than the handwritten brief (115s). First time we saw that. And it produced a broader patch — 6 files, 458 lines vs. 5 files, 362 lines.

On hono and pydantic, different dynamics emerged. The handwritten brief was slowest on hono — the extra context sent the agent down a more thorough path, which took longer but produced a more complete fix. sourcebook split the difference: faster than handwritten, more complete than no-context.

Why handwritten briefs won (at first)

Handwritten briefs are a monster baseline. A human who understands a project encodes something remarkably dense in a few hundred words: not just what the code looks like, but how the project works. The workflow. The conventions. The knowledge that takes weeks to absorb from code alone.

INSIGHT COMPARISON

Handwritten Brief

>useLocale() + t("key") for all strings
>Keys in packages/i18n/locales/en/common.json
>paypal_setup_* naming convention
>Auth flow touches two specific directories
>Don't edit prisma files directly

Workflow recipes

sourcebook v0.3

>Hub: types.ts imported by 183 files
>14 generated files — do NOT edit
>Circular dep: bookingScenario / getMockRequestData
>Co-change: auth/provider ↔ middleware (88%)
>1,907 dead code candidates

Structural intelligence

The handwritten brief had workflow recipes. sourcebook had structural intelligence. The workflow recipes were more immediately useful for these specific tasks. Repo dumps show the code. Handoffs explain the project.

What sourcebook was missing

sourcebook v0.3 was good at telling agents what NOT to touch and where the structural risks were. It was bad at telling agents what TO DO and how to do it.

Structural intelligence matters — knowing the hub files, the circular deps, the generated file traps. But it's not enough. An agent that knows which files are important still needs to know: when the project needs i18n, you use this exact pattern. When you add a new API route, these are the files that need to change together.

This is the difference between a map and directions. sourcebook got better by losing first.

What changed: v0.4.1 and v0.5

We added three capabilities based directly on what the benchmarks exposed:

1. Dominant pattern detection (v0.4.1)

sourcebook now scans for the recurring code patterns that define how a project works. Not framework detection — dominant patterns are the project-specific conventions that sit on top of frameworks:

● i18n pattern: uses useLocale() + t("key"), translation keys in common.json
● Auth pattern: session handling via packages/auth/, middleware chain
● Validation pattern: Zod schemas co-located with route handlers
● Database pattern: Prisma queries in packages/db/queries/, never raw SQL in routes

2. Repo-mode detection (v0.5)

Not every repo is the same kind of project. v0.5 detects the repo mode and adjusts what it surfaces:

● App mode: emphasizes routes, pages, workflows, env vars
● Library mode: emphasizes public API surface, test patterns, backwards compatibility
● Monorepo mode: emphasizes package boundaries, shared deps, cross-package coupling

3. Quick reference (v0.5)

A new section at the top of the generated file: a 30-second handoff covering the absolute essentials. Stack. Key commands. The one thing most likely to go wrong. Designed for the first 10 seconds of agent orientation.

PROGRESSION: cal.com AVG — TIME (s) AND PATCH LINES

The numbers after the changes

On cal.com #27298: sourcebook v0.5 closed to within 6% of handwritten speed (136s vs. 120s) while producing a 46% broader fix (469 lines vs. 321 lines). The fix touched more files because it found untranslated strings the handwritten brief didn't mention.

On cal.com #27907: sourcebook v0.5 beat handwritten on speed (113s vs. 115s). First time. Small margin, but directionally meaningful.

Across the board, sourcebook v0.5 consistently produced broader patches — finding more instances of issues than any other condition.

What we learned

1. Handwritten briefs encode workflow knowledge that tools struggle to match

A human who understands the project doesn't describe the code — they describe how to work with the code. This is a fundamentally different kind of information. Tools are getting closer but there is still a gap, especially on project-specific workflows with no standard pattern.

2. Structural intelligence and workflow intelligence are complementary

sourcebook's structural analysis (hub files, co-change coupling, generated file traps) caught things handwritten briefs missed entirely. The 14 generated files in cal.com? The handwritten brief didn't mention those. The 88% co-change coupling? Not in the brief. The 1,907 dead code candidates? Invisible to a human writing from memory. The ideal context file encodes both.

3. The discoverability filter is real

The ETH Zurich finding held up in our tests. Repomix (full repo dump) performed worse than a targeted handwritten brief and sometimes worse than no context at all. Agents don't need more code to read. They need the non-obvious things the code doesn't tell them.

4. Benchmarking agent context is genuinely hard

One of our pydantic tasks was a scoring artifact — sourcebook correctly identified a no-op and we initially scored that as failure. Patch size is not quality. Speed is not correctness. We're still iterating on evaluation methodology, and we think anyone claiming clean benchmark numbers on agent context should show their scoring rubric.

Implications for coding agents

Project knowledge should be a first-class input. Not an afterthought, not a nice-to-have. It should be loaded alongside the task description, before the agent starts reading files. The difference between 241s and 120s on the OAuth task was entirely explained by project knowledge.

Concise beats comprehensive. sourcebook's output for cal.com (10,453 files) is 858 tokens. That's not a compromise — it's the point. Every finding passed a discoverability filter. The 19,680x reduction vs. Repomix is a feature, not a limitation.

Humans should still edit the output. sourcebook generates a starting point, not a finished product. The best context files we've seen are machine-generated, human-reviewed.

Static analysis and git forensics reveal things humans forget. Nobody remembers all 14 generated files. Nobody manually tracks co-change coupling percentages. Nobody counts dead code candidates. The structural layer is where automation has a genuine edge over handwritten briefs.

Limitations and open questions

We want to be upfront about what this is and isn't.

Small benchmark sample. Four tasks across three repos is enough to see patterns but not enough to make statistical claims. We're expanding the benchmark suite and the full data is in the repo. We'd welcome others reproducing these results.

Correctness scoring is still evolving. Evaluating whether a patch is "right" is harder than measuring whether it's fast or big. We're working on better correctness rubrics, but this is an open problem across the field.

Language coverage is early. TypeScript, Python, and Go are supported. Rust, Java, Ruby, PHP are not yet analyzed for dominant patterns.

We built the tool and ran the benchmark. That's a conflict of interest and we know it. The benchmark code is open. The issues and PRs are all public. We've tried to present the results honestly, including the cases where we lost.