Here's a result from our benchmark that we didn't expect.
We asked an AI agent to fix a rendering bug in bubbletea, a Go terminal UI library. We gave it expert-written context about the project — the kind of brief a strong engineer would write for a new teammate. We ran it four times.
The results: 21 seconds. 31 seconds. 44 seconds. 252 seconds.
Same agent. Same model. Same context. Same bug. Three runs found the fix fast. One run wandered for over four minutes.
Then we ran the same task with sourcebook-generated context — a structural map of the codebase extracted automatically from import graphs and git history. Four runs: 34 seconds. 34 seconds. 41 seconds. 57 seconds.
Not faster on the best run. But the bad run never happened.
This pattern repeated across our benchmark. And it changed how we think about what sourcebook actually does.
The experiment
19 real, closed GitHub issues across 10 open-source repos in 4 languages. Each task has a merged PR fix. The harness checks out the codebase to the state before the fix, validates the bug is still present, injects a context file (or none), and runs the agent.
Model: Claude Sonnet 4 (pinned). Max turns: 50. No human intervention. Then repeat runs on the top 5 tasks — 4 runs per condition — to measure variance.
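In code, the harness loop looks roughly like this. A minimal Python sketch; the helper callables (`checkout_pre_fix`, `bug_still_present`, `inject_context`, `run_agent`) are hypothetical stand-ins for the real git and agent plumbing, not the actual harness API:

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class RunResult:
    task_id: str
    condition: str       # "none" | "handwritten" | "sourcebook"
    seconds: float
    files_touched: set   # files the agent edited, for overlap scoring

def run_task(
    task_id: str,
    condition: str,
    checkout_pre_fix: Callable[[str], None],     # checkout commit before the fix
    bug_still_present: Callable[[str], bool],    # re-run the failing repro
    inject_context: Callable[[str, str], None],  # write the context file
    run_agent: Callable[[str], RunResult],       # agent run, capped at 50 turns
) -> Optional[RunResult]:
    """One benchmark run: checkout, validate, inject, execute."""
    checkout_pre_fix(task_id)
    if not bug_still_present(task_id):
        return None  # bug no longer reproduces at this commit; skip the task
    if condition != "none":
        inject_context(task_id, condition)
    return run_agent(task_id)
```

The design point is that context injection is the only step that varies between conditions; everything else is held fixed.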
Three conditions:
- None — just the issue description and the codebase.
- Handwritten — a CLAUDE.md written by a developer who understands the project.
- sourcebook — auto-generated structural context. Hub files from import graph analysis, co-change coupling from git forensics, convention signals. No stack labels, no project structure, no standard commands — only what agents can't discover by reading files.
Repos: cal.com, hono, pydantic, fastapi, gin, chi, bubbletea, clap, next.js, vercel/ai — spanning enterprise monorepos, Python libraries, Go web frameworks, and Rust CLI tools.
What the single runs showed
In the initial benchmark (one run per condition), sourcebook was faster than handwritten context on 13 of 19 tasks. The average gap was 16%. At first glance, this looks like a speed story.
| Task | Lang | None | Handwritten | sourcebook | SB vs HW |
|---|---|---|---|---|---|
| next.js #74843 | TS | 107s | 161s | 30s | -81% |
| bubbletea #1322 | GO | 135s | 252s | 57s | -77% |
| clap #6201 | RS | 192s | 300s | 106s | -65% |
| fastapi #14508 | PY | 278s | 191s | 70s | -63% |
| pydantic #12715 | PY | 210s | 294s | 197s | -33% |
| gin #4468 | GO | 102s | 119s | 81s | -32% |
| vercel/ai #13839 | TS | 177s | 185s | 144s | -22% |
| drizzle #4421 | TS | 179s | 187s | 149s | -20% |
| cal.com #27907 | TS | 103s | 146s | 126s | -14% |
| clap #6275 | RS | 155s* | 209s | 192s | -8% |
| gin #2959 | GO | 184s | 196s | 182s | -7% |
| vercel/ai #13988 | TS | 274s | 221s | 209s | -5% |
| fastapi #14454 | PY | 237s | 261s** | 254s | -3% |
| cal.com #27298 | TS | 148s | 140s | 141s | +1% |
| chi #954 | GO | 258s | 133s | 134s | +1% |
| pydantic #13051 | PY | 222s | 150s | 168s | +12% |
| fastapi #14483 | PY | 221s | 202s | 227s | +12% |
| vercel/ai #13354 | TS | 99s | 167s | 276s | +65% |
| hono #4806 | TS | 197s | 232s | 393s | +69% |
*clap #6275 no-context: agent produced 0 lines (gave up). **fastapi #14454 handwritten: agent edited the wrong file (0% file overlap with reference PR).
But single runs lie. We knew that. So we ran the top 5 tasks four times each.
What the repeat runs showed
The speed advantage mostly disappeared. No individual result was statistically significant at 95% confidence. The variance within each condition is too high for N=4 to detect a reliable speed difference.
But something else appeared in the data.
| Task | Condition | N | Mean | Range | StdDev |
|---|---|---|---|---|---|
| next.js #74843 | sourcebook | 4 | 40s | 30–48s | 7s |
| next.js #74843 | handwritten | 4 | 72s | 29–162s | 62s |
| bubbletea #1322 | sourcebook | 4 | 42s | 34–57s | 11s |
| bubbletea #1322 | handwritten | 4 | 87s | 21–252s | 110s |
| clap #6201 | sourcebook | 4 | 193s | 106–310s | 86s |
| clap #6201 | handwritten | 4 | 194s | 121–300s | 78s |
| gin #4468 | sourcebook | 5 | 97s | 82–110s | 11s |
| gin #4468 | handwritten | 4 | 96s | 80–119s | 17s |
| pydantic #12715 | sourcebook | 3 | 210s | 199–224s | 13s |
| pydantic #12715 | handwritten | 3 | 206s | 194–225s | 17s |
Look at the range column. sourcebook produced the tighter range on 4 of 5 tasks. The means are close. The variance is not.
That bubbletea result — 21 to 252 seconds with handwritten context — is a 12x spread. Same agent, same brief, same bug. The agent sometimes found the right path immediately. Sometimes it wandered for four minutes.
With sourcebook, the range was 34 to 57 seconds. The bad run never happened. Not because the agent got lucky. Because the structural map — hub files from import graph analysis, module boundaries, where the rendering pipeline lives — gave it a path that doesn't depend on luck.
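The variance numbers are easy to check from the raw run times. This short Python snippet recomputes the bubbletea means, ranges, and sample standard deviations reported in the repeat-run table:

```python
from statistics import mean, stdev

# bubbletea #1322 completion times in seconds, one entry per run
handwritten = [21, 31, 44, 252]
sourcebook = [34, 34, 41, 57]

for label, runs in [("handwritten", handwritten), ("sourcebook", sourcebook)]:
    print(f"{label:12} mean={mean(runs):6.1f}s  "
          f"range={min(runs)}-{max(runs)}s  stdev={stdev(runs):6.1f}s")
```

The handwritten condition's standard deviation (~110s) is larger than its mean (~87s): a single wandering run dominates the distribution.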
The mechanism
Agents don't fail because they're slow. They fail because they don't know where to start.
Point an agent at a 10,000-file monorepo and tell it to fix a bug. It will read files. Lots of files. It will trace imports, scan directory trees, try to build a mental model from raw source. Sometimes it finds the right file on the second try. Sometimes it reads 30 files before it gets oriented. That's the variance. That's where the 252-second runs come from.
Handwritten context helps — on average. But human-written briefs describe how a framework works in general. They say things like "tests are in `__tests__`" and "we use tRPC for API routes." That's useful. But it doesn't tell the agent which file is the structural center of the rendering pipeline, or which files always change together, or where the import graph converges.
Structural context is different. It's a map. It says: this file is imported by 52 other files. This file and that file change together 88% of the time. The rendering logic converges here. An agent with a map doesn't explore. It navigates.
That's the mechanism. Not speed. Removal of wasted motion.
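For concreteness, here is a minimal Python sketch of the two structural signals described above: hub files as import-graph in-degree, and co-change coupling mined from commit history. The definitions are simplified illustrations (in particular, the coupling denominator is one plausible choice), not sourcebook's actual implementation:

```python
from collections import Counter
from itertools import combinations

def hub_scores(import_edges):
    """In-degree per file: how many other files import it.

    import_edges is a list of (importer, imported) pairs.
    """
    return Counter(imported for _importer, imported in import_edges)

def co_change(commits):
    """Coupling per file pair: how often two files change in the same commit.

    commits is a list of sets of changed files. We normalize by the less
    frequently changed file of the pair (one plausible definition).
    """
    pair_counts = Counter()
    file_counts = Counter()
    for files in commits:
        file_counts.update(files)
        pair_counts.update(combinations(sorted(files), 2))
    return {
        (a, b): n / min(file_counts[a], file_counts[b])
        for (a, b), n in pair_counts.items()
    }
```

A file with an in-degree of 52 is a hub; a pair with coupling near 0.9 is a "these always ship together" signal. Both are invisible when reading files one at a time, which is exactly why agents can't discover them on their own.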
Three things the data shows
1. Sourcebook reduces variance
Across 5 repeat-run tasks, sourcebook produced the tightest completion range on 4 of them. The agent finds a reliable path instead of sometimes-fast-sometimes-wandering.
next.js: 30–48s (SB) vs 29–162s (HW). bubbletea: 34–57s vs 21–252s. gin: 82–110s vs 80–119s. pydantic: 199–224s vs 194–225s. The pattern is consistent: the structural map eliminates the long tail.
2. Sourcebook prevents failure modes
clap #6275 (Rust): Without context, the agent produced 0 lines — it gave up. With sourcebook, it found and fixed the bug. 86 lines, 100% file overlap with the reference PR. Context was the difference between failure and success.
fastapi #14454: Handwritten context — written by a developer who understands FastAPI — led the agent to the wrong file. 0% file overlap with the reference PR. sourcebook led it to the correct file. 100% overlap. Human knowledge of a framework isn't always enough. The import graph found what the human missed.
These aren't speed differences. They're success vs failure. The agent didn't just wander slowly — it went the wrong direction entirely.
3. Sourcebook matches handwritten context — without manual effort
On repeat runs, sourcebook's mean completion time is within noise of handwritten context on every task. gin: 97s vs 96s. pydantic: 210s vs 206s. clap #6201: 193s vs 194s.
The handwritten briefs took time to write. One of them pointed at the wrong file. sourcebook generates in 3 seconds, for any repo, in any language. Matching human performance at zero marginal cost is the product.
By language
Structural context helps most where the module system is implicit and agents need a map to navigate.
| Language | Tasks | Signal |
|---|---|---|
| Go | 4 tasks | Strongest effect. Import graph analysis gives agents the structural map Go doesn't surface in file headers. bubbletea: tightest variance of any task. |
| Rust | 2 tasks | clap #6275: no context = failure. Rust workspaces are structurally opaque without a module map. |
| Python | 5 tasks | Moderate effect. Hub file analysis helps on larger Python codebases (pydantic, fastapi). fastapi: handwritten pointed at wrong file. |
| TypeScript | 8 tasks | Mixed. Strong wins on large repos (next.js, drizzle). Flat or negative on small, well-documented libraries. TS has the best native tooling for agent navigation. |
The ETH Zurich question
In February 2025, ETH Zurich found that auto-generated context hurts agent performance by 2–3%. That finding was right — for the kind of context they tested: stack descriptions, project structure, standard commands. Things agents can figure out by reading files.
Our data shows the flip side. Auto-generated context that surfaces non-discoverable structural signals — import topology, co-change coupling, hub files — doesn't hurt. It performs at parity with handwritten context. And it eliminates the failure modes that handwritten context can't prevent.
The problem was never auto-generation. It was what you auto-generate. Redundant information creates noise. Structural information creates a map.
Where it doesn't help
Small, well-documented TypeScript libraries. When the issue description alone is sufficient and the codebase has strong type declarations, agents navigate fine without a structural map. hono (+69%) and vercel/ai #13354 (+65%) both showed sourcebook adding overhead without value.
Simple, localized bugs. If the error message points directly to the fix, context is overhead. The value of a map scales with how lost the agent would be without one.
The pattern: sourcebook's value scales with structural opacity. Large monorepos, implicit module systems, unfamiliar languages — that's where agents need a map. Small, readable projects don't need one.
Limitations
- No individual result reached statistical significance. Agent behavior has high intrinsic variance. N=4 runs per condition is enough to see the variance pattern but not enough for p<0.05 on any single task. We're reporting what the data shows, not what we can formally prove.
- Correctness is file overlap, not semantic comparison. We compare which files the agent touched against the reference PR. This doesn't verify the logic of the fix.
- Single model (Claude Sonnet 4). Results may differ on other models.
- We wrote the handwritten baselines ourselves. A different expert might write better ones. The comparison is against a reasonable handwritten brief, not the theoretical best.
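For reference, the file-overlap metric mentioned in the limitations above can be stated in a few lines. This is an illustrative sketch (recall against the reference PR's file set), not necessarily the harness's exact formula:

```python
def file_overlap(agent_files, reference_files):
    """Fraction of reference-PR files the agent also touched, in [0.0, 1.0].

    Measures *where* the agent edited, not whether the fix's logic is right.
    """
    reference = set(reference_files)
    if not reference:
        return 0.0
    return len(set(agent_files) & reference) / len(reference)
```

Under this definition, editing only the wrong file scores 0.0 and touching every reference file scores 1.0, which matches how the 0% and 100% overlap results are reported.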
What we learned
We started this benchmark trying to prove sourcebook makes agents faster. The data told us something different. sourcebook makes agents more reliable. It eliminates the wandering runs. It prevents the wrong-file failures. It matches human-written context without human effort.
Over time, reliability is speed. The team that runs 100 tasks where every one finishes in 30–57 seconds ships faster than the team that's sometimes 21 seconds and sometimes 252. Not because the agent got faster. Because it stopped getting lost.
A map, not a boost.
Full benchmark harness, task definitions, handwritten baselines, and raw results are open source: github.com/maroondlabs/sourcebook/tree/main/benchmark
Your agent shouldn't guess. It should navigate.
npx sourcebook init
No API keys. No LLM. Generates a structural map of your codebase in 3 seconds.