Here's a result from our benchmark that we didn't expect.
We asked an AI agent to fix a rendering bug in bubbletea, a Go terminal UI library. We gave it expert-written context about the project — the kind of brief a strong engineer would write for a new teammate. We ran it four times.
The results: 21 seconds. 31 seconds. 44 seconds. 252 seconds.
Same agent. Same model. Same context. Same bug. Three runs found the fix fast. One run wandered for over four minutes.
Then we ran the same task with sourcebook-generated context — a structural map of the codebase extracted automatically from import graphs and git history. Four runs: 34 seconds. 34 seconds. 41 seconds. 57 seconds.
Not faster on the best run. But the bad run never happened.
This pattern repeated across our benchmark. And it changed how we think about what sourcebook actually does.
The experiment
19 real, closed GitHub issues across 10 open-source repos in 4 languages. Each task has a merged PR fix. The harness checks out the codebase to the state before the fix, validates the bug is still present, injects a context file (or none), and runs the agent.
Model: Claude Sonnet 4 (pinned). Max turns: 50. No human intervention. Then repeat runs on the top 5 tasks — 4 runs per condition — to measure variance.
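In code, the harness loop looks roughly like this. A minimal Python sketch; the helper callables (`checkout_pre_fix`, `bug_still_present`, `inject_context`, `run_agent`) are hypothetical stand-ins for the real git and agent plumbing, not the actual harness API:

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class RunResult:
    task_id: str
    condition: str       # "none" | "handwritten" | "sourcebook"
    seconds: float
    files_touched: set   # files the agent edited, for overlap scoring

def run_task(
    task_id: str,
    condition: str,
    checkout_pre_fix: Callable[[str], None],     # checkout commit before the fix
    bug_still_present: Callable[[str], bool],    # re-run the failing repro
    inject_context: Callable[[str, str], None],  # write the context file
    run_agent: Callable[[str], RunResult],       # agent run, capped at 50 turns
) -> Optional[RunResult]:
    """One benchmark run: checkout, validate, inject, execute."""
    checkout_pre_fix(task_id)
    if not bug_still_present(task_id):
        return None  # bug no longer reproduces at this commit; skip the task
    if condition != "none":
        inject_context(task_id, condition)
    return run_agent(task_id)
```

The design point is that context injection is the only step that varies between conditions; everything else is held fixed.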
Three conditions:
- None — just the issue description and the codebase.
- Handwritten — a CLAUDE.md written by a developer who understands the project.
- sourcebook — auto-generated structural context. Hub files from import graph analysis, co-change coupling from git forensics, convention signals. No stack labels, no project structure, no standard commands — only what agents can't discover by reading files.
Repos: cal.com, hono, pydantic, fastapi, gin, chi, bubbletea, clap, next.js, vercel/ai — spanning enterprise monorepos, Python libraries, Go web frameworks, and Rust CLI tools.
What the single runs showed
In the initial benchmark (one run per condition), sourcebook was faster than handwritten context on 13 of 19 tasks. The average gap was 16%. At first glance, this looks like a speed story.
| Task | Lang | None | Handwritten | sourcebook | SB vs HW |
|---|---|---|---|---|---|
| next.js #74843 | TS | 107s | 161s | 30s | -81% |
| bubbletea #1322 | GO | 135s | 252s | 57s | -77% |
| clap #6201 | RS | 192s | 300s | 106s | -65% |
| fastapi #14508 | PY | 278s | 191s | 70s | -63% |
| pydantic #12715 | PY | 210s | 294s | 197s | -33% |
| gin #4468 | GO | 102s | 119s | 81s | -32% |
| vercel/ai #13839 | TS | 177s | 185s | 144s | -22% |
| drizzle #4421 | TS | 179s | 187s | 149s | -20% |
| cal.com #27907 | TS | 103s | 146s | 126s | -14% |
| clap #6275 | RS | 155s* | 209s | 192s | -8% |
| gin #2959 | GO | 184s | 196s | 182s | -7% |
| vercel/ai #13988 | TS | 274s | 221s | 209s | -5% |
| fastapi #14454 | PY | 237s | 261s** | 254s | -3% |
| cal.com #27298 | TS | 148s | 140s | 141s | +1% |
| chi #954 | GO | 258s | 133s | 134s | +1% |
| pydantic #13051 | PY | 222s | 150s | 168s | +12% |
| fastapi #14483 | PY | 221s | 202s | 227s | +12% |
| vercel/ai #13354 | TS | 99s | 167s | 276s | +65% |
| hono #4806 | TS | 197s | 232s | 393s | +69% |
*clap #6275 no-context: agent produced 0 lines (gave up). **fastapi #14454 handwritten: agent edited the wrong file (0% file overlap with reference PR).
But single runs lie. We knew that. So we ran the top 5 tasks four times each.
What the repeat runs showed
The speed advantage mostly disappeared. No individual result was statistically significant at 95% confidence. The variance within each condition is too high for N=4 to detect a reliable speed difference.
But something else appeared in the data.
| Task | Condition | N | Mean | Range | StdDev |
|---|---|---|---|---|---|
| next.js #74843 | sourcebook | 4 | 40s | 30–48s | 7s |
| next.js #74843 | handwritten | 4 | 72s | 29–162s | 62s |
| bubbletea #1322 | sourcebook | 4 | 42s | 34–57s | 11s |
| bubbletea #1322 | handwritten | 4 | 87s | 21–252s | 110s |
| clap #6201 | sourcebook | 4 | 193s | 106–310s | 86s |
| clap #6201 | handwritten | 4 | 194s | 121–300s | 78s |
| gin #4468 | sourcebook | 5 | 97s | 82–110s | 11s |
| gin #4468 | handwritten | 4 | 96s | 80–119s | 17s |
| pydantic #12715 | sourcebook | 3 | 210s | 199–224s | 13s |
| pydantic #12715 | handwritten | 3 | 206s | 194–225s | 17s |
Look at the range column. sourcebook produced the tighter range on 4 of 5 tasks. The means are close. The variance is not.
That bubbletea result — 21 to 252 seconds with handwritten context — is a 12x spread. Same agent, same brief, same bug. The agent sometimes found the right path immediately. Sometimes it wandered for four minutes.
With sourcebook, the range was 34 to 57 seconds. The bad run never happened. Not because the agent got lucky. Because the structural map — hub files from import graph analysis, module boundaries, where the rendering pipeline lives — gave it a path that doesn't depend on luck.
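The variance numbers are easy to check from the raw run times. This short Python snippet recomputes the bubbletea means, ranges, and sample standard deviations reported in the repeat-run table:

```python
from statistics import mean, stdev

# bubbletea #1322 completion times in seconds, one entry per run
handwritten = [21, 31, 44, 252]
sourcebook = [34, 34, 41, 57]

for label, runs in [("handwritten", handwritten), ("sourcebook", sourcebook)]:
    print(f"{label:12} mean={mean(runs):6.1f}s  "
          f"range={min(runs)}-{max(runs)}s  stdev={stdev(runs):6.1f}s")
```

The handwritten condition's standard deviation (~110s) is larger than its mean (~87s): a single wandering run dominates the distribution.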
The mechanism
Agents don't fail because they're slow. They fail because they don't know where to start.
Point an agent at a 10,000-file monorepo and tell it to fix a bug. It will read files. Lots of files. It will trace imports, scan directory trees, try to build a mental model from raw source. Sometimes it finds the right file on the second try. Sometimes it reads 30 files before it gets oriented. That's the variance. That's where the 252-second runs come from.
Handwritten context helps — on average. But human-written briefs describe how a framework works in general. They say things like "tests are in `__tests__`" and "we use tRPC for API routes." That's useful. But it doesn't tell the agent which file is the structural center of the rendering pipeline, or which files always change together, or where the import graph converges.
Structural context is different. It's a map. It says: this file is imported by 52 other files. This file and that file change together 88% of the time. The rendering logic converges here. An agent with a map doesn't explore. It navigates.
That's the mechanism. Not speed. Removal of wasted motion.
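For concreteness, here is a minimal Python sketch of the two structural signals described above: hub files as import-graph in-degree, and co-change coupling mined from commit history. The definitions are simplified illustrations (in particular, the coupling denominator is one plausible choice), not sourcebook's actual implementation:

```python
from collections import Counter
from itertools import combinations

def hub_scores(import_edges):
    """In-degree per file: how many other files import it.

    import_edges is a list of (importer, imported) pairs.
    """
    return Counter(imported for _importer, imported in import_edges)

def co_change(commits):
    """Coupling per file pair: how often two files change in the same commit.

    commits is a list of sets of changed files. We normalize by the less
    frequently changed file of the pair (one plausible definition).
    """
    pair_counts = Counter()
    file_counts = Counter()
    for files in commits:
        file_counts.update(files)
        pair_counts.update(combinations(sorted(files), 2))
    return {
        (a, b): n / min(file_counts[a], file_counts[b])
        for (a, b), n in pair_counts.items()
    }
```

A file with an in-degree of 52 is a hub; a pair with coupling near 0.9 is a "these always ship together" signal. Both are invisible when reading files one at a time, which is exactly why agents can't discover them on their own.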
Three things the data shows
1. Sourcebook reduces variance
Across 5 repeat-run tasks, sourcebook produced the tightest completion range on 4 of them. The agent finds a reliable path instead of sometimes-fast-sometimes-wandering.
next.js: 30–48s (SB) vs 29–162s (HW). bubbletea: 34–57s vs 21–252s. gin: 82–110s vs 80–119s. pydantic: 199–224s vs 194–225s. The pattern is consistent: the structural map eliminates the long tail.
2. Sourcebook prevents failure modes
clap #6275 (Rust): Without context, the agent produced 0 lines — it gave up. With sourcebook, it found and fixed the bug. 86 lines, 100% file overlap with the reference PR. Context was the difference between failure and success.
fastapi #14454: Handwritten context — written by a developer who understands FastAPI — led the agent to the wrong file. 0% file overlap with the reference PR. sourcebook led it to the correct file. 100% overlap. Human knowledge of a framework isn't always enough. The import graph found what the human missed.
These aren't speed differences. They're success vs failure. The agent didn't just wander slowly — it went the wrong direction entirely.
3. Sourcebook matches handwritten context — without manual effort
On repeat runs, sourcebook's mean completion time is within noise of handwritten context on every task. gin: 97s vs 96s. pydantic: 210s vs 206s. clap #6201: 193s vs 194s.
The handwritten briefs took time to write. One of them pointed at the wrong file. sourcebook generates in 3 seconds, for any repo, in any language. Matching human performance at zero marginal cost is the product.
By language
Structural context helps most where the module system is implicit and agents need a map to navigate.
| Language | Tasks | Signal |
|---|---|---|
| Go | 4 tasks | Strongest effect. Import graph analysis gives agents the structural map Go doesn't surface in file headers. bubbletea: tightest variance of any task. |
| Rust | 2 tasks | clap #6275: no context = failure. Rust workspaces are structurally opaque without a module map. |
| Python | 5 tasks | Moderate effect. Hub file analysis helps on larger Python codebases (pydantic, fastapi). fastapi: handwritten pointed at wrong file. |
| TypeScript | 8 tasks | Mixed. Strong wins on large repos (next.js, drizzle). Flat or negative on small, well-documented libraries. TS has the best native tooling for agent navigation. |
The ETH Zurich question
In February 2025, ETH Zurich found that auto-generated context hurts agent performance by 2–3%. That finding was right — for the kind of context they tested: stack descriptions, project structure, standard commands. Things agents can figure out by reading files.
Our data shows the flip side. Auto-generated context that surfaces non-discoverable structural signals — import topology, co-change coupling, hub files — doesn't hurt. It performs at parity with handwritten context. And it eliminates the failure modes that handwritten context can't prevent.
The problem was never auto-generation. It was what you auto-generate. Redundant information creates noise. Structural information creates a map.
Where it doesn't help
Small, well-documented TypeScript libraries. When the issue description alone is sufficient and the codebase has strong type declarations, agents navigate fine without a structural map. hono (+69%) and vercel/ai #13354 (+65%) both showed sourcebook adding overhead without value.
Simple, localized bugs. If the error message points directly to the fix, context is overhead. The value of a map scales with how lost the agent would be without one.
The pattern: sourcebook's value scales with structural opacity. Large monorepos, implicit module systems, unfamiliar languages — that's where agents need a map. Small, readable projects don't need one.
Limitations
- No individual result reached statistical significance. Agent behavior has high intrinsic variance. N=4 runs per condition is enough to see the variance pattern but not enough for p<0.05 on any single task. We're reporting what the data shows, not what we can formally prove.
- Correctness is file overlap, not semantic comparison. We compare which files the agent touched against the reference PR. This doesn't verify the logic of the fix.
- Single model (Claude Sonnet 4). Results may differ on other models.
- We wrote the handwritten baselines ourselves. A different expert might write better ones. The comparison is against a reasonable handwritten brief, not the theoretical best.
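For reference, the file-overlap metric mentioned in the limitations above can be stated in a few lines. This is an illustrative sketch (recall against the reference PR's file set), not necessarily the harness's exact formula:

```python
def file_overlap(agent_files, reference_files):
    """Fraction of reference-PR files the agent also touched, in [0.0, 1.0].

    Measures *where* the agent edited, not whether the fix's logic is right.
    """
    reference = set(reference_files)
    if not reference:
        return 0.0
    return len(set(agent_files) & reference) / len(reference)
```

Under this definition, editing only the wrong file scores 0.0 and touching every reference file scores 1.0, which matches how the 0% and 100% overlap results are reported.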
What we learned
We started this benchmark trying to prove sourcebook makes agents faster. The data told us something different. sourcebook makes agents more reliable. It eliminates the wandering runs. It prevents the wrong-file failures. It matches human-written context without human effort.
Over time, reliability is speed. The team that runs 100 tasks where every one finishes in 30–57 seconds ships faster than the team that's sometimes 21 seconds and sometimes 252. Not because the agent got faster. Because it stopped getting lost.
A map, not a boost.
Full benchmark harness, task definitions, handwritten baselines, and raw results are open source: github.com/maroondlabs/sourcebook/tree/main/benchmark
Your agent shouldn't guess. It should navigate.
npx sourcebook init
No API keys. No LLM. Generates a structural map of your codebase in 3 seconds.