BENCHMARK · APRIL 2026

Fast Agents Are Just Agents That Don't Wander

We ran 19 bug-fix tasks with repeat runs. The biggest finding wasn't about speed. It was about what happens when agents don't know where to start.

Here's a result from our benchmark that we didn't expect.

We asked an AI agent to fix a rendering bug in bubbletea, a Go terminal UI library. We gave it expert-written context about the project — the kind of brief a strong engineer would write for a new teammate. We ran it four times.

The results: 21 seconds. 31 seconds. 44 seconds. 252 seconds.

Same agent. Same model. Same context. Same bug. Three runs found the fix fast. One run wandered for over four minutes.

Then we ran the same task with sourcebook-generated context — a structural map of the codebase extracted automatically from import graphs and git history. Four runs: 34 seconds. 34 seconds. 41 seconds. 57 seconds.

Not faster on the best run. But the bad run never happened.

bubbletea #1322 — 4 runs each
Handwritten context: 21–252s
sourcebook context: 34–57s
(Scale: 0–252s. Each mark is one run; bar shows range.)

This pattern repeated across our benchmark. And it changed how we think about what sourcebook actually does.

The experiment

19 real, closed GitHub issues across 10 open-source repos in 4 languages. Each task has a merged PR fix. The harness checks out the codebase to the state before the fix, validates the bug is still present, injects a context file (or none), and runs the agent.

Model: Claude Sonnet 4 (pinned). Max turns: 50. No human intervention. Then repeat runs on the top 5 tasks — 4 runs per condition — to measure variance.

Three conditions: no injected context, an expert-handwritten brief, and sourcebook-generated structural context.

Repos: cal.com, hono, pydantic, fastapi, gin, chi, bubbletea, clap, next.js, vercel/ai — spanning enterprise monorepos, Python libraries, Go web frameworks, and Rust CLI tools.

What the single runs showed

In the initial benchmark (one run per condition), sourcebook was faster than handwritten context on 13 of 19 tasks, with an average gap of 16%. At first glance, this looks like a speed story.

Task             | Lang | None  | Handwritten | sourcebook | SB vs HW
next.js #74843   | TS   | 107s  | 161s        | 30s        | -81%
bubbletea #1322  | GO   | 135s  | 252s        | 57s        | -77%
clap #6201       | RS   | 192s  | 300s        | 106s       | -65%
fastapi #14508   | PY   | 278s  | 191s        | 70s        | -63%
pydantic #12715  | PY   | 210s  | 294s        | 197s       | -33%
gin #4468        | GO   | 102s  | 119s        | 81s        | -32%
vercel/ai #13839 | TS   | 177s  | 185s        | 144s       | -22%
drizzle #4421    | TS   | 179s  | 187s        | 149s       | -20%
cal.com #27907   | TS   | 103s  | 146s        | 126s       | -14%
clap #6275       | RS   | 155s* | 209s        | 192s       | -8%
gin #2959        | GO   | 184s  | 196s        | 182s       | -7%
vercel/ai #13988 | TS   | 274s  | 221s        | 209s       | -5%
fastapi #14454   | PY   | 237s  | 261s**      | 254s       | -3%
cal.com #27298   | TS   | 148s  | 140s        | 141s       | +1%
chi #954         | GO   | 258s  | 133s        | 134s       | +1%
pydantic #13051  | PY   | 222s  | 150s        | 168s       | +12%
fastapi #14483   | PY   | 221s  | 202s        | 227s       | +12%
vercel/ai #13354 | TS   | 99s   | 167s        | 276s       | +65%
hono #4806       | TS   | 197s  | 232s        | 393s       | +69%

*clap #6275 no-context: agent produced 0 lines (gave up). **fastapi #14454 handwritten: agent edited the wrong file (0% file overlap with reference PR).

But single runs lie. We knew that. So we ran the top 5 tasks four times each.

What the repeat runs showed

The speed advantage mostly disappeared. No individual result was statistically significant at 95% confidence. The variance within each condition is too high for N=4 to detect a reliable speed difference.

But something else appeared in the data.

Task            | Condition   | N | Mean | Range    | StdDev
next.js #74843  | sourcebook  | 4 | 40s  | 30–48s   | 7s
next.js #74843  | handwritten | 4 | 72s  | 29–162s  | 62s
bubbletea #1322 | sourcebook  | 4 | 42s  | 34–57s   | 11s
bubbletea #1322 | handwritten | 4 | 87s  | 21–252s  | 110s
clap #6201      | sourcebook  | 4 | 193s | 106–310s | 86s
clap #6201      | handwritten | 4 | 194s | 121–300s | 78s
gin #4468       | sourcebook  | 5 | 97s  | 82–110s  | 11s
gin #4468       | handwritten | 4 | 96s  | 80–119s  | 17s
pydantic #12715 | sourcebook  | 3 | 210s | 199–224s | 13s
pydantic #12715 | handwritten | 3 | 206s | 194–225s | 17s

Look at the range column. sourcebook produced the tighter range on 4 of 5 tasks. The means are close. The variance is not.

That bubbletea result — 21 to 252 seconds with handwritten context — is a 12x spread. Same agent, same brief, same bug. The agent sometimes found the right path immediately. Sometimes it wandered for four minutes.

With sourcebook, the range was 34 to 57 seconds. The bad run never happened. Not because the agent got lucky. Because the structural map — hub files from import graph analysis, module boundaries, where the rendering pipeline lives — gave it a path that doesn't depend on luck.
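The bubbletea numbers in the repeat-run table can be reproduced directly from the four raw run times, using the sample standard deviation:

```python
import statistics

handwritten = [21, 31, 44, 252]  # bubbletea #1322 run times, seconds
sourcebook = [34, 34, 41, 57]

for name, runs in [("handwritten", handwritten), ("sourcebook", sourcebook)]:
    mean = statistics.mean(runs)
    sd = statistics.stdev(runs)  # sample (N-1) standard deviation
    print(f"{name}: mean={mean:.0f}s range={min(runs)}–{max(runs)}s stdev={sd:.0f}s")
# handwritten: mean=87s range=21–252s stdev=110s
# sourcebook: mean=42s range=34–57s stdev=11s
```

The handwritten mean is dragged up by the single 252-second run; the standard deviation (110s vs 11s) is the whole story.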

The mechanism

Agents don't fail because they're slow. They fail because they don't know where to start.

Point an agent at a 10,000-file monorepo and tell it to fix a bug. It will read files. Lots of files. It will trace imports, scan directory trees, try to build a mental model from raw source. Sometimes it finds the right file on the second try. Sometimes it reads 30 files before it gets oriented. That's the variance. That's where the 252-second runs come from.

Handwritten context helps — on average. But human-written briefs describe how a framework works in general. They say things like "tests are in __tests__" and "we use tRPC for API routes." That's useful. But it doesn't tell the agent which file is the structural center of the rendering pipeline, or which files always change together, or where the import graph converges.

Structural context is different. It's a map. It says: this file is imported by 52 other files. This file and that file change together 88% of the time. The rendering logic converges here. An agent with a map doesn't explore. It navigates.
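Both signals are cheap to compute. Here is a toy sketch of each, on a made-up miniature import graph and commit history; the file names, counts, and percentages are illustrative only, not data from the benchmark repos or from sourcebook itself:

```python
from collections import Counter
from itertools import combinations

# Import graph: file -> files it imports (illustrative, not real repo data).
imports = {
    "app.py":    ["render.py", "model.py"],
    "cli.py":    ["render.py", "model.py"],
    "export.py": ["render.py"],
    "render.py": ["model.py"],
}

# Hub score = in-degree: how many files import this one.
in_degree = Counter(dep for deps in imports.values() for dep in deps)
print(in_degree.most_common(2))  # render.py and model.py are the hubs

# Co-change coupling: fraction of commits touching A that also touch B.
commits = [
    {"render.py", "model.py"},
    {"render.py", "model.py", "cli.py"},
    {"render.py", "export.py"},
    {"app.py"},
]
pair_counts = Counter()
file_counts = Counter()
for files in commits:
    file_counts.update(files)
    pair_counts.update(frozenset(p) for p in combinations(sorted(files), 2))

pair = frozenset({"render.py", "model.py"})
coupling = pair_counts[pair] / file_counts["render.py"]
print(f"render.py / model.py co-change: {coupling:.0%}")  # 2 of 3 commits -> 67%
```

In a real repo, the import edges come from parsing source files and the commit sets from `git log --name-only`; the principle is the same.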

That's the mechanism. Not speed. Removal of wasted motion.

Three things the data proves

1. Sourcebook reduces variance

Across 5 repeat-run tasks, sourcebook produced the tightest completion range on 4 of them. The agent finds a reliable path instead of sometimes-fast-sometimes-wandering.

next.js: 30–48s (SB) vs 29–162s (HW). bubbletea: 34–57s vs 21–252s. gin: 82–110s vs 80–119s. pydantic: 199–224s vs 194–225s. The pattern is consistent: the structural map eliminates the long tail.

2. Sourcebook prevents failure modes

clap #6275 (Rust): Without context, the agent produced 0 lines — it gave up. With sourcebook, it found and fixed the bug. 86 lines, 100% file overlap with the reference PR. Context was the difference between failure and success.

fastapi #14454: Handwritten context — written by a developer who understands FastAPI — led the agent to the wrong file. 0% file overlap with the reference PR. sourcebook led it to the correct file. 100% overlap. Human knowledge of a framework isn't always enough. The import graph found what the human missed.

These aren't speed differences. They're success vs failure. The agent didn't just wander slowly — it went the wrong direction entirely.

3. Sourcebook matches handwritten context — without manual effort

On repeat runs, sourcebook's mean completion time is within noise of handwritten context on every task. gin: 97s vs 96s. pydantic: 210s vs 206s. clap #6201: 193s vs 194s.

The handwritten briefs took time to write. One of them pointed at the wrong file. sourcebook generates in 3 seconds, for any repo, in any language. Matching human performance at zero marginal cost is the product.

By language

Structural context helps most where the module system is implicit and agents need a map to navigate.

Language   | Tasks | Signal
Go         | 4     | Strongest effect. Import graph analysis gives agents the structural map Go doesn't surface in file headers. bubbletea: tightest variance of any task.
Rust       | 2     | clap #6275: no context = failure. Rust workspaces are structurally opaque without a module map.
Python     | 5     | Moderate effect. Hub file analysis helps on larger Python codebases (pydantic, fastapi). fastapi: handwritten context pointed at the wrong file.
TypeScript | 8     | Mixed. Strong wins on large repos (next.js, drizzle). Flat or negative on small, well-documented libraries. TS has the best native tooling for agent navigation.

The ETH Zurich question

In February 2025, ETH Zurich found that auto-generated context hurts agent performance by 2–3%. That finding was right — for the kind of context they tested: stack descriptions, project structure, standard commands. Things agents can figure out by reading files.

Our data shows the flip side. Auto-generated context that surfaces non-discoverable structural signals — import topology, co-change coupling, hub files — doesn't hurt. It performs at parity with handwritten context. And it eliminates the failure modes that handwritten context can't prevent.

The problem was never auto-generation. It was what you auto-generate. Redundant information creates noise. Structural information creates a map.

Where it doesn't help

Small, well-documented TypeScript libraries. When the issue description alone is sufficient and the codebase has strong type declarations, agents navigate fine without a structural map. hono (+69%) and vercel/ai #13354 (+65%) both showed sourcebook adding overhead without value.

Simple, localized bugs. If the error message points directly to the fix, context is overhead. The value of a map scales with how lost the agent would be without one.

The pattern: sourcebook's value scales with structural opacity. Large monorepos, implicit module systems, unfamiliar languages — that's where agents need a map. Small, readable projects don't need one.

Limitations

What we learned

We started this benchmark trying to prove sourcebook makes agents faster. The data told us something different. sourcebook makes agents more reliable. It eliminates the wandering runs. It prevents the wrong-file failures. It matches human-written context without human effort.

Over time, reliability is speed. The team that runs 100 tasks where every one finishes in 30–57 seconds ships faster than the team that's sometimes 21 seconds and sometimes 252. Not because the agent got faster. Because it stopped getting lost.

A map, not a boost.

Full benchmark harness, task definitions, handwritten baselines, and raw results are open source: github.com/maroondlabs/sourcebook/tree/main/benchmark

Your agent shouldn't guess. It should navigate.

$ npx sourcebook init

No API keys. No LLM. Generates a structural map of your codebase in 3 seconds.