Here's what we thought the problem was: context.
Give an AI agent better information about the codebase — the conventions, the architecture, the fragile files — and it would make fewer mistakes. That was the hypothesis. So we built a benchmark to test it.
30+ real GitHub issues. 17 repositories. 4 languages. Claude running against each one under three conditions: no context file, a generated CLAUDE.md, and a handwritten brief from someone who knew the codebase.
The pass rates were nearly identical.
We ran the same tasks multiple times. We varied the repos. We used issues that had already been fixed, so we knew what "correct" looked like. The context files didn't move the needle.
What did?
The benchmark
We pulled closed GitHub issues with merged PRs across 17 repositories — Go, Rust, TypeScript, Python. Repos you'd recognize: ripgrep, FastAPI, Django, Next.js, Polars, Deno, clap, hono, gin, prometheus. Each issue had a reference fix we could compare against.
We ran Claude Sonnet against each issue under three conditions:
- None — no context file. Just the issue description and the codebase.
- Sourcebook-generated CLAUDE.md — auto-extracted from import graphs and git history.
- Handwritten brief — a CLAUDE.md written by someone who understood the project.
We split the tasks into two categories. Easy tasks: single-file fixes, clear issue descriptions, smaller repos. Hard tasks: multi-file fixes, symptom-only descriptions, large repos with 500–2000+ files.
The metric: file recall — did the agent edit the same source files as the reference PR? It's an imperfect metric, but it's measurable and it catches the most common failure mode: an agent that fixes the symptom in one file while missing every other file the real fix touched.
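Concretely, file recall is just the fraction of the reference PR's changed files that the agent's diff also touched. A minimal sketch (the function name and example paths are ours, not the benchmark harness's API):

```python
def file_recall(agent_files: set[str], reference_files: set[str]) -> float:
    """Fraction of the reference PR's changed files the agent also edited.

    This is exactly the metric that punishes symptom-only fixes:
    editing 1 of 3 reference files scores 0.33, even if that one
    edit looks plausible on its own.
    """
    if not reference_files:
        return 1.0  # nothing to recall
    return len(agent_files & reference_files) / len(reference_files)


# Example: the reference PR touched three files; the agent touched
# one of them plus an unrelated file.
agent = {"src/parser.py", "src/utils.py"}
reference = {"src/parser.py", "src/lexer.py", "tests/test_parser.py"}
print(round(file_recall(agent, reference), 2))  # → 0.33
```

Note that extra files the agent touched don't lower recall; the metric only asks whether the reference files were covered.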
Finding 1: context doesn't help on easy tasks
On easy tasks (21 of them), all three conditions produced nearly identical results.
| Condition | Avg file recall | Note |
|---|---|---|
| None | ~85% | baseline |
| Sourcebook | ~85% | no change |
| Handwritten | ~80% | slightly worse |
On simple tasks, agents don't need context files. The issue description is usually enough to find the right files. Adding more information doesn't help — and sometimes it distracts.
Finding 2: context doesn't help on hard tasks either
Hard tasks were where we expected context to matter. Large repos, multi-file fixes, vague symptom descriptions. This is exactly the scenario CLAUDE.md was designed for.
| Condition | Avg file recall | Note |
|---|---|---|
| None | 42% | baseline |
| Sourcebook | 41% | statistically identical |
| Handwritten | 19% | significantly worse |
Sourcebook matched the baseline within a point. Handwritten context was worse than no context at all — by 23 percentage points.
The failures were sharp. On ansible #86694, handwritten context dropped file recall from 67% to 0%. On prometheus #18397, it dropped from 100% to 0%. The human-written summaries emphasized the wrong things and sent the agent in the wrong direction.
Human-written context can make agents worse.
This is consistent with what ETH Zurich found in their February 2025 AGENTS.md study: auto-generated context files that restate obvious information degrade agent performance. The mechanism is the same. Additional context competes for attention with the task itself. When the context is wrong or misleading, the agent follows it.
Sourcebook never showed this degradation. It matched baseline without beating it — but it also never sent agents in the wrong direction.
Finding 3: variance reduction is real
Across all conditions, we saw one consistent difference: repeat-run variance.
We asked the agent to fix the same task four times. On bubbletea with handwritten context, the run times were 21 seconds, 31 seconds, 44 seconds, and 252 seconds. Same agent, same model, same context, same bug. The slow run wandered for over four minutes.
With sourcebook context: 34 seconds, 34 seconds, 41 seconds, 57 seconds.
Not faster on the best run. But the bad run never happened.
| Task | Handwritten StdDev | Sourcebook StdDev | Reduction |
|---|---|---|---|
| bubbletea | 110s | 11s | 90% |
| next.js | 62s | 7s | 88% |
| pydantic | 46s | 12s | 74% |
| gin | 17s | 11s | 32% |
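The standard deviations above come straight from the per-run wall-clock times. For the bubbletea numbers quoted earlier, the calculation is just the sample standard deviation, here via Python's statistics module:

```python
import statistics

# Wall-clock seconds for four repeat runs of the same bubbletea task.
handwritten = [21, 31, 44, 252]  # one run wandered for over four minutes
sourcebook = [34, 34, 41, 57]

hw_sd = statistics.stdev(handwritten)  # sample standard deviation
sb_sd = statistics.stdev(sourcebook)

print(f"handwritten: {hw_sd:.0f}s, sourcebook: {sb_sd:.0f}s")
# → handwritten: 110s, sourcebook: 11s
print(f"reduction: {1 - sb_sd / hw_sd:.0%}")
# → reduction: 90%
```

The mean barely moves between conditions; the single 252-second outlier is what inflates the handwritten figure. That's the "bad run" the table is measuring.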
Structural context doesn't make agents smarter. It makes them more predictable. In CI, that matters a lot.
So what actually catches failures?
Here's the pattern we kept seeing in the hard task failures.
The agent would fix the right file. It would pass basic tests. It would produce a reasonable-looking patch. And then it would stop — without touching the two or three other files that always change alongside the one it just edited.
It wasn't that the agent failed to understand the task. It was that the agent didn't know what it hadn't checked.
This is a different failure mode than "agent went to the wrong file." The agent went to the right file. It just didn't finish. And nothing in the standard workflow caught it.
We built a check for this: sourcebook check.
It looks at the diff and asks: given the files you changed, what else should have changed? It uses two layers. Layer A is static: git history of which files co-change together, plus import-graph traversal to find what depends on what. Layer B is AI-assisted: structural analysis of whether the change looks complete.
Layer A runs in milliseconds and catches the structural gaps. It's the one you wire into CI.
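The co-change half of Layer A can be sketched in a few lines: mine `git log --name-only` for files that historically change in the same commit, then flag any frequent companion that the current diff leaves untouched. This is our illustration of the idea, not sourcebook's implementation — the sentinel trick, threshold, and function names are all made up:

```python
import subprocess
from collections import Counter
from itertools import combinations

SENTINEL = "@@commit@@"  # marks commit boundaries in the log output

def cochange_counts(repo: str, limit: int = 500) -> Counter:
    """Count how often each pair of files changed in the same commit."""
    log = subprocess.run(
        ["git", "-C", repo, "log", f"-{limit}",
         "--name-only", f"--pretty=format:{SENTINEL}"],
        capture_output=True, text=True, check=True,
    ).stdout
    pairs: Counter = Counter()
    for block in log.split(SENTINEL):
        files = sorted({line.strip() for line in block.splitlines() if line.strip()})
        for pair in combinations(files, 2):
            pairs[pair] += 1
    return pairs

def missing_companions(changed: set[str], pairs: Counter,
                       min_count: int = 5) -> set[str]:
    """Files that frequently co-change with the edited files
    but are absent from the current diff."""
    flagged: set[str] = set()
    for (a, b), n in pairs.items():
        if n < min_count:
            continue
        if a in changed and b not in changed:
            flagged.add(b)
        elif b in changed and a not in changed:
            flagged.add(a)
    return flagged
```

Because it only reads git history and the diff, a check like this has no model in the loop — which is why it can run in milliseconds on every PR.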
We ran sourcebook check against the benchmark failures. It flagged the incomplete changes. In every case where the agent missed files that should have changed together, the check caught it.
The insight isn't that AI agents are bad. It's that they have no mechanism to verify their own completeness. They produce a patch and stop. There's no step that asks: "did I check everything that needed to change?"
sourcebook check is that step.
What we learned
Context files are orientation tools, not quality guarantees. They help agents start in the right place. They don't help agents know when they're done.
Handwritten context has a real downside at scale. A carefully written brief is great when it's accurate. When it emphasizes the wrong files or describes a workflow that doesn't match the specific issue, it misleads. On our hardest tasks, it was measurably worse than nothing.
The failure mode that matters most is incompleteness, not incorrectness. Agents that go to the wrong file are easy to catch. Agents that go to the right file but stop too early are invisible until the PR lands and someone else hits the bug.
The check belongs in the workflow, not in the context. Adding more information before the agent runs doesn't solve incompleteness. Verifying the output after the agent runs does.
Limitations
File recall is an imperfect metric. It measures whether the agent touched the same files as the reference PR — not whether those changes were correct. An agent could touch all the right files and still write the wrong fix.
We built the tool and ran the benchmark. That's a conflict of interest and we know it. The benchmark harness is open, the issues and reference PRs are all public, and we've tried to report the results honestly — including the cases where context didn't help and where our tool showed no accuracy improvement.
If you want to reproduce the results or run it on your own repos: the benchmark harness is in the repo.