ENGINEERING · APRIL 2026

AI agents don't fail. They stop too early.

I spent two weeks building context tools for AI coding agents. The data killed every hypothesis — then revealed the actual problem.

I build sourcebook — a CLI that analyzes codebases for AI coding agents. Import graphs, git forensics, convention detection, co-change coupling. It generates context files (CLAUDE.md, AGENTS.md) with the kind of project knowledge that agents can't infer from reading code alone.

I had a clear thesis: give agents better maps, they'll write better code. I spent two weeks testing it. The data said no. Then it showed me what actually matters.

The hypothesis

The scanner maps everything an agent should know before editing a codebase: which files are imported by hundreds of others (hub files), which commits were reverted (approaches the team rejected), which files always change together (co-change pairs), where the generated code lives (files an agent shouldn't edit).
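Of these signals, co-change coupling is the easiest to reconstruct independently: it falls straight out of git history. Here is a minimal sketch of the idea (my own simplification, not sourcebook's implementation), taking one set of touched file paths per commit, as you might parse from `git log --name-only`:

```python
from collections import Counter
from itertools import combinations

def co_change_pairs(commits, min_together=2):
    """Count how often each pair of files is modified in the same commit.

    commits: one set of touched file paths per commit.
    Returns {(a, b): ratio}, where ratio is the fraction of the
    less-frequently-changed file's commits that also touched the other.
    """
    pair_counts = Counter()
    file_counts = Counter()
    for files in commits:
        file_counts.update(files)
        for a, b in combinations(sorted(files), 2):
            pair_counts[(a, b)] += 1
    return {
        (a, b): n / min(file_counts[a], file_counts[b])
        for (a, b), n in pair_counts.items()
        if n >= min_together
    }

# Toy history: four commits and their touched files.
history = [
    {"src/git/utils.rs", "src/git/source.rs"},
    {"src/git/utils.rs", "src/git/source.rs"},
    {"src/git/utils.rs"},
    {"src/config/defaults.rs"},
]
pairs = co_change_pairs(history)
# source.rs appeared in 2 commits, both alongside utils.rs -> coupling 1.0
```

The denominator is a design choice: dividing by the smaller commit count keeps the ratio meaningful when one file churns far more often than the other.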

I scanned 30+ repos, fixed 12 bugs in the scanner, built a QA suite with 69 regression tests. The scanner works. It produces accurate, verifiable structural analysis. That was never the question.

The question was: does giving this to an agent actually help?

The benchmark

I ran 30+ real GitHub issues across 17 repos in 4 languages with a pinned model. Controlled conditions, multiple runs, no cherry-picking. Four treatment groups: a no-context baseline, sourcebook-generated context files, handwritten developer briefs, and MCP tools the agent could call on demand.

This is the part most devtools skip — honestly measuring whether the tool actually helps. It's uncomfortable to build something for two weeks and then subject it to a controlled experiment. But if you don't, you're selling vibes.

The results

sourcebook context: 41% file recall. No context: 42%. Statistically nothing.

Handwritten developer briefs: worse on hard tasks. Went from 42% to 19%. Agents anchored on the summaries and stopped exploring. The human-written guide became a ceiling, not a floor.

MCP tools: agents called them 0 times across 3 runs. When an agent has grep and file read built in, it ignores optional tools entirely. Agents don't browse menus of available tools; they use what's in front of them.

benchmark results (30+ tasks, pinned model)

No context:          42% file recall
sourcebook context:  41% file recall  (no improvement)
Developer briefs:    19% on hard tasks  (actively worse)
MCP tools:           0 tool calls     (completely ignored)

The map was accurate. Agents just don't use context effectively. Two weeks of scanner work, and the hypothesis was dead.

The pattern

I could have stopped there. Instead I looked at failures one by one. Not the scores — the actual diffs.

The code was correct. The changes were incomplete.

6 out of 9 hard-task failures followed the exact same pattern: the agent finds the primary file, makes a clean edit, and stops. Test files untouched. Sibling modules unchanged. Config files stale. The agent solved the core problem and declared victory without finishing the job.

This isn't a context problem. The agent doesn't need more information upfront. It needs someone checking whether the work is actually done.

It's a completeness problem.

The pivot

I stopped trying to give agents more information. I started checking their work instead.

The same engine — import graphs, co-change coupling, test file mappings — but applied differently. Instead of pre-task context, it's post-edit validation. Read the diff, check what changed against what should have changed, flag what's missing.

$ sourcebook check

Analyzing diff...

WARNING Modified src/git/utils.rs but not src/git/source.rs
  These files have 87% co-change coupling (changed together in 14/16 commits)

WARNING Modified src/planning/plan_instance.go but not tests
  No test file found covering plan_instance.go

CLEAN src/config/defaults.rs — no coupled files affected

sourcebook check reads your diff and flags missing companion files. No LLM, under a second, deterministic.
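The deterministic pass needs nothing beyond the diff's file list and two precomputed maps. A rough sketch of that logic (the function name, threshold, and test-file map are my assumptions about the shape, not sourcebook's internals):

```python
def check_diff(changed, coupling, test_map, threshold=0.7):
    """Flag likely-missing companion files for a set of changed paths.

    changed:  set of file paths in the diff
    coupling: {(a, b): ratio} of historical co-change strength
    test_map: {source_path: test_path} derived by the scanner
    Returns a list of warnings; an empty list means clean.
    """
    warnings = []
    for (a, b), ratio in coupling.items():
        if ratio < threshold:
            continue
        # One side of a strongly coupled pair changed without the other.
        if a in changed and b not in changed:
            warnings.append(f"Modified {a} but not {b} ({ratio:.0%} co-change coupling)")
        elif b in changed and a not in changed:
            warnings.append(f"Modified {b} but not {a} ({ratio:.0%} co-change coupling)")
    # Source files changed without their mapped test file.
    for src in changed:
        test = test_map.get(src)
        if test and test not in changed:
            warnings.append(f"Modified {src} but not its test file {test}")
    return warnings

warnings = check_diff(
    changed={"src/git/utils.rs"},
    coupling={("src/git/source.rs", "src/git/utils.rs"): 0.87},
    test_map={},
)
```

Because everything reduces to set lookups over precomputed data, this kind of check runs in well under a second, which is what makes it viable as a hook.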

sourcebook check --ai sends the diff plus dependency context to Claude Sonnet for semantic analysis — catches things like interface changes that need downstream updates. About $0.012 per run.
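I don't know sourcebook's actual prompt, but the packaging step is conceptually simple. A hypothetical sketch of assembling the payload before the API call (the function and format here are illustrative, not the real implementation):

```python
def build_ai_check_prompt(diff_text, coupling_notes):
    """Assemble a completeness-review prompt from a diff plus the
    scanner's structural notes. Hypothetical format for illustration."""
    notes = "\n".join(f"- {note}" for note in coupling_notes)
    return (
        "Review this diff for COMPLETENESS, not correctness.\n"
        "Known structural couplings in this repo:\n"
        f"{notes}\n\n"
        f"Diff:\n{diff_text}\n\n"
        "List files or updates this change likely still needs, "
        "such as interface changes that require downstream updates."
    )

prompt = build_ai_check_prompt(
    "--- a/src/git/utils.rs\n+++ b/src/git/utils.rs",
    ["src/git/utils.rs co-changes with src/git/source.rs (87%)"],
)
```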

The validation

Tested on 30 real diffs. Not synthetic examples — actual incomplete changes from the benchmark failures and real-world repos.

validation results (30 diffs)

Completeness gate:      100%  (30/30 — correctly flags incomplete vs clean)
False positives:        0%   (clean diffs never flagged)
Test file detection:    73%  (finds missing test files)
Sibling detection:      71%  (finds missing co-changed modules)
Hard semantic catches:  33%  (this is where --ai helps)

100% on the completeness gate means it never lets an incomplete change through silently and never blocks a clean change. That's the binary that matters — is the agent done or not?

The 73% and 71% numbers are about specificity — which files are missing. Good enough to be useful, honest enough to admit it's not perfect. The 33% on hard semantic catches is exactly why --ai exists: some relationships can't be found in git history alone.

The integration

The real product isn't a CLI you run manually. It's a hook that runs automatically.

sourcebook init sets up Claude Code hooks. The agent edits code, sourcebook checks the diff before commit, and the agent sees what's missing and fixes it — all before the change ever leaves the local machine.

$ sourcebook init

Setting up hooks...
  + Pre-commit hook: sourcebook check
  + Stop hook: sourcebook check --ai (optional)
  + MCP server registered

Agent workflow:
  1. Agent edits code
  2. sourcebook checks the diff
  3. Agent sees missing files
  4. Agent completes the change
  5. Clean commit
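
For context, Claude Code hooks live in `.claude/settings.json`; a wiring like the above might look roughly like this (treat the exact keys as an assumption against the current hooks schema, not what sourcebook init actually writes):

```json
{
  "hooks": {
    "Stop": [
      {
        "hooks": [
          { "type": "command", "command": "sourcebook check" }
        ]
      }
    ]
  }
}
```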

Four surfaces: CLI, hooks, MCP server, and GitHub App (coming). The CLI, hooks, and MCP server are free. The GitHub App is where teams get value — completeness checks on every PR, same analysis applied to human and agent commits alike.

What I learned

Two weeks of building context tools. Discovering agents can't use them. Finding a better application for the same structural analysis.

The data killed the hypothesis. I could have rationalized — maybe the benchmark was wrong, maybe the context needed to be formatted differently, maybe agents just need to be prompted harder to read the file. I've seen other devtools make those excuses. The numbers don't lie. 41% vs 42% is nothing.

The pivot came from looking at failures one by one. Not aggregate scores — individual diffs. That's where the pattern was hiding. The agent isn't wrong. It's incomplete. And the same structural data that was useless as context is extremely useful as a checklist.

Sometimes the right product is the one the data tells you to build, not the one you planned.

npx sourcebook check

github.com/maroondlabs/sourcebook
