I build sourcebook — a CLI that analyzes codebases for AI coding agents. Import graphs, git forensics, convention detection, co-change coupling. It generates context files (CLAUDE.md, AGENTS.md) with the kind of project knowledge that agents can't infer from reading code alone.
I had a clear thesis: give agents better maps and they'll write better code. I spent two weeks testing it. The data said no. Then the data showed me what actually matters.
The hypothesis
The scanner maps everything an agent should know before editing a codebase: which files are imported by hundreds of others (hub files), which commits were reverted (approaches the team rejected), which files always change together (co-change pairs), where the generated code lives (files an agent shouldn't edit).
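The co-change signal behind those pairs can be mined straight from git history. Here's a minimal sketch, not sourcebook's actual implementation; the coupling definition (co-changes divided by a file's total changes) and the `min_commits` cutoff are illustrative assumptions:

```python
from collections import Counter
from itertools import combinations

def co_change_coupling(commits, min_commits=2):
    """Given commits as iterables of file paths, return a map
    {(a, b): score} where score is the fraction of commits
    touching `a` that also touched `b`."""
    file_counts = Counter()
    pair_counts = Counter()
    for files in commits:
        files = set(files)
        file_counts.update(files)
        for a, b in combinations(sorted(files), 2):
            pair_counts[(a, b)] += 1
    coupling = {}
    for (a, b), co in pair_counts.items():
        if co < min_commits:
            continue  # too little evidence to call it a pair
        coupling[(a, b)] = co / file_counts[a]
        coupling[(b, a)] = co / file_counts[b]
    return coupling

# Toy history: utils.rs and source.rs usually change together.
history = [
    {"git/utils.rs", "git/source.rs"},
    {"git/utils.rs", "git/source.rs"},
    {"git/utils.rs"},
    {"README.md"},
]
c = co_change_coupling(history)
print(c[("git/utils.rs", "git/source.rs")])  # 2/3 ≈ 0.667
```

On a real repo you'd feed it per-commit file lists from `git log --name-only`.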
I scanned 30+ repos, fixed 12 bugs in the scanner, built a QA suite with 69 regression tests. The scanner works. It produces accurate, verifiable structural analysis. That was never the question.
The question was: does giving this to an agent actually help?
The benchmark
I ran 30+ real GitHub issues across 17 repos in 4 languages with a pinned model. Controlled conditions, multiple runs, no cherry-picking. Four treatment groups:
- No context — agent gets only the issue description and repo access
- sourcebook context — agent gets the generated AGENTS.md with structural analysis
- Handwritten developer briefs — agent gets a human-written project guide
- MCP tools — agent gets live access to sourcebook's analysis via tool calls
This is the part most devtools skip — honestly measuring whether the tool actually helps. It's uncomfortable to build something for two weeks and then subject it to a controlled experiment. But if you don't, you're selling vibes.
The results
sourcebook context: 41% file recall. No context: 42%. Statistically nothing.
Handwritten developer briefs: worse on hard tasks. Went from 42% to 19%. Agents anchored on the summaries and stopped exploring. The human-written guide became a ceiling, not a floor.
MCP tools: agents called them 0 times across 3 runs. When an agent has grep and file read built in, it ignores optional tools entirely. They don't browse menus of available tools. They use what's in front of them.
```
benchmark results (30+ tasks, pinned model)

No context:          42% file recall
sourcebook context:  41% file recall (no improvement)
Developer briefs:    19% on hard tasks (actively worse)
MCP tools:           0 tool calls (completely ignored)
```
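For concreteness, file recall can be computed as the fraction of ground-truth files (the files the reference fix touched) that the agent's diff also touched; the exact scoring used in the benchmark may differ, but a minimal sketch looks like:

```python
def file_recall(modified, gold):
    """Fraction of ground-truth files (those changed by the
    reference fix) that the agent's diff also touched."""
    gold = set(gold)
    if not gold:
        return 1.0  # nothing to find
    return len(gold & set(modified)) / len(gold)

# Agent touched the primary file but missed a sibling and a test.
print(file_recall(
    modified={"git/utils.rs"},
    gold={"git/utils.rs", "git/source.rs", "tests/git.rs"},
))  # 1/3 ≈ 0.333
```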
The map was accurate. Agents just don't use context effectively. Two weeks of scanner work, and the hypothesis was dead.
The pattern
I could have stopped there. Instead I looked at failures one by one. Not the scores — the actual diffs.
The code was correct. The changes were incomplete.
6 out of 9 hard-task failures followed the exact same pattern: the agent finds the primary file, makes a clean edit, and stops. Test files untouched. Sibling modules unchanged. Config files stale. The agent solved the core problem and declared victory without finishing the job.
- Ansible — found `galaxy.py`, made the fix, missed the proxy module that shares the same interface
- Cargo — found `git/utils.rs`, edited it correctly, missed `git/source.rs`, which uses the same function
- Terraform — found `plan_instance.go`, updated one planning path, missed `plan.go`, which handles the parallel case
This isn't a context problem. The agent doesn't need more information upfront. It needs someone checking whether the work is actually done.
It's a completeness problem.
The pivot
I stopped trying to give agents more information. I started checking their work instead.
The same engine — import graphs, co-change coupling, test file mappings — but applied differently. Instead of pre-task context, it's post-edit validation. Read the diff, check what changed against what should have changed, flag what's missing.
```
$ sourcebook check
Analyzing diff...

WARNING  Modified src/git/utils.rs but not src/git/source.rs
         These files have 87% co-change coupling (changed together in 14/16 commits)

WARNING  Modified src/planning/plan_instance.go but not tests
         No test file found covering plan_instance.go

CLEAN    src/config/defaults.rs — no coupled files affected
```
sourcebook check reads your diff and flags missing companion files. No LLM, under a second, deterministic.
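The deterministic check reduces to a few lines: given the files a diff touches and a precomputed coupling map, flag any strongly coupled companion that wasn't edited. A sketch, where the 0.8 threshold and the coupling-map shape are illustrative assumptions, not sourcebook's real internals:

```python
def check_diff(changed, coupling, threshold=0.8):
    """Flag files strongly co-changed with an edited file but
    absent from the diff. coupling[(a, b)] is the fraction of
    commits touching `a` that also touched `b`."""
    changed = set(changed)
    warnings = []
    for (a, b), score in coupling.items():
        if a in changed and b not in changed and score >= threshold:
            warnings.append(
                f"Modified {a} but not {b} ({score:.0%} co-change)"
            )
    return warnings

coupling = {("git/utils.rs", "git/source.rs"): 0.87}
for w in check_diff({"git/utils.rs"}, coupling):
    print(w)
# Modified git/utils.rs but not git/source.rs (87% co-change)
```

Because it's a dictionary scan over pairs already mined from git history, there's no model call in the loop, which is what keeps it deterministic and sub-second.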
sourcebook check --ai sends the diff plus dependency context to Claude Sonnet for semantic analysis — catches things like interface changes that need downstream updates. About $0.012 per run.
The validation
Tested on 30 real diffs. Not synthetic examples — actual incomplete changes from the benchmark failures and real-world repos.
```
validation results (30 diffs)

Completeness gate:      100% (30/30 — correctly flags incomplete vs clean)
False positives:        0%   (clean diffs never flagged)
Test file detection:    73%  (finds missing test files)
Sibling detection:      71%  (finds missing co-changed modules)
Hard semantic catches:  33%  (this is where --ai helps)
```
100% on the completeness gate means it never lets an incomplete change through silently and never blocks a clean change. That's the binary that matters — is the agent done or not?
The 73% and 71% numbers are about specificity — which files are missing. Good enough to be useful, honest enough to admit it's not perfect. The 33% on hard semantic catches is exactly why --ai exists: some relationships can't be found in git history alone.
The integration
The real product isn't a CLI you run manually. It's a hook that runs automatically.
sourcebook init sets up Claude Code hooks. The agent edits code, sourcebook checks the diff before commit, the agent sees what's missing, and fixes it — all before the change ever leaves the local machine.
```
$ sourcebook init
Setting up hooks...
  + Pre-commit hook: sourcebook check
  + Stop hook: sourcebook check --ai (optional)
  + MCP server registered

Agent workflow:
  1. Agent edits code
  2. sourcebook checks the diff
  3. Agent sees missing files
  4. Agent completes the change
  5. Clean commit
```
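A pre-commit hook wrapper is roughly this shape: run the check, and abort the commit if it emitted warnings. This is a hypothetical sketch; treating any WARNING line in the output as a blocker (rather than relying on the CLI's exit code, which may already encode this) is my assumption:

```python
import subprocess
import sys

def should_block(report: str) -> bool:
    """Block the commit iff the check output contains a warning.
    (Assumes warning lines start with "WARNING"; the real CLI's
    exit-code convention may make this parsing unnecessary.)"""
    return any(line.startswith("WARNING") for line in report.splitlines())

def run_check() -> int:
    """Invoke `sourcebook check` and turn warnings into a
    non-zero exit status so git aborts the commit."""
    result = subprocess.run(
        ["sourcebook", "check"], capture_output=True, text=True
    )
    sys.stdout.write(result.stdout)
    return 1 if should_block(result.stdout) else 0

# Parsing demo (doesn't require sourcebook to be installed):
print(should_block("WARNING  Modified src/a.rs but not src/b.rs"))  # True
```

Dropped into `.git/hooks/pre-commit` (or wired up by `sourcebook init`), a non-zero return is all git needs to stop the commit.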
Four surfaces: CLI, hooks, MCP server, and GitHub App (coming). The CLI, hooks, and MCP server are free. The GitHub App is where teams get value — completeness checks on every PR, same analysis applied to human and agent commits alike.
What I learned
Two weeks of building context tools. Discovering agents can't use them. Finding a better application for the same structural analysis.
The data killed the hypothesis. I could have rationalized — maybe the benchmark was wrong, maybe the context needed to be formatted differently, maybe agents just need to be prompted harder to read the file. I've seen other devtools make those excuses. The numbers don't lie. 41% vs 42% is nothing.
The pivot came from looking at failures one by one. Not aggregate scores — individual diffs. That's where the pattern was hiding. The agent isn't wrong. It's incomplete. And the same structural data that was useless as context is extremely useful as a checklist.
Sometimes the right product is the one the data tells you to build, not the one you planned.
```
npx sourcebook check
```