BENCHMARK · MARCH 2026

Benchmark methodology and full results

22 runs across 3 repos and 4 context conditions. How we tested, what we measured, and an honest accounting of what the data proves.

Setup

Harness: Custom benchmark harness v0.2.0
Model: claude-sonnet-4-20250514 (pinned, consistent across all runs)
Repos: Cal.com (calcom/cal.com), Hono (honojs/hono), Pydantic (pydantic/pydantic)
Tasks: 5 real GitHub issues (1 excluded due to checkout bug)
Conditions: none (no context file), handwritten (human-authored CLAUDE.md), Repomix (repo dump), sourcebook (auto-generated brief)

Task selection

Tasks were selected from real, closed GitHub issues. Criteria: the issue had a merged PR, the fix was non-trivial (touched 1+ files, 20+ lines changed), and the task was self-contained enough that an agent could attempt it given only the issue description.

cal.com #27298 — OAuth flow untranslated strings (i18n fix)
cal.com #27907 — PayPal untranslated strings (i18n fix)
hono #4806 — Request body caching (runtime behavior fix)
pydantic #12715 — JSON schema generation (type system fix)
~~pydantic #12424~~ — Model rebuild (excluded: checkout bug meant the issue was already resolved in the checked-out code)
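
The selection criteria above can be written as a simple filter. This is only a sketch: the `Candidate` record and its field names are hypothetical, the example values are made up for illustration, and the "self-contained" check is assumed to be a human judgment rather than something computed.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """Hypothetical record for a closed issue; field names are illustrative."""
    issue: str
    has_merged_pr: bool
    files_touched: int
    lines_changed: int
    self_contained: bool  # assumed to be judged by a reviewer, not computed


def qualifies(c: Candidate) -> bool:
    """Apply the stated thresholds: merged PR, 1+ files, 20+ lines, self-contained."""
    return (
        c.has_merged_pr
        and c.files_touched >= 1
        and c.lines_changed >= 20
        and c.self_contained
    )


# Made-up candidates, not real issues from the benchmark.
candidates = [
    Candidate("example/repo #1", True, 3, 120, True),
    Candidate("example/repo #2", True, 1, 4, True),     # too small
    Candidate("example/repo #3", False, 2, 60, True),   # no merged PR
]
selected = [c.issue for c in candidates if qualifies(c)]
print(selected)
```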

What we measured

Time: Wall clock seconds from task start to patch output
Tokens: Total input + output tokens consumed
Files changed: Number of files in the generated patch
Patch lines: Total lines in the diff (adds + deletes)

We did not measure correctness (tests passing, lint clean, match with reference PR). This is the most important limitation. More patch lines do not necessarily mean a better patch: they could reflect more thorough coverage, or they could reflect unnecessary changes.
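
As a rough illustration of the two patch metrics, the sketch below counts files and changed lines in a unified diff. The counting rule (added plus deleted lines, file headers excluded) is an assumption, since the harness's own implementation is not shown here, and `patch_stats` is a hypothetical name.

```python
def patch_stats(diff_text: str) -> tuple[int, int]:
    """Return (files_changed, patch_lines) for a git-style unified diff.

    Assumption: patch lines = added + deleted lines, with the '+++' and
    '---' file headers excluded. A deleted line whose content itself starts
    with '--' would be miscounted as a header; this sketch ignores that case.
    """
    files = 0
    changed = 0
    for line in diff_text.splitlines():
        if line.startswith("diff --git "):
            files += 1
        elif line.startswith(("+++", "---")):
            continue  # file header, not a content change
        elif line.startswith(("+", "-")):
            changed += 1
    return files, changed


# A tiny made-up diff touching two files.
sample = """\
diff --git a/app.py b/app.py
--- a/app.py
+++ b/app.py
@@ -1,3 +1,3 @@
 import os
-print("hi")
+print("hello")
 x = 1
diff --git a/util.py b/util.py
--- a/util.py
+++ b/util.py
@@ -1,1 +1,2 @@
 y = 2
+z = 3
"""

print(patch_stats(sample))  # (2, 3): two files, one deletion plus two additions
```

Neither number says anything about whether the patch is correct, which is why these metrics are only a proxy.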

Full results (initial run — v0.3)

Format: time / files / patch lines / tokens

Task	None	Handwritten	Repomix	sourcebook v0.3
cal.com #27298 OAuth i18n	241s / 5f / 381L / 22K	120s / 3f / 321L / 13K	176s / 3f / 283L / 20K	140s / 3f / 350L / 16K
cal.com #27907 PayPal i18n	94s / 5f / 314L / 11K	115s / 5f / 362L / 15K	121s / 5f / 314L / 13K	103s / 5f / 390L / 14K
hono #4806 Body caching	253s / 1f / 31L / 21K	323s / 2f / 86L / 29K	245s / 2f / 82L / 27K	274s / 2f / 82L / 28K
pydantic #12715 JSON schema	312s / 1f / 99L / 35K	180s / 2f / 83L / 19K	208s / 1f / 35L / 23K	315s / 1f / 28L / 34K

Aggregate (4 valid tasks, initial run)

Condition	Avg tokens	Avg time	Avg patch lines
none	22,133	225s	206
handwritten	18,966	185s	213
repomix	20,958	188s	178
sourcebook v0.3	22,720	208s	213
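
The time and patch-line averages can be recomputed from the per-task table. The sketch below copies those values by hand (variable names are illustrative). Token averages are left out because the per-task table only shows tokens rounded to the nearest thousand, so the exact averages cannot be reproduced from it.

```python
from statistics import mean

# (time_s, patch_lines) per task, copied from the initial-run results table.
# Order: cal.com #27298, cal.com #27907, hono #4806, pydantic #12715.
results = {
    "none":        [(241, 381), (94, 314), (253, 31), (312, 99)],
    "handwritten": [(120, 321), (115, 362), (323, 86), (180, 83)],
    "repomix":     [(176, 283), (121, 314), (245, 82), (208, 35)],
    "sourcebook":  [(140, 350), (103, 390), (274, 82), (315, 28)],
}

# Unrounded (avg_time, avg_patch_lines) per condition.
averages = {
    name: (mean(t for t, _ in runs), mean(n for _, n in runs))
    for name, runs in results.items()
}

for name, (avg_time, avg_lines) in averages.items():
    print(f"{name:<12} {avg_time:6.1f}s {avg_lines:7.2f} lines")
```

The unrounded values (for example 184.5s for handwritten and 187.5s for repomix) show how close some conditions are before rounding.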

Version progression (sourcebook only)

We re-ran the sourcebook condition with v0.4.1 and v0.5 on the same tasks to track improvement.

Task	Handwritten	sb v0.3	sb v0.4.1	sb v0.5
cal.com #27298	120s / 3f / 321L	140s / 3f / 350L	139s / 5f / 438L	136s / 6f / 469L
cal.com #27907	115s / 5f / 362L	103s / 5f / 390L	116s / 6f / 423L	113s / 6f / 458L
hono #4806	323s / 2f / 86L	274s / 2f / 82L	—	267s / 2f / 101L
pydantic #12715	180s / 2f / 83L	315s / 1f / 28L	—	308s / 2f / 117L

v0.4.1 added dominant pattern detection (convention scanning). v0.5 added repo-mode specialization (app vs library analysis). A dash indicates that the version was not tested on that task.

What the data shows

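sourcebook is closing the gap with handwritten. On the two Cal.com tasks combined, v0.5 is about 6% slower than handwritten (249s vs 235s), and it is faster than handwritten on 2 of 4 tasks (cal.com #27907 and hono #4806).
Patch thoroughness consistently exceeds handwritten. On all 4 tasks, v0.5 produces more patch lines and touches an equal or greater number of files. Pooled across the four tasks that is about 34% more lines (1,145 vs 852); on the two Cal.com tasks alone it is about 36% (927 vs 683).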
Library repos improved significantly. Pydantic went from 1 file / 28 lines (v0.3) to 2 files / 117 lines (v0.5) after repo-mode specialization.
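Cal.com is close to parity on time. v0.5 finished within 16s of handwritten on both tasks (136s vs 120s, 113s vs 115s), though with single runs we cannot tell whether those gaps exceed run-to-run noise. Patches were larger in both cases, and in the initial run v0.3 used slightly fewer tokens than handwritten on #27907 (14K vs 15K).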
sourcebook may have better judgment. On the excluded pydantic task, sourcebook correctly identified no work was needed — while the no-context and handwritten conditions blindly wrote 54-55 line patches for an already-fixed issue.
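
The time and patch-size comparisons above can be checked against the version progression table. The sketch below copies the handwritten and v0.5 columns by hand; `pct_change` and the variable names are illustrative. The average patch-size gap depends on how it is taken (pooled totals versus the mean of per-task ratios), so the two methods can differ by a point or two.

```python
# (time_s, files, patch_lines) copied from the version progression table.
handwritten = {
    "cal.com #27298": (120, 3, 321),
    "cal.com #27907": (115, 5, 362),
    "hono #4806": (323, 2, 86),
    "pydantic #12715": (180, 2, 83),
}
sb_v05 = {
    "cal.com #27298": (136, 6, 469),
    "cal.com #27907": (113, 6, 458),
    "hono #4806": (267, 2, 101),
    "pydantic #12715": (308, 2, 117),
}


def pct_change(new: float, base: float) -> float:
    """Relative difference of new versus base, in percent."""
    return (new - base) / base * 100


# Per-task gaps: negative time means sourcebook v0.5 was faster.
for task, (hw_time, _, hw_lines) in handwritten.items():
    sb_time, _, sb_lines = sb_v05[task]
    print(f"{task:<16} time {pct_change(sb_time, hw_time):+6.1f}%  "
          f"lines {pct_change(sb_lines, hw_lines):+6.1f}%")

# Tasks where v0.5 beat handwritten on wall-clock time.
faster = [t for t in handwritten if sb_v05[t][0] < handwritten[t][0]]

# Pooled gaps: Cal.com time, and patch lines across all four tasks.
calcom = [t for t in handwritten if t.startswith("cal.com")]
calcom_time_gap = pct_change(
    sum(sb_v05[t][0] for t in calcom), sum(handwritten[t][0] for t in calcom)
)
total_lines_gap = pct_change(
    sum(v[2] for v in sb_v05.values()), sum(v[2] for v in handwritten.values())
)
print(faster, round(calcom_time_gap, 1), round(total_lines_gap, 1))
```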

What the data does not show

"More thorough" is not yet proven "more correct." More patch lines could mean better coverage, or it could mean unnecessary changes. Correctness scoring (tests pass, lint clean, match with reference PR) is the most important next step.
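sourcebook is not proven more efficient overall. In the initial run, pydantic #12715 used about 1.8x the tokens of handwritten (34K vs 19K), and the version progression table does not report tokens for later versions.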
Single runs, no statistical confidence. Each data point is one run. Variance between runs is unknown.
Small task set. 4 valid tasks across 3 repos. More diversity is needed to generalize.

Next steps for the benchmark

Add correctness scoring (tests, lint, file overlap with reference PR)
Expand to 20-30 tasks with more repo diversity
Run 2-3 reruns per task/condition for statistical confidence
Benchmark on a fresh set of tasks (not re-runs) to check for overfitting
Compare sourcebook output vs handwritten to identify what humans emphasize that sourcebook misses
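
One possible shape for the file-overlap part of correctness scoring is a Jaccard similarity between the files in the generated patch and the files in the reference PR. This is a sketch of one option, not a committed design; `file_overlap` and the file paths are hypothetical.

```python
def file_overlap(generated: set[str], reference: set[str]) -> float:
    """Jaccard similarity between two sets of changed file paths.

    1.0 means the patch touched exactly the files the reference PR touched;
    0.0 means no files in common.
    """
    if not generated and not reference:
        return 1.0  # both empty: treat as a perfect match
    return len(generated & reference) / len(generated | reference)


# Made-up paths for illustration only.
generated = {"src/a.ts", "src/b.ts", "locales/en.json"}
reference = {"src/a.ts", "locales/en.json", "locales/fr.json"}

print(file_overlap(generated, reference))  # 2 shared files / 4 distinct files = 0.5
```

A score like this penalizes both missed files and extra files, which would help separate thorough patches from ones that only make unnecessary changes.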

$ npx sourcebook init
