BENCHMARK · MARCH 2026

Benchmark methodology and full results

22 runs across 3 repos and 4 context conditions. How we tested, what we measured, and an honest accounting of what the data proves.

Setup

Task selection

Tasks were selected from real, closed GitHub issues. Criteria: the issue had a merged PR, the fix was non-trivial (touched 1+ files, 20+ lines changed), and the task was self-contained enough that an agent could attempt it given only the issue description.
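The selection criteria above can be expressed as a simple filter. This is a hypothetical sketch; the field names (merged_pr, files_changed, lines_changed, self_contained) are illustrative, not a real GitHub API schema.

```python
def is_valid_task(issue: dict) -> bool:
    """Criteria from the setup: a merged PR exists, the fix touched at
    least one file and changed 20+ lines, and the issue description is
    self-contained enough for an agent to attempt."""
    return bool(
        issue.get("merged_pr", False)
        and issue.get("files_changed", 0) >= 1
        and issue.get("lines_changed", 0) >= 20
        and issue.get("self_contained", False)
    )
```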

What we measured

We measured four things per run: wall-clock time, files touched, patch lines, and tokens consumed. We did not measure correctness (tests passing, lint clean, match with the reference PR), and this is the most important limitation. More patch lines doesn't necessarily mean better: it could mean more thorough coverage, or it could mean unnecessary changes.
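The harness itself isn't shown here, but one plausible way to derive the files and patch-lines columns is from `git diff --numstat` output. This sketch assumes "patch lines" means added plus deleted lines, which is an interpretation, not a documented definition.

```python
def patch_metrics(numstat_output: str) -> tuple[int, int]:
    """Derive (files touched, patch lines) from `git diff --numstat`,
    where each line is "<added>\t<deleted>\t<path>". Binary files are
    reported as "-" and counted as 0 lines."""
    files, lines = 0, 0
    for row in numstat_output.strip().splitlines():
        added, deleted, _path = row.split("\t", 2)
        files += 1
        if added != "-":
            lines += int(added) + int(deleted)
    return files, lines
```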

Full results (initial run — v0.3)

Format: time / files / patch lines / tokens

Task                           None                     Handwritten              Repomix                  sourcebook v0.3
cal.com #27298 (OAuth i18n)    241s / 5f / 381L / 22K   120s / 3f / 321L / 13K   176s / 3f / 283L / 20K   140s / 3f / 350L / 16K
cal.com #27907 (PayPal i18n)   94s / 5f / 314L / 11K    115s / 5f / 362L / 15K   121s / 5f / 314L / 13K   103s / 5f / 390L / 14K
hono #4806 (Body caching)      253s / 1f / 31L / 21K    323s / 2f / 86L / 29K    245s / 2f / 82L / 27K    274s / 2f / 82L / 28K
pydantic #12715 (JSON schema)  312s / 1f / 99L / 35K    180s / 2f / 83L / 19K    208s / 1f / 35L / 23K    315s / 1f / 28L / 34K

Aggregate (4 valid tasks, initial run)

Condition         Avg tokens   Avg time   Avg patch lines
none              22,133       225s       206
handwritten       18,966       185s       213
repomix           20,958       188s       178
sourcebook v0.3   22,720       208s       213
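The time and patch-line averages can be reproduced directly from the per-task table; a minimal check for the no-context condition, with values copied from the results above:

```python
# Recompute the "none" row of the aggregate table from the per-task
# results: (time_s, files, patch_lines) for the four valid tasks.
none_runs = [(241, 5, 381), (94, 5, 314), (253, 1, 31), (312, 1, 99)]

avg_time = sum(t for t, _, _ in none_runs) / len(none_runs)
avg_lines = sum(l for _, _, l in none_runs) / len(none_runs)

assert avg_time == 225.0    # matches the 225s in the aggregate row
assert avg_lines == 206.25  # reported rounded as 206
```

Note the token averages cannot be recomputed the same way, because per-task token counts in the results table are rounded to the nearest thousand.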

Version progression (sourcebook only)

We re-ran the sourcebook condition with v0.4.1 and v0.5 on the same tasks to track improvement.

Task              Handwritten        sb v0.3            sb v0.4.1          sb v0.5
cal.com #27298    120s / 3f / 321L   140s / 3f / 350L   139s / 5f / 438L   136s / 6f / 469L
cal.com #27907    115s / 5f / 362L   103s / 5f / 390L   116s / 6f / 423L   113s / 6f / 458L
hono #4806        323s / 2f / 86L    274s / 2f / 82L    -                  267s / 2f / 101L
pydantic #12715   180s / 2f / 83L    315s / 1f / 28L    -                  308s / 2f / 117L

v0.4.1 added dominant pattern detection (convention scanning). v0.5 added repo-mode specialization (app vs library analysis). Dash indicates version was not tested on that task.

What the data shows

  1. sourcebook is closing the gap with handwritten. v0.5 is within ~6% of handwritten speed on Cal.com and beats it on 2 of 4 tasks by time.
  2. Patch thoroughness consistently exceeds handwritten. Across all 4 tasks, v0.5 produces more patch lines and touches equal or more files. Average: +36% more lines.
  3. Library repos improved significantly. Pydantic went from 1 file / 28 lines (v0.3) to 2 files / 117 lines (v0.5) after repo-mode specialization.
  4. Cal.com is essentially solved. Both tasks are within noise range on time, with sourcebook producing more thorough patches for fewer tokens on #27907.
  5. sourcebook may have better judgment. On the excluded pydantic task, sourcebook correctly identified no work was needed — while the no-context and handwritten conditions blindly wrote 54-55 line patches for an already-fixed issue.

What the data does not show

  1. "More thorough" is not yet proven "more correct." More patch lines could mean better coverage, or it could mean unnecessary changes. Correctness scoring (tests pass, lint clean, match with reference PR) is the most important next step.
  2. sourcebook is not proven more efficient overall. Pydantic still uses ~2x tokens compared to handwritten.
  3. Single runs, no statistical confidence. Each data point is one run. Variance between runs is unknown.
  4. Small task set. 4 valid tasks across 3 repos. More diversity is needed to generalize.
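Correctness scoring is named above as the most important next step. A minimal rubric could look like the sketch below; all field names are hypothetical, and a real harness would populate them by applying the patch, running the repo's test suite and linter, and diffing against the merged reference PR.

```python
from dataclasses import dataclass

@dataclass
class RunResult:
    tests_pass: bool         # repo test suite green after applying the patch
    lint_clean: bool         # no new lint findings introduced
    matches_reference: bool  # patch agrees with the merged reference PR

def correctness_score(result: RunResult) -> int:
    """0-3 rubric: one point per independent check."""
    return (int(result.tests_pass)
            + int(result.lint_clean)
            + int(result.matches_reference))
```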

Next steps for the benchmark

In order of priority: correctness scoring (tests pass, lint clean, agreement with the reference PR), repeated runs to establish variance and statistical confidence, and a larger, more diverse task set across more repos.

