Setup
- Harness: Custom benchmark harness v0.2.0
- Model: claude-sonnet-4-20250514 (pinned, consistent across all runs)
- Repos: Cal.com (calcom/cal.com), Hono (honojs/hono), Pydantic (pydantic/pydantic)
- Tasks: 5 real GitHub issues selected; 1 was excluded due to a checkout bug, leaving 4
- Conditions: none (no context file), handwritten (human-authored CLAUDE.md), Repomix (repo dump), sourcebook (auto-generated brief)
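As a rough illustration, the setup above can be written down as a run matrix. This is a minimal sketch, not the harness's actual configuration: it assumes every retained task was run once under every condition, and the names `build_run_matrix`, `CONDITIONS`, and `TASKS` are hypothetical. The issue numbers are the four retained issues listed under Task selection.

```python
from itertools import product

# Values taken from the setup list; the structure itself is an assumption.
MODEL = "claude-sonnet-4-20250514"
CONDITIONS = ["none", "handwritten", "repomix", "sourcebook"]
TASKS = [
    ("calcom/cal.com", 27298),
    ("calcom/cal.com", 27907),
    ("honojs/hono", 4806),
    ("pydantic/pydantic", 12715),
]


def build_run_matrix(tasks=TASKS, conditions=CONDITIONS, model=MODEL):
    """Return one run description per (task, condition) pair."""
    return [
        {"repo": repo, "issue": issue, "condition": condition, "model": model}
        for (repo, issue), condition in product(tasks, conditions)
    ]


runs = build_run_matrix()
print(len(runs))  # 4 tasks x 4 conditions
```

Pinning the model in one constant keeps it identical across every run, which is what makes the conditions comparable.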
Task selection
Tasks were selected from real, closed GitHub issues. Criteria: the issue had a merged PR, the fix was non-trivial (it touched at least 1 file and changed at least 20 lines), and the task was self-contained enough that an agent could attempt it given only the issue description.
- cal.com #27298 — OAuth flow untranslated strings (i18n fix)
- cal.com #27907 — PayPal untranslated strings (i18n fix)
- hono #4806 — Request body caching (runtime behavior fix)
- pydantic #12715 — JSON schema generation (type system fix)
- pydantic #12424 — Model rebuild (excluded: a checkout bug meant the issue was already resolved in the checked-out code)
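The selection criteria can be expressed as a simple predicate. The sketch below is illustrative only: `IssueCandidate` and `is_eligible` are hypothetical names, the `self_contained` flag stands in for a manual judgment call, and the example records are made up rather than taken from the benchmark.

```python
from dataclasses import dataclass


@dataclass
class IssueCandidate:
    """Hypothetical record for a closed issue and the PR that fixed it."""
    repo: str
    number: int
    has_merged_pr: bool
    files_touched: int
    lines_changed: int
    self_contained: bool  # manual judgment: attemptable from the issue text alone


def is_eligible(c: IssueCandidate, min_files: int = 1, min_lines: int = 20) -> bool:
    """Apply the stated criteria: merged PR, non-trivial fix, self-contained task."""
    return (
        c.has_merged_pr
        and c.files_touched >= min_files
        and c.lines_changed >= min_lines
        and c.self_contained
    )


# Made-up candidates, for illustration only.
big_fix = IssueCandidate("example/repo", 1, True, 2, 35, True)
tiny_fix = IssueCandidate("example/repo", 2, True, 1, 5, True)
print(is_eligible(big_fix), is_eligible(tiny_fix))
```

Note that this filter only checks eligibility. It would not have caught the excluded pydantic task, which failed because of the checkout state and not because of the criteria.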
What we measured
- Time: Wall-clock seconds from task start to patch output
- Tokens: Total input + output tokens consumed
- Files changed: Number of files in the generated patch
- Patch lines: Total lines in the diff (adds + deletes)
We did not measure correctness (tests passing, lint clean, or match with the reference PR). This is the most important limitation. A larger patch is not necessarily a better one: more patch lines could mean more thorough coverage, or they could mean unnecessary changes.
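For concreteness, here is one way the files-changed and patch-lines metrics could be computed from a git-style unified diff. This is a sketch under assumptions, not the harness's actual code: `patch_stats` is a hypothetical helper, and it assumes each file section starts with a `diff --git` line and that added or deleted lines are counted only inside hunks (after an `@@` header).

```python
def patch_stats(diff_text: str) -> dict:
    """Count files and changed lines (adds + deletes) in a git-style unified diff."""
    files = added = deleted = 0
    in_hunk = False
    for line in diff_text.splitlines():
        if line.startswith("diff --git "):
            files += 1
            in_hunk = False  # the ---/+++ header lines that follow are not changes
        elif line.startswith("@@"):
            in_hunk = True
        elif in_hunk:
            if line.startswith("+"):
                added += 1
            elif line.startswith("-"):
                deleted += 1
    return {"files_changed": files, "patch_lines": added + deleted}


# Small made-up diff touching two files.
SAMPLE_DIFF = "\n".join([
    "diff --git a/a.txt b/a.txt",
    "index 111..222 100644",
    "--- a/a.txt",
    "+++ b/a.txt",
    "@@ -1,2 +1,2 @@",
    "-old line",
    "+new line",
    " unchanged",
    "diff --git a/b.txt b/b.txt",
    "index 333..444 100644",
    "--- a/b.txt",
    "+++ b/b.txt",
    "@@ -1 +1,2 @@",
    " keep",
    "+added",
])

print(patch_stats(SAMPLE_DIFF))  # {'files_changed': 2, 'patch_lines': 3}
```

Both numbers describe the size of a patch only. Neither says whether the patch fixes the issue.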
Full results (initial run — v0.3)
Format: time / files / patch lines / tokens