DATA · APRIL 2026

15 open-source repos: raw scan data

85,000+ files scanned, 135 findings extracted. Per-repo statistics, methodology, and what the data reveals about real codebase structure.

Methodology

Each repository was cloned to a local directory and scanned with sourcebook v0.8.3. The scan runs entirely locally with no API calls. It performs:

File tree walk — inventories all files, detects languages and frameworks from package.json, pyproject.toml, go.mod, etc.
Import graph construction — parses imports/requires across files, builds a dependency graph, ranks files by PageRank score to identify hub files.
Git history analysis — examines recent commits for co-change coupling (files that change together), reverted commits, and fragile files (many rapid edits).
Pattern matching — 50+ regex patterns detect naming conventions, test frameworks, auth patterns, routing approaches, and build tools.
Hybrid sampling — for large repos, samples representative files rather than parsing everything. Ensures scans complete in seconds, not minutes.
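To make the import-graph step concrete, here is a minimal sketch of ranking files by PageRank. It is not sourcebook's implementation: the `pagerank` function, the damping factor of 0.85, the iteration count, and the file names are all assumptions chosen for illustration.

```python
def pagerank(imports, damping=0.85, iterations=50):
    """Rank files in an import graph.

    `imports` maps each file to the list of files it imports.
    An edge src -> dst passes rank from the importer to the imported file,
    so heavily imported files end up with the highest scores.
    """
    nodes = set(imports)
    for targets in imports.values():
        nodes.update(targets)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}

    for _ in range(iterations):
        # Files that import nothing have no outgoing edges; spread their
        # rank evenly so the total always sums to 1.
        dangling = sum(rank[node] for node in nodes if not imports.get(node))
        new_rank = {
            node: (1 - damping) / n + damping * dangling / n for node in nodes
        }
        for src, targets in imports.items():
            if targets:
                share = damping * rank[src] / len(targets)
                for dst in targets:
                    new_rank[dst] += share
        rank = new_rank
    return rank


# Hypothetical import graph: three files import utils.ts.
graph = {
    "app.ts": ["utils.ts", "router.ts"],
    "router.ts": ["utils.ts"],
    "auth.ts": ["utils.ts"],
    "utils.ts": [],
}
ranks = pagerank(graph)
hubs = sorted(ranks, key=ranks.get, reverse=True)
print(hubs[0])
```

In this toy graph the most imported file ranks first, which is the sense in which a hub file is a chokepoint: many other files depend on it directly or transitively.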

Each finding is assigned a confidence level (high or medium) and a "discoverable" flag indicating whether an agent could reasonably find this information on its own. Only non-discoverable findings appear in the output brief.
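The filtering rule can be sketched as a small data structure. The `Finding` class, its field names, and the sample findings are hypothetical; only the two confidence levels and the "discoverable" flag come from the description above.

```python
from dataclasses import dataclass


@dataclass
class Finding:
    category: str       # hypothetical label, e.g. "hub file"
    confidence: str     # "high" or "medium"
    discoverable: bool  # True if an agent could find this on its own


def build_brief(findings):
    """Keep only the findings an agent could not reasonably discover itself."""
    return [f for f in findings if not f.discoverable]


# Illustrative findings, not real scan output.
findings = [
    Finding("test framework", "high", discoverable=True),
    Finding("co-change coupling", "medium", discoverable=False),
    Finding("hub file", "high", discoverable=False),
]
brief = build_brief(findings)
print([f.category for f in brief])
```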

Repos scanned

Repo	Language	Type
Next.js	TypeScript	Monorepo / Framework
Cal.com	TypeScript	Monorepo / App
Supabase	TypeScript	Monorepo / Platform
Django	Python	Framework
FastAPI	Python	Library
Express	JavaScript	Library
Flask	Python	Library
Hono	TypeScript	Library
Fastify	JavaScript	Library
Gin	Go	Library
Hugo	Go	Framework
Pydantic	Python	Library
shadcn/ui	TypeScript	Component library
Create T3 App	TypeScript	Scaffold / CLI
SQLModel	Python	Library

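Language breakdown: 7 TypeScript, 5 Python, 2 Go, 2 JavaScript (the total exceeds 15 because some repos were detected as both TS and JS). Repo types, counted from the table above: 3 monorepos (one framework, one app, one platform), 8 libraries, 2 standalone frameworks, 1 component library, 1 scaffold.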

Aggregate findings

135 total findings across 15 repos. The most common categories:

Hub files: Every repo has them. These are files imported by 10+ other modules, and most developers don't realize they are critical chokepoints. Changing a hub file without understanding its blast radius can cause cascading breakage.
Co-change coupling: Files that consistently change together in git history but live in different directories. These invisible dependencies are the #1 source of missed changes when agents modify one file without touching its partner.
Convention variance: Most repos have a dominant coding style (naming, imports, exports), but more than 60% of the scanned repos also have pockets of legacy code that follow a different convention. Agents that learn the dominant pattern may produce inconsistent code in legacy areas.
Generated file traps: Auto-generated files (lock files, compiled output, type declarations) that look like source code. Agents that try to edit these waste tokens and produce broken patches.
Fragile code: Files that required many rapid edits in recent history — usually indicating areas that are hard to get right. Agents should approach these with extra caution.
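As an illustration of the co-change coupling category, the sketch below counts how often two files in different directories appear in the same commit. It is one plausible approach, not sourcebook's actual algorithm: the `co_change_pairs` function, the `min_count` threshold, and the sample commits are assumptions. In practice the per-commit file lists could be collected with `git log --name-only`.

```python
from collections import Counter
from itertools import combinations
from pathlib import PurePosixPath


def co_change_pairs(commits, min_count=2):
    """Return cross-directory file pairs that changed together often.

    `commits` is a list of commits, each given as a list of changed paths.
    Pairs in the same directory are skipped, since that coupling is
    already visible from the file layout.
    """
    pair_counts = Counter()
    for files in commits:
        # Sort and de-duplicate so each pair has one canonical ordering.
        for a, b in combinations(sorted(set(files)), 2):
            if PurePosixPath(a).parent != PurePosixPath(b).parent:
                pair_counts[(a, b)] += 1
    return {pair: n for pair, n in pair_counts.items() if n >= min_count}


# Hypothetical history: routes.py and its docs page change together twice.
commits = [
    ["api/routes.py", "docs/routes.md"],
    ["api/routes.py", "docs/routes.md", "api/models.py"],
    ["api/models.py"],
]
coupled = co_change_pairs(commits)
print(coupled)
```

A raw count is the simplest possible signal. A more careful version might normalize by how often each file changes overall, so that frequently edited files do not dominate the results.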

What the scan doesn't capture

sourcebook's scan is structural, not semantic. It doesn't understand what the code does — only how it's organized, what depends on what, and how it's changed over time. It won't tell you that a function has a subtle off-by-one error, or that a particular API endpoint is deprecated. For that, you still need human judgment or a code-understanding LLM.

The scan also doesn't capture runtime behavior, test coverage percentages, or deployment configuration. It works with what's in the repo — source files and git history.

$ npx sourcebook init
