What We Found Scanning 15 Open-Source Repos With 100,000+ Files

We wanted to know what sourcebook actually finds when pointed at repos it's never seen before. Not toy projects — the real ones. The repos that power production apps, that have hundreds of contributors, that have been refactored and patched and hotfixed for years.

So we ran it on 15 open-source repositories. The smallest was Gin at 130 files. The largest was Next.js at 27,423. Combined: over 85,000 files scanned, 135 findings extracted, zero API keys used, everything local.

The repos: Next.js, Cal.com, Supabase, Django, FastAPI, Express, Flask, Hono, Fastify, Gin, Hugo, Pydantic, shadcn/ui, Create T3 App, and SQLModel.

Here's what we found.

Every repo has hub files — and agents don't know about them

Every single repo had between 2 and 5 files imported by 10 or more other files. In Next.js, the dependent counts ran higher: several modules had 40+ dependents. Cal.com had types.ts files imported by dozens of packages across its monorepo.

These are the files that break everything when changed. An agent should treat an edit to a core type definition very differently from an edit to a utility function, but without structural analysis it treats them the same. It has no idea that changing one file cascades through 30 others.

This showed up in every repo we scanned. Not some — every one. Hub files are a universal feature of codebases past a certain size, and they're invisible to agents that read files one at a time.
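
To make "hub file" concrete, here is a minimal sketch of fan-in counting in Python. It is not sourcebook's implementation. The regex, the naive path resolution, and the find_hub_files name are all assumptions made for illustration, and it only handles relative ES-style imports.

```python
import posixpath
import re
from collections import defaultdict

# Hypothetical, simplified matcher for relative imports such as:
#   import { User } from "../types"
IMPORT_RE = re.compile(r"""from\s+['"](\.{1,2}/[^'"]+)['"]""")


def find_hub_files(sources, min_dependents=10):
    """Return {module_path: dependent_count} for heavily imported modules.

    `sources` maps a file path to its text. Import specifiers are resolved
    naively against the importing file's directory, without extensions,
    path aliases, or index-file handling.
    """
    dependents = defaultdict(set)
    for path, text in sources.items():
        base = posixpath.dirname(path)
        for spec in IMPORT_RE.findall(text):
            target = posixpath.normpath(posixpath.join(base, spec))
            dependents[target].add(path)

    hubs = {t: len(d) for t, d in dependents.items() if len(d) >= min_dependents}
    # Most-depended-on modules first.
    return dict(sorted(hubs.items(), key=lambda kv: -kv[1]))
```

The threshold of 10 mirrors the cutoff used above. A real analysis would also need to resolve path aliases, package imports across a monorepo, and re-exports.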

TypeScript strict mode is off in repos you'd expect to have it on

Next.js and Cal.com both run without TypeScript strict mode. This matters for agents because strict mode changes what's valid TypeScript. An agent that adds strictNullChecks-style assertions to a non-strict codebase is writing code that's technically correct but stylistically wrong — and it'll look out of place in every PR review.

This is the kind of convention an agent can't learn from reading a single file. It needs to see the tsconfig.json and understand its implications across the codebase. sourcebook flags this automatically, and it's one of the findings agents use most.
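
As a rough illustration of the check, the sketch below reads a tsconfig and reports which strict-family flags are effectively on. It assumes the file is plain JSON (real tsconfig files may contain comments, which json.loads rejects) and it does not follow "extends". The function name is hypothetical and this is not how sourcebook necessarily does it.

```python
import json

# Compiler options that "strict": true turns on unless set individually.
STRICT_FAMILY = (
    "strictNullChecks",
    "noImplicitAny",
    "strictFunctionTypes",
    "strictBindCallApply",
    "strictPropertyInitialization",
    "noImplicitThis",
    "alwaysStrict",
)


def effective_strict_flags(tsconfig_text):
    """Return {flag: bool} for the strict-family options.

    An explicit per-flag setting wins; otherwise the flag inherits the
    value of "strict", which is treated as off when absent.
    """
    config = json.loads(tsconfig_text)
    options = config.get("compilerOptions", {})
    strict = bool(options.get("strict", False))
    return {flag: bool(options.get(flag, strict)) for flag in STRICT_FAMILY}
```

An agent could use a result like this to decide whether null-safety assertions match the conventions of the codebase it is editing.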

Co-change coupling is real and invisible

In Cal.com, we found file pairs that changed together in over 80% of commits — but had no import relationship. They lived in different directories. No static analysis would connect them. The only signal was in the git history: when one changes, the other always changes too.

Supabase had a similar pattern between its auth helpers and its database types. Django showed it between settings modules and URL configurations. This is the coupling that causes agents to submit PRs that pass CI but break something three directories away.

You can't find this by reading imports. You need git forensics — looking at which files move together across hundreds of commits. That's what sourcebook does, and it was one of the most consistently valuable findings across all 15 repos.
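
Here is a minimal sketch of the idea, assuming the commit history has already been parsed into a list of changed-file sets (for example from the output of git log --name-only). The ratio used, commits touching both files divided by commits touching either, is one reasonable reading of "changed together in over 80% of commits", not necessarily the definition sourcebook uses. The function name and thresholds are hypothetical.

```python
from collections import Counter
from itertools import combinations


def co_change_pairs(commits, min_ratio=0.8, min_commits=5):
    """Find file pairs that tend to change in the same commits.

    `commits` is an iterable of collections of file paths, one per commit.
    Returns (file_a, file_b, ratio) tuples, strongest coupling first, where
    ratio = commits touching both / commits touching either (an assumption).
    """
    file_counts = Counter()
    pair_counts = Counter()
    for files in commits:
        unique = sorted(set(files))
        file_counts.update(unique)
        pair_counts.update(combinations(unique, 2))

    results = []
    for (a, b), together in pair_counts.items():
        either = file_counts[a] + file_counts[b] - together
        # Skip pairs with too little history to be meaningful.
        if either < min_commits:
            continue
        ratio = together / either
        if ratio >= min_ratio:
            results.append((a, b, ratio))
    return sorted(results, key=lambda r: -r[2])
```

To surface only the invisible coupling described above, the pairs would still need to be filtered against the import graph, keeping those with no import relationship. Very large commits, such as bulk reformatting, would probably need to be excluded as well, since they pair every file with every other.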

Convention counts vary wildly

Cal.com produced 21 findings. Pydantic, Express, and Fastify each produced 4. This makes sense: Cal.com is a monorepo with multiple apps, custom testing patterns, Prisma schemas, tRPC routers, and Tailwind. Pydantic is a focused library with a tight scope.

The takeaway: the size of the repo doesn't determine how much context an agent needs. Complexity does. A 3,000-file library with a clean architecture might need less context than a 3,000-file app with three ORMs and two test frameworks.

The same 5 patterns, everywhere

Across all 15 repos, we saw the same tooling patterns repeat: pytest or Vitest for testing, Prisma or SQLAlchemy for database access, tRPC or REST for API layers, Tailwind for styling, and some form of barrel exports for module organization.

This is useful because it means convention detection doesn't need to handle infinite variety. The ecosystem has converged. What matters is which combination a specific repo uses and what deviations it has from the default setup — those deviations are the non-discoverable information agents need.

Generated files are a trap

Several repos had files that look like hand-written code but are actually auto-generated. Supabase has generated TypeScript types from its database schema. Next.js has generated route manifests. Django has auto-generated migration files.

An agent that edits these files will produce a working diff that gets overwritten on the next build. It's wasted work at best — and a confusing regression at worst, if someone commits the agent's changes and then a build clobbers them. sourcebook detects generated file markers and flags them so agents know not to touch them.
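
A simple version of marker detection can be sketched as below. The marker list is an assumption chosen for illustration and is not sourcebook's actual list. It relies on the fact that many generators write a header comment near the top of the file, as Django does with "# Generated by Django ..." in its migrations.

```python
# Hypothetical marker list; real tools likely use a longer, tuned set.
GENERATED_MARKERS = (
    "@generated",
    "do not edit",
    "auto-generated",
    "autogenerated",
    "generated by",
)


def looks_generated(text, head_lines=10):
    """Return True if a known marker appears in the first few lines.

    Only the head of the file is checked, so a marker phrase that appears
    deep inside hand-written code does not trigger a false positive.
    """
    head = "\n".join(text.splitlines()[:head_lines]).lower()
    return any(marker in head for marker in GENERATED_MARKERS)
```

Files without a header comment, such as some generated manifests, would slip past a check like this and need a path-based rule instead.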

What this means for agent context

The six findings above have one thing in common: none of them are discoverable by reading a single file. Hub file importance requires graph analysis. Co-change coupling requires git mining. Convention deviations require cross-file comparison. Generated file detection requires pattern matching against known markers.

This is exactly what should go in a context file — and exactly what most hand-written context files leave out. Nobody writes "types.ts is imported by 15 files, treat it carefully" in their CLAUDE.md. They write "this is a Next.js project using TypeScript." The agent already knows that.

The data from these 15 repos reinforces the core principle: the only context that helps is what the agent can't figure out on its own. Everything else is noise.

You can explore all 15 repos yourself at sourcebook.run/for.
