The LLM Doesn't Know What It Needs
Good context isn't context that fits. It's context that's relevant. The difference between those two things is the gap most AI-assisted development falls into — and closing it requires intelligence, not arithmetic.
You hand your LLM a codebase and ask it to help. It reads every file it can fit into the context window. It produces confident suggestions — and one of them adds a JOIN on a column that was renamed three migrations ago. Not because the model is bad, but because it was reasoning from everything it could see and had no way to know what actually mattered. A utility function imported everywhere got the same weight as the authentication middleware. A test fixture consumed tokens that could have carried the schema the model needed to understand the query. What the LLM had was volume, not relevance.
I’ve argued before that context quality matters more than prompting skill. That post was about the principle. This one is about what a year of building with LLMs taught me about the mechanism — the surprising parts, the parts I got wrong, and the part that was missing entirely.
Context is not a budget problem
The first version of cxpak treated context as arithmetic. You have N tokens of budget. Your codebase has M tokens of source. If M > N, truncate until it fits. Allocate percentages to sections — 30% for signatures, 20% for the module map, 15% for the dependency graph — and cut line by line when a section exceeds its share.
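To make the arithmetic concrete, here is a minimal sketch of that budget-first approach. It is not cxpak's actual code: the function names, the whitespace word count standing in for a tokenizer, and the 1000-token budget are all assumptions for illustration. Only the three percentage shares come from the text above.

```python
def allocate_budget(total_tokens: int, percent_shares: dict[str, int]) -> dict[str, int]:
    """Split a token budget across sections by fixed percentage shares."""
    return {name: total_tokens * pct // 100 for name, pct in percent_shares.items()}


def truncate_section(lines: list[str], budget: int) -> list[str]:
    """Keep lines in their original order until the budget is spent, then cut."""
    kept: list[str] = []
    used = 0
    for line in lines:
        cost = len(line.split())  # assumption: crude stand-in for a real tokenizer
        if used + cost > budget:
            break  # everything after this point is dropped, relevant or not
        kept.append(line)
        used += cost
    return kept


# Shares taken from the text; the remaining 35% would go to other sections.
SHARES = {"signatures": 30, "module_map": 20, "dependency_graph": 15}
budgets = allocate_budget(1000, SHARES)
```

Nothing in `truncate_section` looks at what a line says. Position alone decides what survives, which is the indifference to relevance described next.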
It worked. It was also wrong — in a way I didn’t understand until I watched how LLMs actually used the output.
Truncation is indifferent to relevance. A utility function that’s never called on the path you’re working on gets the same budget as the function at the centre of the dependency graph. Every token spent on irrelevant context is a token that could have carried something the model would actually reason about — and research confirms that irrelevant context doesn’t just waste space. Chroma’s 2025 “context rot” study found that every frontier LLM tested performed worse as input length increased, even when the relevant information was present. Noise competes with signal. It actively degrades the output.
The shift that changed everything was from budgeting to scoring. Instead of asking “what fits?”, ask “what matters?” — and compute the answer before the LLM sees anything.
Seven signals, and which ones surprised me
Determining what matters required building a relevance scoring system — and building it revealed things about how LLMs use context that I hadn’t expected.
The system scores every file in the codebase against the current task using seven weighted signals: path similarity, symbol match, import proximity, term frequency, recency, PageRank on the dependency graph, and embedding similarity. I tested these against real development tasks — debugging sessions, feature implementations, refactoring exercises — comparing which files the scoring surfaced against which files the developer actually needed to complete the task. Not a formal benchmark. A manual assessment across dozens of sessions, noting what worked and what failed miserably.
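The scoring idea can be sketched as a weighted sum over the seven signals. The weights below are hypothetical, chosen only so the example runs; the text does not state the real values, and the function names are illustrative. Each signal is assumed to be normalised to the range 0 to 1 before it is combined.

```python
# Hypothetical weights for illustration only; they sum to 1.0.
WEIGHTS = {
    "path_similarity": 0.15,
    "symbol_match": 0.20,
    "import_proximity": 0.15,
    "term_frequency": 0.10,
    "recency": 0.05,
    "pagerank": 0.25,
    "embedding_similarity": 0.10,
}


def relevance(signals: dict[str, float]) -> float:
    """Weighted sum of per-signal scores; a missing signal counts as 0."""
    return sum(weight * signals.get(name, 0.0) for name, weight in WEIGHTS.items())


def rank_files(scored: dict[str, dict[str, float]]) -> list[str]:
    """Return file paths ordered from most to least relevant."""
    return sorted(scored, key=lambda path: relevance(scored[path]), reverse=True)


example = {
    "config.py": {"recency": 1.0, "path_similarity": 0.2},
    "types.py": {"pagerank": 0.9, "import_proximity": 0.8},
}
ranking = rank_files(example)
```

Treating a missing signal as zero means the same function still ranks files when one signal, such as embeddings, is not computed at all.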
PageRank was the most surprising. Files that are imported by many other files — the hubs of the dependency graph — are disproportionately useful as context, even when they don’t directly match the query. A type definition file imported by thirty modules tells the model more about the system’s structure than the specific module you’re editing. I’d expected path similarity and symbol match to dominate. PageRank consistently outperformed both for tasks that required understanding the system rather than editing a single file. The LLM needs the hub, not just the leaf.
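For readers who have not applied PageRank to code, a minimal power-iteration sketch follows. The edges point from the importing file to the imported file, so rank accumulates on the hubs. The damping factor of 0.85 is the conventional default, and the file names and iteration count are assumptions; this is an illustration of the technique, not the tool's implementation.

```python
def pagerank(imports: dict[str, list[str]], damping: float = 0.85,
             iterations: int = 50) -> dict[str, float]:
    """Power-iteration PageRank over an import graph (importer -> imported)."""
    nodes = sorted(set(imports) | {t for targets in imports.values() for t in targets})
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Files that import nothing have no outgoing edges; spread their rank evenly.
        dangling = sum(rank[node] for node in nodes if not imports.get(node))
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {node: base for node in nodes}
        for source, targets in imports.items():
            if not targets:
                continue
            share = damping * rank[source] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical project: three modules all import a shared type definition file.
graph = {
    "handlers.py": ["types.py"],
    "billing.py": ["types.py"],
    "cli.py": ["types.py", "handlers.py"],
}
ranks = pagerank(graph)
```

In this toy graph `types.py` ends up with the highest rank even though no query mentions it, which is the hub effect described above.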
Recency was the least useful. Initially, I assumed recently modified files would be more relevant — they’re “active,” they represent current work. In practice, recency correlates weakly with relevance. A file modified yesterday might be a config tweak. A file untouched for six months might be the core domain model the LLM desperately needs. Weighting recency too heavily biased toward churn over structure.
Embeddings helped — but less than expected. Semantic similarity from vector embeddings adds genuine value for fuzzy, natural-language queries (“find the authentication logic”). For precise, structural queries — which are the majority of real development tasks — the deterministic signals are more reliable. The system works well without embeddings. It works slightly better with them.
The trade-off worth naming: computing this scoring layer adds time. Building the index for a large codebase takes seconds, not milliseconds. For interactive use, that cost is amortised by caching, but the cold start is real. Intelligence isn’t free. It is still cheaper than the alternative: an LLM that spends minutes grepping through twenty unrelated files and then reasons confidently from irrelevant code.
The architecture your LLM can’t see
The most consequential discovery wasn’t about scoring algorithms. It was about what was missing from the context entirely.
Every backend codebase has two architectures: the code and the data model. Most context tools only see one. The LLM reads the Python handler that queries the orders table but usually has no idea what columns that table has, what foreign keys constrain it, or what the migration history looks like. It sees the ORM model class but not the database migration that added the index the query depends on. It suggests an INSERT that violates a foreign key constraint it has never seen, and it does so confidently, because nothing in its context says otherwise.
Fixing this in cxpak required parsing SQL files for table definitions and relationships, detecting ORM patterns across languages, tracking migration sequences to understand how the schema evolved, and — the piece that connects everything — linking embedded SQL in application code back to the table definitions. When Python code contains SELECT * FROM orders JOIN users, that creates typed edges in the dependency graph linking the Python handler to the SQL table definitions. The dependency graph spans languages: application code → embedded query → schema.
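The linking step can be sketched with two regular expressions. This is a deliberate simplification: a real implementation would need a proper SQL parser, and the edge label `queries_table`, the file paths, and the function names are all hypothetical. It shows only the shape of the idea, which is to find table names in application source and resolve them to the file that defines each table.

```python
import re

CREATE_TABLE = re.compile(r"CREATE\s+TABLE\s+(?:IF\s+NOT\s+EXISTS\s+)?(\w+)", re.IGNORECASE)
TABLE_REF = re.compile(r"\b(?:FROM|JOIN|INTO|UPDATE)\s+(\w+)", re.IGNORECASE)


def tables_defined(sql_files: dict[str, str]) -> dict[str, str]:
    """Map each table name to the SQL file that defines it."""
    return {
        match.group(1).lower(): path
        for path, sql in sql_files.items()
        for match in CREATE_TABLE.finditer(sql)
    }


def sql_edges(code_files: dict[str, str], sql_files: dict[str, str]) -> list[tuple[str, str, str]]:
    """Return (code_file, edge_type, schema_file) edges for embedded SQL."""
    defined = tables_defined(sql_files)
    edges = set()
    for path, source in code_files.items():
        for match in TABLE_REF.finditer(source):
            table = match.group(1).lower()
            # Python's own "from x import y" also matches TABLE_REF; it is
            # filtered out here unless a module shares a name with a table.
            if table in defined:
                edges.add((path, "queries_table", defined[table]))
    return sorted(edges)


schema = {
    "schema/001_orders.sql": "CREATE TABLE orders (id INT PRIMARY KEY, user_id INT);",
    "schema/002_users.sql": "CREATE TABLE IF NOT EXISTS users (id INT PRIMARY KEY);",
}
code = {
    "handlers/orders.py": 'from db import run\n'
                          'run("SELECT * FROM orders JOIN users ON orders.user_id = users.id")',
}
edges = sql_edges(code, schema)
```

Once these edges exist, any graph-based signal computed over the dependency graph can reach the schema files from the handler that queries them.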
The effect was the largest single improvement in output quality I observed across a year of building with Claude. When the LLM can see the data model alongside the code, it stops hallucinating column names, stops suggesting joins on dropped fields, and starts understanding why the query is shaped the way it is. The schema was the context that was always missing. No amount of relevance scoring on the code alone could compensate for its absence.
The data model was one blind spot. Conventions were another. Every codebase has a DNA (naming patterns, error handling style, testing density, import structure, dependency layering) that shapes what “correct” looks like in that specific project. The answers are in the code itself, observable but undocumented. Developers capture fragments of this manually in CLAUDE.md files and Cursor rules, which I have written about as proto-ADRs. But the conventions are already there in the codebase, waiting to be extracted. When the LLM knows that this project uses snake_case, returns Result types instead of unwrapping, and keeps modules under 200 lines, and knows it because the pattern was observed rather than because someone wrote it down, the generated code stops feeling foreign.
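As one small illustration of extracting a convention rather than documenting it, the sketch below measures which naming style a set of Python sources uses for function names. It covers a single convention in a single language, the regular expressions are a rough approximation of each style, and the function names are hypothetical. It is not how any particular tool does it.

```python
import ast
import re
from collections import Counter

SNAKE = re.compile(r"^_*[a-z][a-z0-9]*(_[a-z0-9]+)*_*$")
CAMEL = re.compile(r"^[a-z]+(?:[A-Z][a-z0-9]*)+$")


def naming_style(name: str) -> str:
    """Classify an identifier as camelCase, snake_case, or other."""
    if CAMEL.match(name):
        return "camelCase"
    if SNAKE.match(name):
        return "snake_case"
    return "other"


def function_naming_convention(sources: list[str]) -> dict[str, float]:
    """Return the fraction of function definitions that use each naming style."""
    counts: Counter[str] = Counter()
    for source in sources:
        for node in ast.walk(ast.parse(source)):
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
                counts[naming_style(node.name)] += 1
    total = sum(counts.values())
    return {style: count / total for style, count in counts.items()} if total else {}


sources = [
    "def load_user(): pass\ndef save_user(): pass\n",
    "def parse_row(): pass\ndef getName(): pass\n",
]
convention = function_naming_convention(sources)
```

A result such as 75% snake_case is the kind of observed fact that can be handed to the model as context instead of a hand-written rule.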
These blind spots are not specific to my tooling. They’re structural: most AI coding assistants still treat SQL as strings by default and rely on hand-written instructions for conventions. Some are evolving — but until schema awareness and convention detection are the default rather than the exception, the gap between what the LLM sees and what it needs will persist.
Degrade gracefully, don’t truncate
The last insight was about what to do when the budget runs out — because it inevitably does.
The first version truncated: include everything until you hit the limit, then cut. The problem is that truncation creates a cliff. Files just inside the budget boundary get full detail. The file that would have been next gets nothing. The LLM’s context window has a sharp edge where relevant information drops to zero.
The replacement is progressive degradation — variable fidelity applied per file based on relevance and budget. The highest-priority files get full source. The next tier gets trimmed implementations. Lower-priority files get signatures and documentation comments only. The lowest tier gets a stub — just the file path and its role in the dependency graph.
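A minimal sketch of the tiering logic, under stated assumptions: the four tier names mirror the paragraph above, but the greedy selection rule, the `FileCosts` type, and the token costs are hypothetical. The sketch reserves the stub cost for every file first, then walks files in relevance order and gives each the richest tier that still fits.

```python
from dataclasses import dataclass

TIERS = ("full", "trimmed", "signatures", "stub")  # richest to leanest


@dataclass
class FileCosts:
    path: str
    relevance: float
    costs: dict[str, int]  # token cost of rendering the file at each tier


def assign_fidelity(files: list[FileCosts], budget: int) -> dict[str, str]:
    """Pick a fidelity tier per file so every file appears at least as a stub."""
    # Assumption: the budget is at least large enough to hold every stub.
    remaining = budget - sum(f.costs["stub"] for f in files)
    plan: dict[str, str] = {}
    for f in sorted(files, key=lambda item: item.relevance, reverse=True):
        chosen = "stub"
        for tier in TIERS:
            extra = f.costs[tier] - f.costs["stub"]  # cost beyond the reserved stub
            if extra <= remaining:
                chosen = tier
                remaining -= extra
                break
        plan[f.path] = chosen
    return plan


files = [
    FileCosts("auth.py", 0.9, {"full": 500, "trimmed": 300, "signatures": 100, "stub": 10}),
    FileCosts("orders.py", 0.6, {"full": 400, "trimmed": 200, "signatures": 80, "stub": 10}),
    FileCosts("utils.py", 0.3, {"full": 300, "trimmed": 150, "signatures": 60, "stub": 10}),
    FileCosts("fixtures.py", 0.1, {"full": 200, "trimmed": 100, "signatures": 45, "stub": 10}),
]
plan = assign_fidelity(files, budget=800)
```

Because the rule is greedy, a small low-relevance file can occasionally receive a richer tier than a large high-relevance one. A stricter version would cap each file's tier at the tier of the file ranked above it.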
The result is that the LLM gets maximum coverage at any budget. A tight budget doesn’t mean half the codebase is invisible — it means the whole codebase is visible at varying levels of detail, with the most relevant files at full fidelity and the least relevant as structural landmarks. The model can see the shape of the entire system and the detail of the parts that matter.
The LLM is powerful, but it’s passive — it works with whatever you give it, and it cannot accurately ask for what’s missing. The quality of the output is determined before the prompt is sent, by the quality of the context it receives. A year of building daily with LLMs taught me that the gap between good and bad AI output is almost never the model. It’s the context — whether it was scored for relevance or truncated for budget, whether it included the data model or ignored it, whether it degraded gracefully or cut blindly. The context is the product. Everything after it is a consequence. cxpak is open source — it’s how I build this context for every project I work on.