Long context windows are getting larger. Claude now offers a million-token context. Gemini goes further. The assumption is clear: more context means better results.
That assumption is wrong more often than it is right. And token cost scales regardless.
A recent discussion on r/ClaudeAI captured the frustration perfectly. Users loading entire codebases into million-token windows found the model struggling with simple tasks it had handled easily with smaller, curated context. The blessing became a curse. More information did not produce better answers—it produced slower, more expensive, less accurate ones.
This is not a bug in the model. The problem is a misunderstanding of how attention mechanisms work, and of why retrieval still matters even when you can fit everything.
The Attention Problem Nobody Talks About
Transformer models do not treat all tokens equally. Attention is a weighted distribution, and those weights have to go somewhere. Within each attention head, the softmax weights sum to one, so attention behaves like a fixed budget that is split across every token in the context. When you load 800,000 tokens of codebase into a context window, the model has to decide which of them matter for your specific query.
It does not always decide well.
Research on attention distribution shows consistent patterns: models tend to over-weight tokens near the beginning and end of the context, with a "lost in the middle" effect for content in between. A function buried at token 400,000 might as well not exist if the model's attention budget has already been spent on the boilerplate at the top of the file and the unrelated utility functions loaded after it.
The result: queries that worked perfectly with 20,000 tokens of carefully selected context start failing with 500,000 tokens of everything. The signal gets diluted in noise.
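The dilution can be illustrated with a toy calculation. This is a minimal sketch of a single softmax over made-up scores, not a model of any real system: the function names and the score values (5.0 for the one relevant token, 0.0 for every distractor) are assumptions chosen for illustration. Real models use many heads and layers with learned scores, so the effect is far less clean than this.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of raw attention scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def relevant_weight(n_distractors, relevant_score=5.0, distractor_score=0.0):
    """Weight given to one relevant token competing with n_distractors others.

    The scores are hypothetical; only the trend matters.
    """
    scores = [relevant_score] + [distractor_score] * n_distractors
    return softmax(scores)[0]

if __name__ == "__main__":
    for n in (100, 10_000, 1_000_000):
        print(f"{n:>9} distractors -> weight on relevant token: {relevant_weight(n):.6f}")
```

Even though the relevant token scores much higher than any single distractor, its share of the weight shrinks as the number of distractors grows, because the weights must still sum to one.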
Why "Just Load Everything" Feels Right But Is Not
The appeal is obvious. Retrieval is hard. Deciding what context matters requires judgment, tooling, and iteration. Loading everything sidesteps all of that.
Except it does not.
A 20,000-token context window forces curation. You have to think about what the model actually needs to answer your question. That constraint produces better results because it front-loads the work of relevance filtering to the human—who understands the task—rather than delegating it to attention weights that do not.
Large context windows remove the constraint without removing the underlying problem. The relevance filtering still has to happen. It just happens inside the model, where you cannot see it, debug it, or improve it.
And the meter keeps running. Million-token prompts are not free. In a standard transformer, the computational cost of attention scales quadratically with sequence length. What feels like a convenience is often an expensive way to get worse results.
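To put rough numbers on the quadratic claim, here is a back-of-the-envelope sketch. It counts only token-to-token comparisons in full self-attention and ignores everything else (feed-forward layers, caching, sparse or approximate attention, and provider pricing, which is usually billed per token). The function names are illustrative.

```python
def attention_pairs(n_tokens):
    """Token-to-token comparisons in one full self-attention pass: n squared."""
    return n_tokens * n_tokens

def relative_attention_cost(large, small):
    """How many times more attention comparisons the larger context needs."""
    return attention_pairs(large) / attention_pairs(small)

if __name__ == "__main__":
    print(relative_attention_cost(40_000, 20_000))     # 4.0
    print(relative_attention_cost(1_000_000, 20_000))  # 2500.0
```

Under this simplified model, doubling the context quadruples the attention work, and going from 20,000 to 1,000,000 tokens (50 times the input) multiplies it by 2,500.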
The Retrieval Paradox
Here is where it gets interesting. The teams getting the best results from large context windows are the ones who still invest heavily in retrieval.
They use the expanded capacity strategically: loading the retrieved context alongside additional supporting material, not as a replacement for retrieval but as a complement to it. The core relevant content comes from targeted search. The extra tokens provide surrounding context that helps the model understand relationships and dependencies.
This is the opposite of the "just load everything" approach. It treats large context windows as headroom for better retrieval, not as an excuse to skip it.
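One way to picture "retrieval first, headroom second" is the sketch below. It is an assumption-laden illustration, not a description of any particular team's pipeline: the keyword-overlap `score` is a crude placeholder for embedding search, the word-level `tokenize` is a stand-in for a real tokenizer, and `build_context` with its `top_k`, `neighbors`, and `budget` parameters is hypothetical.

```python
import re

def tokenize(text):
    """Lowercase word tokens; a crude stand-in for a real tokenizer."""
    return re.findall(r"[a-z0-9_]+", text.lower())

def score(query, chunk):
    """Number of distinct query terms found in the chunk (placeholder for embedding similarity)."""
    return len(set(tokenize(query)) & set(tokenize(chunk)))

def build_context(query, chunks, top_k=2, neighbors=1, budget=200):
    """Select the best-matching chunks, then add adjacent chunks while the token budget allows."""
    ranked = sorted(range(len(chunks)), key=lambda i: score(query, chunks[i]), reverse=True)
    core = [i for i in ranked[:top_k] if score(query, chunks[i]) > 0]

    selected = []
    used = 0

    def try_add(i):
        nonlocal used
        if i in selected or not 0 <= i < len(chunks):
            return
        cost = len(tokenize(chunks[i]))
        if used + cost <= budget:
            selected.append(i)
            used += cost

    # Retrieved hits get first claim on the budget.
    for i in core:
        try_add(i)
    # Leftover budget goes to neighbouring chunks for surrounding context.
    for i in core:
        for offset in range(1, neighbors + 1):
            try_add(i - offset)
            try_add(i + offset)

    # Return chunks in their original order so the context reads naturally.
    return [chunks[i] for i in sorted(selected)]

chunks = [
    "def load_config(path): read settings from disk",
    "def connect(db_url): open a database connection",
    "def run_query(conn, sql): execute sql on the connection",
    "def render_report(rows): format rows as html",
    "def send_email(to, body): deliver the report",
]
query = "why does run_query fail on the connection"

if __name__ == "__main__":
    for chunk in build_context(query, chunks, top_k=1):
        print(chunk)
```

With a generous budget, the best match (`run_query`) arrives together with its two neighbours. With a tight budget, only the retrieved chunk survives. The extra capacity is spent around the retrieved core rather than in place of it.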
The pattern shows up repeatedly in production retrieval-augmented generation (RAG) systems. Teams that benchmark "full context" against "retrieval plus context" consistently find the latter wins on accuracy, latency, and cost. Some even apply context compression to squeeze more signal into fewer tokens. The constraint of retrieval is not a limitation; it is a feature.
What Actually Works
The practical implications are straightforward:
Smaller, focused context beats larger, unfocused context. A 10,000-token window with exactly the right code is more useful than a 500,000-token window with everything. The model's attention budget concentrates on what matters.
Retrieval quality matters more than context size. Investing in better embedding models, a smarter chunking strategy, and query understanding produces better results than expanding the context window. Token efficiency improves. Returns go up and costs go down.
Large context windows are useful for depth, not breadth. They shine when you need to include a full module alongside its dependencies, or when the model needs to understand a complex interaction between components. They fail when used as a substitute for knowing what to include.
Cost scales faster than value. The relationship between context size and output quality is not linear. Doubling the context does not double the usefulness. But with quadratic attention, it roughly quadruples the attention compute.
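The "depth, not breadth" point can be made concrete with a small sketch that gathers one module plus the local modules it depends on, and nothing else. This is a simplified illustration under stated assumptions: the `sources` dictionary and its module names are invented, only absolute imports of top-level module names are handled, and dotted packages or relative imports would need extra work. It uses Python's standard `ast` module to read imports without executing any code.

```python
import ast

def local_imports(source, known_modules):
    """Names of modules in known_modules that this source imports directly."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            names = [node.module]
        else:
            continue
        found.update(name for name in names if name in known_modules)
    return found

def dependency_closure(entry, sources):
    """The entry module plus every local module it imports, directly or transitively."""
    seen = []
    stack = [entry]
    while stack:
        name = stack.pop()
        if name in seen:
            continue
        seen.append(name)
        stack.extend(sorted(local_imports(sources[name], sources)))
    return seen

# Hypothetical mini-codebase: module name -> source text.
sources = {
    "billing": "import invoices\nfrom tax import rate\n",
    "invoices": "import storage\n",
    "tax": "import math\n",
    "storage": "",
    "emails": "import storage\n",
}

if __name__ == "__main__":
    print(sorted(dependency_closure("billing", sources)))
    # ['billing', 'invoices', 'storage', 'tax']
```

A question about `billing` gets `billing` and the three modules it actually reaches. The unrelated `emails` module stays out of the prompt even though it would fit.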
The Uncomfortable Truth
More context is not better context.
The teams getting the most out of modern LLMs are not the ones loading everything into million-token windows. They are the ones who understand that relevance is a problem retrieval solves and attention does not.
Large context windows are a tool, not a strategy. They expand what is possible without changing what is effective. The fundamentals have not shifted: know what matters, provide what is needed, skip what is not.
Most teams will ignore this. They will load everything, pay for everything, and wonder why results got worse when the context window got bigger.
The model does not understand your codebase better because it can see more of it. It understands your codebase better when it sees the right parts of it.
That distinction is the whole game.