Token Waste Is a Solvable Problem

Token costs dropped 98% in two years. Models that cost $20 per million tokens in late 2022 now run under $0.50. The economics should be getting easier.

They are not. Somehow, the bills keep climbing. Token usage is up across the board, and most teams have no idea where the budget is actually going.

Costs Dropped, Consumption Exploded

Cheaper tokens changed how developers use AI tools. What started as occasional code completion became constant interaction. Agentic workflows emerged: tools that call models repeatedly and autonomously, chaining prompts to accomplish multi-step tasks. Reasoning models introduced "thinking tokens," tokens the model spends on an internal chain of thought before it responds.

The price per token fell. The tokens per task climbed faster. It is like getting a cheaper gym membership and then hiring a personal trainer who bills by the hour.

Teams that set a token budget based on early usage patterns are finding their estimates obsolete. A workflow that cost $50/month a year ago might cost $500/month today—not because rates increased, but because the tools got hungrier. More capable, yes. Also more expensive to feed.
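The arithmetic behind that jump is worth seeing once. Here is a minimal sketch; the function name and the per-task token counts are hypothetical, chosen only so the totals line up with the $50 and $500 figures above.

```python
def monthly_cost(price_per_million: float, tokens_per_task: int, tasks_per_month: int) -> float:
    """Monthly spend in dollars for one workflow."""
    total_tokens = tokens_per_task * tasks_per_month
    return price_per_million * total_tokens / 1_000_000


# Hypothetical workloads: same number of tasks, very different appetite per task.
before = monthly_cost(price_per_million=20.00, tokens_per_task=2_500, tasks_per_month=1_000)
after = monthly_cost(price_per_million=0.50, tokens_per_task=1_000_000, tasks_per_month=1_000)

print(f"before: ${before:.2f}/month")
print(f"after:  ${after:.2f}/month")
```

In this made-up case the rate fell 40x while tokens per task rose 400x, so the bill still grew tenfold.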

The Predictability Problem

The frustration is not just about cost. It is about not knowing what the cost will be.

A developer spinning up an AI-assisted workflow cannot easily predict what it will cost to run. Token consumption varies with prompt length, response complexity, how many retries the model needs, and whether the task triggers agentic loops. The same query might cost $0.02 one day and $0.40 the next. Same question, different mood from the model, twenty times the bill.
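One way to see how the same question lands on two different price tags is to count every model call behind a request, not just the one the developer typed. The sketch below is illustrative only: the prices, the `ModelCall` type, and the token counts are assumptions, not any provider's real rates or API.

```python
from dataclasses import dataclass

# Hypothetical prices in dollars per million tokens; substitute your provider's rates.
INPUT_PRICE = 0.50
OUTPUT_PRICE = 2.00


@dataclass
class ModelCall:
    input_tokens: int
    output_tokens: int


def request_cost(calls: list[ModelCall]) -> float:
    """Dollar cost of one user-visible request, summed over every model call behind it."""
    total_in = sum(c.input_tokens for c in calls)
    total_out = sum(c.output_tokens for c in calls)
    return (total_in * INPUT_PRICE + total_out * OUTPUT_PRICE) / 1_000_000


# The same question answered in one call versus a 20-step agentic loop.
single = [ModelCall(20_000, 5_000)]
agentic = [ModelCall(20_000, 5_000)] * 20

print(f"single call:  ${request_cost(single):.2f}")
print(f"agentic loop: ${request_cost(agentic):.2f}")
```

Under these assumed numbers the single call comes to about $0.02 and the loop to about $0.40. Nothing about the question changed, only the number of calls made on its behalf.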

Budgeting becomes guesswork. Teams either over-provision and waste money, or under-provision and hit rate limits mid-project. Neither feels good.

The secondary cost is hesitation. Developers start second-guessing whether to use AI assistance at all, afraid that iterating will burn through their allocation. The tool that is supposed to accelerate work becomes something to ration. A tool people avoid using is not a tool. It is overhead with good marketing.

Where Tokens Go

Most token waste hides. Three places, usually:

Bloated context. Files included "just in case." Verbose system instructions. Conversation history that stopped being relevant ten messages ago. All of it counts. A 50K-token context costs the same whether those tokens are useful or noise. Most teams never check the actual token count; they just see the bill.

Failed iterations. Model returns something wrong, developer tries again with a longer prompt. More examples, additional context. Each retry burns tokens. If the underlying issue is retrieval quality or prompt structure, more tokens will not fix it. They will just make the failure more expensive.

Invisible retries. Agentic tools often retry internally when a step fails. The developer sees one request; the system makes five. Sometimes necessary. Also invisible, untracked, and on the bill.
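Checking where a context's tokens sit does not require much. The sketch below breaks a prompt into named parts and estimates each part's share. Everything in it is an assumption for illustration: the part names, the sample text, and especially the four-characters-per-token rule of thumb, which is a crude stand-in for the model's real tokenizer.

```python
def estimate_tokens(text: str) -> int:
    """Very rough estimate: about 4 characters per token (a heuristic, not a tokenizer)."""
    return max(1, len(text) // 4) if text else 0


def context_breakdown(parts: dict[str, str]) -> list[tuple[str, int, float]]:
    """Return (name, estimated_tokens, share_of_total) rows, largest first."""
    counts = {name: estimate_tokens(text) for name, text in parts.items()}
    total = sum(counts.values()) or 1  # avoid dividing by zero on an empty context
    rows = [(name, n, n / total) for name, n in counts.items()]
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Hypothetical context assembled for a single question.
context = {
    "system_prompt": "You are a careful assistant. " * 50,
    "attached_files": "def handler(event): ...\n" * 2_000,
    "history": "user: try again\nassistant: sure\n" * 300,
    "question": "Why does the login test fail?",
}

for name, tokens, share in context_breakdown(context):
    print(f"{name:15s} {tokens:>7d} tokens  {share:5.1%}")
```

In this invented example the attached files and stale history dwarf the question itself, which is the usual shape of a bloated context.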

Making It Manageable

Token waste becomes solvable once it becomes measurable.

The first step is visibility: knowing where tokens are actually going. Not just total spend, but spend by task type, by workflow, by time of day. Patterns emerge. That one query type that seemed harmless is actually consuming 40% of the budget. The nightly batch job is retrying endlessly because of a malformed prompt.
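A usage ledger can start as a list of records and one grouping function. The sketch below is a minimal version under stated assumptions: the `UsageEvent` fields, the task labels, and the token counts are all hypothetical, and in practice the counts would come from whatever usage data your provider returns with each response.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class UsageEvent:
    task_type: str          # hypothetical label, e.g. "doc_lookup"
    workflow: str           # hypothetical label, e.g. "editor" or "batch"
    hour: int               # hour of day, 0-23
    tokens: int             # input plus output tokens for the call
    is_retry: bool = False  # True when the call repeats a failed step


def spend_by(events: list[UsageEvent], field: str) -> dict[str, float]:
    """Share of total tokens per value of `field`, largest first."""
    totals: dict[str, int] = defaultdict(int)
    for event in events:
        totals[str(getattr(event, field))] += event.tokens
    grand_total = sum(totals.values()) or 1
    ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
    return {key: tokens / grand_total for key, tokens in ranked}


events = [
    UsageEvent("autocomplete", "editor", 10, 30_000),
    UsageEvent("doc_lookup", "editor", 11, 40_000),
    UsageEvent("nightly_summary", "batch", 2, 10_000),
    UsageEvent("nightly_summary", "batch", 2, 10_000, is_retry=True),
    UsageEvent("nightly_summary", "batch", 2, 10_000, is_retry=True),
]

shares = spend_by(events, "task_type")
retry_tokens = sum(e.tokens for e in events if e.is_retry)

for task_type, share in shares.items():
    print(f"{task_type:16s} {share:.0%}")
print(f"tokens spent on retries: {retry_tokens}")
```

The same function groups by `workflow` or `hour`, and the `is_retry` flag is what makes otherwise invisible retries show up in the totals.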

The second step is thresholds: setting a per-task token threshold that flags runaway consumption as it starts. Not hard caps that break workflows, but alerts that surface anomalies. If a task typically uses 10K tokens and suddenly requests 100K, something changed. Catching that early is the difference between a minor investigation and a billing surprise.
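An alert of this kind can be a comparison against the task's own history. The sketch below is one possible shape, not a recommended policy: the function name, the use of the median as "typical," and the 3x factor are all assumptions to tune against real data.

```python
from statistics import median
from typing import Optional


def check_usage(history: list[int], requested: int, factor: float = 3.0) -> Optional[str]:
    """Return a warning when `requested` exceeds `factor` times typical usage, else None.

    This only reports; it never blocks the request.
    """
    if not history:
        return None  # no baseline yet, so nothing to compare against
    typical = median(history)
    if requested > factor * typical:
        return f"anomaly: {requested} tokens requested, typical is {typical:.0f}"
    return None


# Hypothetical history for a task that usually uses about 10K tokens.
recent_runs = [9_500, 10_000, 10_200, 11_000, 9_800]

print(check_usage(recent_runs, 12_000))
print(check_usage(recent_runs, 100_000))
```

Using the median rather than the mean keeps one earlier runaway from quietly raising the baseline.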

The third step is token optimization: once the waste is visible, reducing it becomes straightforward. Tighter context. Better retrieval. Context compression that strips irrelevant information before it reaches the model. Prompt structures that reduce retries. None of this is exotic—it is just invisible until it is measured.
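As one concrete instance of "tighter context," the sketch below drops the oldest conversation history until what remains fits a token budget. It is deliberately the simplest option, recency-based trimming rather than real compression or retrieval, and it reuses an assumed four-characters-per-token estimate in place of a tokenizer.

```python
def estimate_tokens(text: str) -> int:
    """Rough estimate: about 4 characters per token (a heuristic, not a tokenizer)."""
    return len(text) // 4 + 1


def trim_history(messages: list[str], budget: int) -> list[str]:
    """Keep the most recent messages whose estimated tokens fit within `budget`."""
    kept: list[str] = []
    used = 0
    for message in reversed(messages):  # walk from newest to oldest
        cost = estimate_tokens(message)
        if used + cost > budget:
            break  # everything older than this is dropped
        kept.append(message)
        used += cost
    kept.reverse()  # restore chronological order
    return kept


# Hypothetical history: three messages of roughly 100 tokens each.
history = ["a" * 400, "b" * 400, "c" * 400]
trimmed = trim_history(history, budget=250)
print(len(trimmed), "of", len(history), "messages kept")
```

Trimming by recency assumes older messages matter least, which is not always true. A system that needs earlier decisions preserved would summarize or retrieve instead.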

The Compound Effect

Small inefficiencies compound. A 20% reduction in tokens per task does not just save 20% on the bill. It also tends to mean faster responses (less to process), better output quality (less noise for the model to sift through), and more headroom for the tasks that actually need larger context.

Teams that treat token efficiency as an optimization target—not an afterthought—end up with better economics and better results. The two are not in tension.

The Point

Token costs are not inherently unpredictable. They are unpredictable when token usage is invisible and unmanaged. Add measurement, add thresholds, add intentional optimization, and the problem shrinks.

The tools exist. The techniques are known. The question is whether token waste is treated as a problem worth solving or just a cost of doing business.

The teams that treat it as worth solving tend to have both lower bills and better output. Those two results are connected.

Third in a series on context management for AI-assisted development.