
The Hidden Cost of Context Windows in Production Systems

February 12, 2025 · 6 min read

When you scale from prototype to production, the economics of context change completely. Here’s what nobody tells you.

The Prototype Trap

Building a prototype with an LLM is deceptively cheap. You make a few API calls, the responses are great, and you ship something that impresses people. Then you try to scale it.

The problem is that token costs don't scale linearly with usage. If you naively resend the full conversation on every call, total token spend over a session grows with the square of its length. Every user interaction extends the context. Every tool call adds tokens. Every piece of retrieved information costs.

What Actually Costs Money

In production, the expensive parts are almost never what you think:

System prompts are surprisingly cheap. A 2000-token system prompt on a million API calls is 2 billion tokens — significant, but predictable and easy to optimize.

Conversation history is the silent killer. If you're naively appending every turn to the context, then at 500 tokens per turn the 20th request carries ten times the history of the 2nd, and the session's cumulative token spend grows quadratically. Your oldest users become your most expensive users.
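A quick back-of-envelope sketch of that growth (the 500-tokens-per-turn figure is illustrative):

```python
def cumulative_tokens(turns: int, tokens_per_turn: int = 500) -> int:
    """Total tokens sent across a session when every call resends the full history."""
    # Call i carries i turns of context, so the total is the triangular sum 1 + 2 + ... + turns.
    return tokens_per_turn * turns * (turns + 1) // 2

print(cumulative_tokens(2))   # → 1500
print(cumulative_tokens(20))  # → 105000, roughly 70x the short session's total
```

Doubling session length roughly quadruples total spend, which is why long-running conversations dominate the bill.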

Tool results are underestimated. A single search that returns 5 results at 300 tokens each adds 1500 tokens. A research agent that calls 10 tools per task is burning 15,000 tokens just in tool outputs — before the model has said a word.

Strategies That Actually Work

Conversation summarization at checkpoints: Don’t keep every turn in context. After N turns, summarize the conversation and replace the history with the summary. Users rarely need the model to remember exact phrasing from 10 turns ago.
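A minimal sketch of checkpoint summarization. The `summarize` callable is a hypothetical hook that asks the model for a concise summary; the threshold and the number of verbatim turns are tuning knobs, not fixed values:

```python
MAX_TURNS = 8  # checkpoint interval; tune per application

def compact_history(history: list[dict], summarize) -> list[dict]:
    """Replace older turns with a single summary message once history grows past MAX_TURNS.

    `summarize` is a hypothetical callable that returns a concise summary
    of the given turns; the last few turns are kept word-for-word.
    """
    if len(history) <= MAX_TURNS:
        return history
    keep = history[-3:]                # recent turns stay verbatim
    summary = summarize(history[:-3])  # everything older gets compressed
    return [{"role": "system", "content": f"Conversation so far: {summary}"}] + keep
```

Run this before each API call and the context stays bounded at a summary plus a handful of turns, regardless of how long the conversation runs.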

Selective tool output inclusion: Most tool outputs contain more information than the model needs. Pre-process tool results to extract only the relevant portions before adding them to context.
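One way to sketch this pre-processing step, assuming search results arrive as dicts with a `snippet` field (a naive keyword filter; a production system might score with embeddings instead):

```python
def trim_search_results(results: list[dict], query_terms: set[str],
                        max_results: int = 3) -> list[dict]:
    """Keep only results whose snippet mentions a query term, capped at max_results.

    Even simple filtering like this cuts tool-output tokens sharply
    before the results ever reach the model's context.
    """
    relevant = [r for r in results
                if any(t in r["snippet"].lower() for t in query_terms)]
    return relevant[:max_results]
```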

Tiered context management: Maintain three types of context with different retention policies: immediate (last 3 turns), session (summarized recent history), and long-term (key facts extracted to a retrieval system).
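The three tiers can be sketched with a small class. The eviction rule here (oldest verbatim turn rolls into the session summary as raw text) is a simplification; in practice you would summarize with the model and push extracted facts into a retrieval store:

```python
from collections import deque
from dataclasses import dataclass, field

@dataclass
class TieredContext:
    """Three retention tiers: verbatim recent turns, a rolling session
    summary, and long-term facts destined for a retrieval system."""
    immediate: deque = field(default_factory=lambda: deque(maxlen=3))
    session_summary: str = ""
    long_term_facts: list[str] = field(default_factory=list)

    def add_turn(self, turn: str) -> None:
        if len(self.immediate) == self.immediate.maxlen:
            # Oldest verbatim turn is about to be evicted; fold it into the summary.
            self.session_summary += " " + self.immediate[0]
        self.immediate.append(turn)

    def render(self) -> str:
        """Assemble the context actually sent to the model."""
        parts = []
        if self.session_summary:
            parts.append("Summary: " + self.session_summary.strip())
        parts.extend(self.immediate)
        return "\n".join(parts)
```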

Prompt caching: If your API provider supports prompt caching, structure your prompts so the cached portion is as long as possible. System prompts, examples, and tool definitions are good candidates.
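Prompt caches generally match on an exact prefix (the precise rules are provider-specific), so the structural rule is: everything identical across requests goes first, everything that varies goes last. A sketch of that ordering:

```python
def build_prompt(system: str, tools: str, examples: str,
                 history: str, user_msg: str) -> list[str]:
    """Order prompt segments so the stable parts form one contiguous cacheable prefix.

    Anything that changes per request (history, the new message) must come
    after the prefix, or it invalidates the cache for everything behind it.
    """
    stable = [system, tools, examples]  # identical across requests → cacheable
    dynamic = [history, user_msg]       # varies per request → after the prefix
    return stable + dynamic
```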

The Economics Reframe

Stop thinking about token costs per API call. Start thinking about token costs per user session, per day, per month.

Model the distribution of session lengths in your system. Optimize for the long tail — the users who have 50-turn conversations are both your best users and your most expensive ones.
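A sketch of that modeling step, under the naive-appending cost model from earlier. The per-turn token count and the $3 per million input tokens price are illustrative assumptions, not any provider's actual rate:

```python
def session_cost(turns: int, tokens_per_turn: int = 500,
                 price_per_mtok: float = 3.0) -> float:
    """Dollar cost of one session when every call resends the full history."""
    total_tokens = tokens_per_turn * turns * (turns + 1) // 2
    return total_tokens / 1_000_000 * price_per_mtok

def expected_cost(length_histogram: dict[int, int]) -> float:
    """Total cost over a population of sessions, keyed by session length → count."""
    return sum(count * session_cost(turns)
               for turns, count in length_histogram.items())
```

Feeding in a measured histogram of session lengths makes the long tail visible immediately: a handful of 50-turn sessions can outweigh thousands of short ones.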

The teams that scale successfully don’t use cheaper models. They use smarter context management.
