The prototype-to-production jump is where most AI teams hit their first real wall. The system works great in development. Then the bill arrives.
The Core Problem
Token costs don’t scale linearly with usage. Conversation history grows with every turn. Tool outputs add tokens you didn’t budget for. The users who love your product most are the ones having the longest sessions — which makes them your most expensive users.
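To make the superlinear growth concrete, here is a back-of-the-envelope model. The prices and token counts below are illustrative assumptions, not real quotes; the point is only that when each turn resends the full history, billed input tokens grow roughly quadratically with session length.

```python
def session_cost(turns, tokens_per_turn, out_tokens_per_turn,
                 price_in_per_mtok=3.0, price_out_per_mtok=15.0):
    """Total cost of one session when the full history is resent each turn.

    Prices are illustrative placeholders in USD per million tokens.
    """
    cost = 0.0
    for t in range(1, turns + 1):
        input_tokens = tokens_per_turn * t  # history grows every turn
        cost += input_tokens / 1e6 * price_in_per_mtok
        cost += out_tokens_per_turn / 1e6 * price_out_per_mtok
    return cost

# A 20-turn session costs far more than 2x a 10-turn one,
# because input tokens accumulate turn over turn.
```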
Three Things That Actually Help
Conversation summarization at checkpoints: After every N turns, summarize the conversation so far and replace the history with the summary. Most users don’t need the model to remember exact phrasing from 10 turns ago.
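A minimal sketch of the checkpoint pattern. Here `summarize` is a stand-in for an actual model call, not a library API; a real implementation would prompt the model to compress the history into a short recap.

```python
def summarize(messages):
    # Stand-in: in practice this would be a model call that compresses
    # the history into a short recap message.
    return {"role": "system",
            "content": f"[Summary of {len(messages)} earlier messages]"}

def append_turn(history, message, checkpoint_every=10):
    """Append a message; collapse the history to a summary every N messages."""
    history.append(message)
    if len(history) >= checkpoint_every:
        # Replace the full history with a single summary message.
        history[:] = [summarize(history)]
    return history
```

The tradeoff is deliberate: you give up verbatim recall of old turns in exchange for a bounded context size.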
Tool output truncation: Pre-process tool results before adding them to context. A web search that returns 5000 tokens of text usually contains 200 tokens of actually relevant information.
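One cheap pre-processing pass might look like the sketch below. Both the ~4-characters-per-token ratio and the keyword-overlap scoring are rough assumptions; a real pipeline might score relevance with an embedding model and count tokens with the provider's tokenizer.

```python
def truncate_tool_output(text, query, max_tokens=200, chars_per_token=4):
    """Keep the lines most relevant to the query, within a token budget."""
    budget = max_tokens * chars_per_token
    query_words = set(query.lower().split())
    lines = [l for l in text.splitlines() if l.strip()]
    # Rank lines by naive keyword overlap with the query.
    scored = sorted(lines,
                    key=lambda l: len(query_words & set(l.lower().split())),
                    reverse=True)
    kept, used = [], 0
    for line in scored:
        if used + len(line) > budget:
            break
        kept.append(line)
        used += len(line)
    return "\n".join(kept)
```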
Prompt caching: If your provider supports it (Anthropic and OpenAI both do now), structure prompts so the cacheable prefix is as long as possible. Caching matches from the start of the prompt, so stable content should come first and per-request content last. System prompts and few-shot examples are the obvious targets.
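A sketch of how a request might be structured so the stable prefix is cacheable. The field shapes follow Anthropic's prompt-caching convention (a `cache_control` breakpoint on the system block); verify the exact names against the current API docs, and note the model id here is a placeholder.

```python
def build_request(system_prompt, few_shot_examples, user_message):
    """Put stable content first and mark it as the cacheable prefix."""
    return {
        "model": "claude-example",  # placeholder model id
        "max_tokens": 1024,
        "system": [
            {
                "type": "text",
                # The system prompt and few-shot examples are identical
                # across requests, so they form one long cacheable prefix.
                "text": system_prompt + "\n\n" + few_shot_examples,
                "cache_control": {"type": "ephemeral"},
            }
        ],
        # Only this part varies per request, so cache hits stay high.
        "messages": [{"role": "user", "content": user_message}],
    }
```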
The economics reframe that changes how you think: stop optimizing for cost per API call. Optimize for cost per session and cost per user per month. Those are the numbers that matter for your P&L.
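Concretely, that means aggregating call-level costs up to the session level before deciding what to optimize. A minimal sketch, assuming you log a cost for each API call keyed by session id (the log format and figures are hypothetical):

```python
from collections import defaultdict

def rollup(call_log):
    """Aggregate per-call costs into per-session totals.

    call_log: iterable of (session_id, cost_usd) tuples.
    """
    per_session = defaultdict(float)
    for session_id, cost in call_log:
        per_session[session_id] += cost
    return dict(per_session)

# Per-call costs all look tiny; the session view shows where
# the money actually goes (long sessions, climbing input costs).
calls = [("s1", 0.002), ("s1", 0.005), ("s1", 0.011),
         ("s2", 0.002)]
```

Divide session totals by active users per month and you have the number that belongs in your P&L.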