Enterprise AI's question has shifted from "how do we build agentic frameworks?" to "how do we run them without the token economics killing the business case?" This article breaks down exactly where every token goes, layer by layer, and what those numbers look like when you're running AI at scale.
If you've ever built something with AI for personal projects (a research workflow, a coding assistant, a personal automation), you know the wall. Mid-build, mid-thought, sometimes mid-sentence: "You've reached your usage limit. Try again in 5 hours." For some plans it's 2 hours. For heavy sessions, a full day. For the most intensive runs, you're waiting until the next weekly allocation resets. For a personal project, that's an inconvenience you work around.
In an enterprise setting, the problem is different in kind, and orders of magnitude more costly. There's no waiting it out. Thousands of users, dozens of concurrent agentic workflows, agents calling sub-agents, each step re-sending its entire conversation history. Tokens don't just accumulate; they compound. A 5-step agent loop consumes 22× what a direct answer costs. A multi-agent pipeline: 64×. Scale those multipliers across an organization and you're not looking at a quota inconvenience; you're looking at an architecture problem that grows with every new deployment until it threatens the ROI of the entire AI program.
"A lot of the unusual things about LLMs that you might find surprising actually trace back to tokenization."
Andrej Karpathy · Let's Build the GPT Tokenizer (Feb 2024)
Context cutoffs, arithmetic failures, "lost in the middle" — most aren't model failures. They're token failures. Token optimization starts at the unit level.
"The resources used to train the model can be repurposed to run millions of instances of it — and the model can absorb information and generate actions at roughly 10x–100x human speed."
Dario Amodei · Machines of Loving Grace (Oct 2024)
The bottleneck is no longer getting a model to work. It's running millions of instances economically. Token efficiency is the infrastructure problem at that scale.
"GPU computing is increasingly turning into a commodity, metered per hour."
Sequoia Capital · AI's $600B Question (2024)
When compute is a commodity, the moat shifts to whoever uses it most efficiently — not the team with the most GPUs, but the team that extracts the most value per token spent.
"The structure of AI pricing demands that users take care to use as few tokens as possible — that's the basic unit of cost."
Scale AI · Text Is the Universal Interface (2021, still true)
Written when GPT-3 launched. The observation has only grown more urgent as agentic systems multiplied token consumption 20–64×. Prompt engineering is now cost engineering.
The unit behind all of it is the token. Understanding where they accumulate, from a single API call to a full orchestrated pipeline, is the difference between AI infrastructure that scales and one that doesn't. This guide breaks it down live, layer by layer, with context from the researchers and investors already watching this shift closely.
What is a token?
A token is roughly 0.75 words, the basic unit LLMs use to process text. "The quick brown fox" is 4 tokens. Every API call has two sides: tokens in (your prompt) and tokens out (the model's response). You pay for both, and output tokens typically cost 3–5× more than input tokens. That asymmetry matters a lot once you see how much of an agentic loop is re-sent input context.
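To make the input/output asymmetry concrete, here is a minimal sketch of per-call cost. The 0.75 words-per-token ratio is the rule of thumb above; the two prices are hypothetical placeholders (output set to 4× input, inside the 3–5× range), not any provider's actual rates, and the function names are illustrative.

```python
# Rough per-call cost estimate. Prices are hypothetical placeholders.
WORDS_PER_TOKEN = 0.75           # rule of thumb from the text
INPUT_PRICE_PER_MTOK = 2.50      # assumed USD per million input tokens
OUTPUT_PRICE_PER_MTOK = 10.00    # assumed USD per million output tokens (4x input)


def estimate_tokens_from_words(word_count: int) -> int:
    """Approximate token count from a word count; real tokenizers will differ."""
    return round(word_count / WORDS_PER_TOKEN)


def call_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of one call: both sides are billed, at different rates."""
    return (input_tokens * INPUT_PRICE_PER_MTOK
            + output_tokens * OUTPUT_PRICE_PER_MTOK) / 1_000_000


if __name__ == "__main__":
    print(estimate_tokens_from_words(750))      # 1000
    # 250 tokens in (system + user), 300 tokens out
    print(f"${call_cost(250, 300):.6f}")
```

At these assumed prices the 300 output tokens account for most of the bill, even though input and output are similar in size.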
Level 1: A single model call
The simplest case. You send a message, you get a response. But even here, there's more going on than most people realize.
Single API call — token flow
system 200 tok
You are a helpful assistant. Answer questions accurately and concisely. If you're unsure, say so rather than guessing.
user 50 tok
What's the difference between TCP and UDP? When would you choose one over the other?
assistant 300 tok out
TCP is connection-oriented — it establishes a handshake, guarantees delivery, and orders packets. UDP is connectionless: faster but no guarantees packets arrive or arrive in order. Use TCP where reliability matters (HTTP, databases, file transfers). Use UDP where speed beats reliability and you can tolerate some loss (live video, DNS lookups, online games).
The thing people often miss: every single call includes your system prompt, even when you're just asking a follow-up question. If your system prompt is 500 tokens and you send 50 conversational messages, that's 25,000 tokens in system prompt overhead alone, before the user has typed anything substantial.
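The overhead arithmetic is easy to check. A minimal sketch, with illustrative function names; the 50-token average message size is an assumption borrowed from the example call above.

```python
def system_prompt_overhead(prompt_tokens: int, calls: int) -> int:
    """Tokens spent re-sending the same system prompt on every call."""
    return prompt_tokens * calls


def overhead_share(prompt_tokens: int, avg_message_tokens: int) -> float:
    """Fraction of a call's fresh input (prompt + new message) that is system prompt."""
    return prompt_tokens / (prompt_tokens + avg_message_tokens)


if __name__ == "__main__":
    print(system_prompt_overhead(500, 50))   # 25000
    # Trimming the prompt from 500 to 200 tokens over the same 50 messages
    print(system_prompt_overhead(500, 50) - system_prompt_overhead(200, 50))  # 15000
    print(f"{overhead_share(500, 50):.0%}")
```

With short user messages, the system prompt dominates the fresh input on each call, which is why trimming it pays off on every turn.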
Key insight
The system prompt is an invisible tax on every call. 500 tokens × 10,000 daily users = 5 million tokens per day in system prompt overhead, even if each user makes only one call. Write it tight.
Level 2: RAG, when the model looks things up
Retrieval-Augmented Generation (RAG) adds a retrieval step before the LLM call. When you ask a question, the system first searches a knowledge base, pulls relevant chunks, and adds them to the prompt before calling the model. The result is a much richer, more accurate answer. The cost is that your input token count inflates substantially.
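The retrieve-then-prompt flow can be sketched in a few lines. This is a toy illustration only: the word-overlap scoring stands in for a real vector search, and `score`, `retrieve`, and `build_prompt` are hypothetical names, not part of any library.

```python
def score(question: str, chunk: str) -> int:
    """Toy relevance score: number of distinct question words found in the chunk."""
    q_words = set(question.lower().split())
    c_words = set(chunk.lower().split())
    return len(q_words & c_words)


def retrieve(question: str, knowledge_base: list[str], k: int) -> list[str]:
    """Return the k highest-scoring chunks (ties keep knowledge-base order)."""
    ranked = sorted(knowledge_base, key=lambda c: score(question, c), reverse=True)
    return ranked[:k]


def build_prompt(system: str, question: str, chunks: list[str]) -> str:
    """Retrieved chunks are pasted into the prompt, so they are billed as input."""
    context = "\n\n".join(f"[chunk {i + 1}] {c}" for i, c in enumerate(chunks))
    return f"{system}\n\nContext:\n{context}\n\nQuestion: {question}"


if __name__ == "__main__":
    kb = [
        "chunk size matters for retrieval precision",
        "bananas are yellow",
        "optimal chunk size depends on the corpus",
    ]
    question = "what is the optimal chunk size"
    top = retrieve(question, kb, k=2)
    print(build_prompt("Use only the context below.", question, top))
```

Everything `build_prompt` adds to the string is input the model has to read, which is where the token inflation comes from.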
RAG call — retrieval inflates the input
user 50 tok
What's the optimal chunk size for a production RAG system?
system 200 tok
You are a technical assistant. Use only the retrieved context below to answer. Cite sources when possible.
retrieved · chunk 1/3 · rag-handbook.md 800 tok
"Chunk size is one of the most impactful hyperparameters in RAG. In production, 512–1024 token chunks typically balance retrieval precision with enough context for the model to answer well. Very small chunks (<128 tok) improve recall precision but strip surrounding context the LLM needs. Very large chunks (>2048 tok) dilute relevance scores and cost more per query..."
retrieved · chunk 2/3 · benchmarks-2024.pdf 1,100 tok
"Across 47 production RAG deployments in this benchmark, median chunk size was 800 tokens. Retrieval recall peaked at 512-token chunks (0.87 recall). Answer quality, however, peaked at 512–768 tokens — suggesting an optimal zone that balances both metrics. Systems below 256 tokens showed 23% drop in answer coherence..."
retrieved · chunk 3/3 · design-patterns.md 900 tok
"For domain-specific corpora (legal, medical, code), overlapping chunks of 256–512 tokens with 20% overlap typically outperform larger chunks. For general knowledge bases, 512–1024 tokens is the practical sweet spot. Rule of thumb: set k based on your context budget — 3 chunks × 900 tok = 2,700 tok of context before system prompt overhead..."
assistant 400 tok out
For production RAG, 512–768 tokens per chunk is the practical sweet spot — large enough to retain meaningful context, small enough to keep retrieval precise. Retrieve 3–5 chunks by default. For structured documents (tables, code), use smaller chunks with more overlap. Avoid chunks >1,500 tokens; retrieval quality degrades and cost per query climbs fast.
2,800 of the 3,050 input tokens (200 system + 50 user + 2,800 retrieved) are retrieved context, before the model has said a word. Chunk count (k) and chunk size are your main cost levers. Anthropic's contextual retrieval research found that top-20 chunks beat top-5 on quality, but at proportional token cost: more context always trades against spend. (Source: Anthropic)
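To see how the two levers move the bill, here is a small sketch. It assumes the 200-token system prompt and 50-token question from the example, plus a uniform 900 tokens per chunk (the rule-of-thumb figure quoted in the third retrieved chunk); the function names are illustrative.

```python
def rag_input_tokens(system: int, user: int, k: int, chunk_tokens: int) -> int:
    """Input tokens for one RAG call with k retrieved chunks of equal size."""
    return system + user + k * chunk_tokens


def retrieved_share(system: int, user: int, k: int, chunk_tokens: int) -> float:
    """Fraction of the input that is retrieved context."""
    return (k * chunk_tokens) / rag_input_tokens(system, user, k, chunk_tokens)


if __name__ == "__main__":
    for k in (3, 5, 20):
        total = rag_input_tokens(200, 50, k, 900)
        share = retrieved_share(200, 50, k, 900)
        print(f"k={k:>2}: {total:>6,} input tokens, {share:.0%} retrieved")
```

Going from top-5 to top-20 under these assumptions raises input from 4,750 to 18,250 tokens per query, so any quality gain has to be weighed against roughly 4× the input spend.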
Level 3: Agentic loops, where tokens start compounding
An AI agent doesn't answer once; it plans, calls tools, reads results, and loops. What makes this expensive isn't the number of tool calls. It's that every step sends the entire conversation history again. LLMs have no memory between calls, so step 3 re-pays for steps 1 and 2, step 5 re-pays for all four before it. The context grows like a snowball, and you pay for every flake on every roll.
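The mechanism can be simulated without calling any model. In this sketch no real API is involved: `count_tokens` is a crude one-token-per-word stand-in for a tokenizer, and `run_agent` is a hypothetical loop that only records how much history would be sent on each call.

```python
def count_tokens(text: str) -> int:
    """Crude stand-in for a tokenizer: one token per whitespace-separated word."""
    return len(text.split())


def run_agent(system: str, task: str, tool_results: list[str]) -> list[int]:
    """Simulate a loop in which every call re-sends the full history.

    Returns the number of input tokens sent on each call.
    """
    history = [system, task]
    sent_per_call = []
    for result in tool_results:
        # The model is stateless, so the whole history goes out again.
        sent_per_call.append(sum(count_tokens(m) for m in history))
        # This step's tool result becomes part of every later call.
        history.append(result)
    return sent_per_call


if __name__ == "__main__":
    sent = run_agent(
        system="You are a research agent",
        task="Compare three open models",
        tool_results=["result one " * 20, "result two " * 40, "result three " * 60],
    )
    print(sent, "total:", sum(sent))
```

The per-call numbers only ever go up, because nothing is removed from `history` between calls.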
Agentic loop — context re-sent every step
Step 1 Plan 550 tok · call #1
system 200 tok
You are a research agent. Plan and execute multi-step research tasks using available tools. Break complex queries into focused subtasks.
user 21 tok
Research the top 3 open-source LLMs released in 2025 and compare their capabilities.
tool_call 18 tok
web_search({ "query": "top open-source LLM releases 2025 benchmark comparison" })
tool_result 311 tok
{ "results": ["Llama 3.3 70B released Jan 2025, tops instruction-following benchmarks...", "Qwen 2.5-72B achieves SOTA on coding tasks...", "Mistral Small 3.1 24B best efficiency-to-performance..."] }
Step 2 Fetch detail 1,200 tok · call #2
Step 1 re-sent in full (system + user + tool call + result): 550 tok re-sent
tool_call 23 tok
fetch_page({ "url": "huggingface.co/meta-llama/Llama-3.3-70B", "section": "model_card" })
tool_result 627 tok
{ "model": "Llama-3.3-70B", "context_length": 128000, "benchmarks": { "MMLU": 86.0, "HumanEval": 88.4, "GSM8K": 95.1 }, "license": "Llama Community License", "notes": "Best-in-class at 70B for instruction following and multilingual tasks..." }
Step 3 Continue research 2,100 tok · call #3
Steps 1–2 re-sent in full: 1,200 tok re-sent
tool_call 30 tok
fetch_page({ "url": "qwen2.5-72b-benchmark-report", "section": "results_summary" })
tool_result 870 tok
{ "model": "Qwen2.5-72B-Instruct", "benchmarks": { "MMLU": 87.2, "HumanEval": 92.1, "MATH": 83.1 }, "strengths": "coding, multilingual, long-context (128k)", "license": "Apache 2.0", "notes": "Strongest coding performance at 72B, competitive with GPT-4o on HumanEval..." }
Step 4 Synthesize 3,300 tok · call #4
Steps 1–3 re-sent in full: 2,100 tok re-sent
assistant (drafting) 1,200 tok
Based on retrieved data, here's a draft comparison. Llama 3.3 70B leads in instruction-following at 70B scale. Qwen 2.5 72B leads on coding and multilingual tasks. Mistral Small 3.1 24B is the efficiency winner. Let me refine into a final answer...
Step 5 Final response 4,800 tok · call #5
Steps 1–4 re-sent in full: 3,300 tok re-sent
assistant (final) 1,500 tok
Top 3 open-source LLMs of 2025 — Llama 3.3 70B: best overall instruction-following, 128k context, ideal for general use. Qwen 2.5 72B: strongest coding and multilingual, Apache 2.0 license. Mistral Small 3.1 24B: best efficiency-to-performance ratio, 128k context, ideal when compute is constrained.
Five steps, 11,950 tokens. A single direct answer to the same question: 550 tokens. That's a 21.7× difference, before adding any RAG retrieval. If your agent also fetches documents at each step, you're looking at 50–100× easily.
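The totals in the walk-through can be reproduced from the new tokens each step adds. A minimal sketch; the per-step figures are read off the example above (for instance, step 2 adds a 23-token tool call and a 627-token result), and the names are illustrative.

```python
from itertools import accumulate

# New tokens introduced at each step of the walk-through
NEW_TOKENS_PER_STEP = [550, 650, 900, 1200, 1500]
DIRECT_ANSWER_TOKENS = 550


def context_sizes(new_tokens: list[int]) -> list[int]:
    """Tokens processed on each call: everything so far plus this step's additions."""
    return list(accumulate(new_tokens))


if __name__ == "__main__":
    sizes = context_sizes(NEW_TOKENS_PER_STEP)
    total = sum(sizes)
    print(sizes)                                   # [550, 1200, 2100, 3300, 4800]
    print(total)                                   # 11950
    print(total - sum(NEW_TOKENS_PER_STEP))        # 7150 tokens were pure re-sends
    print(round(total / DIRECT_ANSWER_TOKENS, 1))  # 21.7
```

Only 4,800 of the 11,950 tokens are new content; the remaining 7,150 are history paid for again.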
"Just add an agent" is not a free upgrade; it's a different cost model entirely. Five steps don't cost 5×. In the example above, they cost 22×. Anthropic's guide on building effective agents is direct: "agentic systems often trade latency and cost for better task performance." Start simple. Add orchestration only when you've exhausted the simpler approach. (Source: Anthropic)
Key insight
The cost isn't additive; it's triangular. If every step adds a similar amount of context, a 10-step agent costs roughly 55× a single call on input tokens alone. Each step pays for every step before it.
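The 55× figure comes from a simplified model, stated here as an assumption: each step adds about the same number of tokens c to the context, so call k sends roughly k·c tokens. The symbols c and n (number of steps) are introduced only for this illustration.

```latex
\text{total input} \;=\; \sum_{k=1}^{n} k\,c \;=\; c\,\frac{n(n+1)}{2},
\qquad n = 10 \;\Rightarrow\; \frac{10 \cdot 11}{2}\,c = 55\,c
```

For n = 5 the same formula gives 15c. The 5-step example above lands at about 22× instead because its later steps add more tokens than its first one.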
Level 4: Agent-to-agent, the token explosion
Multi-agent systems multiply the agentic cost model by the number of agents. Each sub-agent runs its own context-accumulating loop independently. The orchestrator then pays to read all their outputs and synthesize a response. By the time an answer reaches you, every agent has quietly run up its own bill.
Agent-to-agent — orchestrator dispatches sub-agents
Orchestrator
planning + synthesis · 3 API calls
~4,500 tokens
call 1: decompose task → assign to 3 agents
call 2: coordinate mid-task
call 3: synthesize all agent outputs
Research Agent
web search + retrieval
~9,000 tokens
step 1: plan search
step 2: web_search × 2
step 3: fetch_url × 1
step 4: compile findings
each step re-sends prior context
Analysis Agent
reasoning + comparison
~12,500 tokens
step 1: ingest research
steps 2–4: reason across angles
step 5: structured output
context grows to ~6k by step 5
Writing Agent
drafting + formatting
~9,500 tokens
step 1: outline
steps 2–3: draft sections
step 4: format + polish
high output token cost
Orchestrator
synthesis · reads all agent outputs
included in ~4,500 total
reads ~31k tokens of agent output
produces final unified response
This is a modest example: 3 agents, 4–5 steps each. Real pipelines run 5–10 agents with RAG at each step, pushing token consumption into the hundreds of thousands per task. Engineers running parallel agentic workflows have logged enough in a single session to exhaust a full day's rolling usage limit many times over. (Source: Latent.Space)
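Adding up the example pipeline shows where the multiplier comes from. A minimal sketch using the approximate per-component figures from the example and the 550-token direct answer as the baseline; the dictionary keys are illustrative labels.

```python
# Approximate token totals per component, taken from the example pipeline
PIPELINE_TOKENS = {
    "orchestrator": 4_500,
    "research_agent": 9_000,
    "analysis_agent": 12_500,
    "writing_agent": 9_500,
}
SINGLE_CALL_TOKENS = 550  # direct-answer baseline


def pipeline_total(tokens_by_component: dict[str, int]) -> int:
    """Total tokens for one task across the orchestrator and all sub-agents."""
    return sum(tokens_by_component.values())


if __name__ == "__main__":
    total = pipeline_total(PIPELINE_TOKENS)
    sub_agents = total - PIPELINE_TOKENS["orchestrator"]
    print(total)                                  # 35500
    print(sub_agents)                             # 31000
    print(round(total / SINGLE_CALL_TOKENS, 1))   # 64.5
```

The sub-agents account for 31,000 of the 35,500 tokens, so the number of agents and the length of each agent's loop are the first places to look for savings.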
How it stacks up
[Chart: tokens per task, relative scale (max = 35,500)]
What to do about it
Individual / personal projects
Match model to task. Don't send long documents to GPT-4o for a summary. GPT-4o mini, Gemini Flash, or Claude Haiku handle most tasks at 10–20× lower cost.
Restart conversations regularly. Your whole conversation history is re-sent on every turn. Long threads accumulate fast. Start a new chat when topics change.
Compress your system prompts. Every token in your system prompt is paid on every single call. Keep it tight.
Think before running agents. If you can answer the question with a single well-crafted prompt, do that first. Agents are for tasks that genuinely require multi-step planning.
Set spending limits. Most major providers let you set monthly budgets or spend limits. Use them.
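The "restart conversations" advice above is worth quantifying. A minimal sketch, assuming every turn adds about 200 tokens to the history (a hypothetical average) and that each turn re-sends everything before it; the function name is illustrative.

```python
def thread_input_tokens(turns: int, tokens_per_turn: int) -> int:
    """Total input tokens for a chat thread where every turn re-sends the history.

    Turn k sends roughly k * tokens_per_turn tokens.
    """
    return sum(k * tokens_per_turn for k in range(1, turns + 1))


if __name__ == "__main__":
    one_long_thread = thread_input_tokens(20, 200)
    two_short_threads = 2 * thread_input_tokens(10, 200)
    print(one_long_thread)     # 42000
    print(two_short_threads)   # 22000
```

Under these assumptions, splitting one 20-turn thread into two 10-turn threads cuts input tokens by nearly half for the same number of messages.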
Enterprise teams
Instrument everything. Token usage per user, per workflow, per agent step. You can't manage what you don't measure.
Set per-user rate limits. One runaway agentic workflow can generate more tokens in an hour than your entire organization normally uses in a day.
Audit your RAG chunk strategy. Retrieving 10 large chunks per query may not be better than 3 well-targeted small ones, and costs at least 3× more per call.
Evaluate self-hosting for high-volume workflows. Agentic pipelines run 12K–100K tokens per task. At scale, a fixed GPU cost often beats variable API spend.
Data residency is a real concern. In agentic workflows, sensitive data passes through the model repeatedly across many calls. Know where your tokens go.
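The first two items in this list (instrument everything, set per-user limits) can share one small piece of code. This is a hypothetical in-memory sketch, not a production design: `TokenMeter` and its methods are illustrative names, and a real deployment would persist the ledger and reset budgets on a schedule.

```python
from collections import defaultdict


class TokenMeter:
    """In-memory token ledger with a per-user budget (illustrative sketch)."""

    def __init__(self, per_user_budget: int) -> None:
        self.per_user_budget = per_user_budget
        # (user, workflow, step) -> tokens recorded so far
        self.usage: dict[tuple[str, str, str], int] = defaultdict(int)

    def user_total(self, user: str) -> int:
        """Tokens recorded for one user across all workflows and steps."""
        return sum(t for (u, _, _), t in self.usage.items() if u == user)

    def record(self, user: str, workflow: str, step: str, tokens: int) -> bool:
        """Record usage; return False and record nothing if the budget would be exceeded."""
        if self.user_total(user) + tokens > self.per_user_budget:
            return False
        self.usage[(user, workflow, step)] += tokens
        return True


if __name__ == "__main__":
    meter = TokenMeter(per_user_budget=10_000)
    print(meter.record("alice", "research", "step-1", 550))     # True
    print(meter.record("alice", "research", "step-2", 1_200))   # True
    print(meter.record("alice", "research", "step-3", 9_000))   # False
    print(meter.user_total("alice"))                            # 1750
```

Keying the ledger by user, workflow, and step means the same data answers both "who is spending" and "which step is expensive".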
Quick reference
| Setup | Typical tokens / task | vs single call | Main driver |
|---|---|---|---|
| Single call | 300–1,000 | 1× | System prompt + message |
| RAG call | 2,000–10,000 | 5–15× | Retrieved chunk count and size |
| Long conversation (20 turns) | 10,000–50,000 | 20–90× | Full history re-sent each turn |
| Agentic loop (5 steps) | 8,000–25,000 | 15–45× | Accumulating context per step |
| Agentic + RAG (5 steps) | 20,000–80,000 | 40–150× | Context + retrieved docs × steps |
| Multi-agent A2A (3 agents) | 30,000–120,000 | 55–220× | Agents × steps × context |
Reading list