The promise of AI-powered development was supposed to make everything cheaper, faster, and more efficient. And in many ways, it has. But there's a growing tension simmering beneath the surface of every enterprise AI deployment: the cost of using proprietary large language models is climbing — and climbing fast — even as per-token prices hold steady or drop on paper.
This isn't a story about sticker prices. It's a story about how the way we use these models has fundamentally changed, how agentic workloads have sent token consumption soaring, and how companies that aren't watching carefully discover the damage only when the invoice arrives.
On April 15, 2026, Anthropic quietly updated the costs page on its Claude Code documentation. The average estimated daily spend per developer jumped from $6 to $13, a 117% increase. The estimated ceiling for 90% of users climbed from $12 to $30 per active day. Enterprise deployments now budget between $150 and $250 per developer per month.
Anthropic insists nothing changed under the hood. An Anthropic spokesperson stated that the update simply reflects more recent usage data from real customers. But the effect is the same: engineering leaders who had budgeted based on the old estimates are now staring at projections that look fundamentally different.
As Anthropic's head of growth, Amol Avasare, acknowledged on X: engagement per subscriber has risen sharply, and current subscription plans weren't designed for this level of usage.
The Claude Code story is a microcosm of a much larger trend. Real-world usage costs are climbing as agentic AI becomes more widely adopted. Customers are running more agents, working with much longer context windows, and chaining more tool calls together. All of that consumes more tokens, which means higher bills — even when the per-token price hasn't moved.
One of the most insidious cost dynamics in the current LLM landscape is the gap between headline price and effective cost. The release of Claude Opus 4.7 on April 16, 2026 is the perfect case study.
The official line from Anthropic was straightforward: prices are unchanged from Opus 4.6. Five dollars per million input tokens, twenty-five dollars per million output tokens. But buried in the release notes was a critical detail — Opus 4.7 ships with a new tokenizer that can produce up to 35% more tokens for the same input text. The same paragraph of prose, the same Python function, the same JSON payload all break into more tokens in 4.7 than they did in 4.6.
A request that cost $0.10 on Opus 4.6 can now cost $0.135 on Opus 4.7. And because output tokens are priced 5x higher than input tokens, the effect compounds on the output side: the denser tokenizer splits the same text into more tokens, and any extra verbosity adds more text on top, all billed at the highest rate on the card.
OpenAI followed a similar playbook with GPT-5.5, released on April 23, 2026. The company doubled the per-token price compared to GPT-5.4 — input went from $2.50 to $5.00, output from $15.00 to $30.00 per million tokens. OpenAI argues that a roughly 20% net efficiency improvement offsets the higher rate card, but for teams processing tens of millions of tokens monthly, the math is unforgiving.
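None of this shows up on a rate card, so it is worth doing the arithmetic explicitly. The sketch below is a back-of-the-envelope calculator built only from the figures quoted above; the token multipliers (1.35 for Opus 4.7's tokenizer inflation, 0.80 for GPT-5.5's claimed efficiency gain) are assumptions about average behavior, and real workloads will vary.

```python
# Back-of-the-envelope effective cost per request.
# Prices are USD per million tokens; multipliers are assumed averages.

def effective_cost(tokens_in: int, tokens_out: int,
                   price_in: float, price_out: float,
                   token_multiplier: float = 1.0) -> float:
    """Cost of one request after scaling token counts by a tokenizer factor."""
    return (tokens_in * price_in + tokens_out * price_out) * token_multiplier / 1_000_000

# A request measured at 10K input / 2K output tokens on the old tokenizers:
print(effective_cost(10_000, 2_000, 5.00, 25.00))        # Opus 4.6: $0.100
print(effective_cost(10_000, 2_000, 5.00, 25.00, 1.35))  # Opus 4.7: $0.135
print(effective_cost(10_000, 2_000, 2.50, 15.00))        # GPT-5.4:  $0.055
print(effective_cost(10_000, 2_000, 5.00, 30.00, 0.80))  # GPT-5.5:  $0.088
```

Even granting the full 20% efficiency gain, the GPT-5.5 request lands 60% above its GPT-5.4 equivalent. For reference, the current headline rate cards: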
| Model | Input / MTok | Cached Input / MTok | Output / MTok | Context |
|---|---|---|---|---|
| **Anthropic** | | | | |
| Claude Opus 4.7 | $5.00 | $0.50 | $25.00 | 1M tokens |
| Claude Sonnet 4.6 | $3.00 | $0.30 | $15.00 | 1M tokens |
| Claude Haiku 4.5 | $1.00 | $0.10 | $5.00 | 200K tokens |
| **OpenAI** | | | | |
| GPT-5.5 | $5.00 | $1.25 | $30.00 | 1.05M tokens |
| GPT-5.4 | $2.50 | $0.25 | $15.00 | 272K+ tokens |
| GPT-5.4 Mini | $0.75 | $0.075 | $3.00 | 1.05M tokens |
| GPT-5.4 Nano | $0.20 | $0.020 | $1.25 | 1.05M tokens |
Companies still running legacy Claude Opus 3 or Opus 4.1 models are paying three times current-generation rates for inferior performance. Migrating off legacy models is often the single highest-ROI change an organization can make to its AI budget.
| Model | Input / MTok | Output / MTok | Notes |
|---|---|---|---|
| Claude Opus 4.1 | $15.00 | $75.00 | 3x the price of Opus 4.7 |
| Claude Opus 3 | $15.00 | $75.00 | Deprecated; still available |
| GPT-5.4 Pro | $30.00 | $180.00 | Premium reasoning tier |
| GPT-4.1 | $2.00 | $8.00 | Still strong for many tasks |
The per-token price is only one variable. What has fundamentally shifted is the volume of tokens that modern workloads consume. Several converging forces are responsible.
Gartner predicts that 40% of enterprise applications will embed task-specific AI agents by the end of 2026, up from less than 5% in 2025. These agents don't just answer questions — they plan multi-step approaches, call tools, chain API requests, and reason across long context windows. Every step consumes tokens.
According to Datadog's April 2026 State of AI Engineering report, rate-limit errors accounted for nearly a third of all LLM call failures observed in March 2026, approximately 8.4 million errors in total. The demand for compute is already pressing against provider capacity ceilings.
Claude Code's Agent Teams feature illustrates the math perfectly. A 3-agent team uses roughly 7x more tokens than a standard single-agent session, because each agent maintains its own context window and runs as a separate Claude instance. Auto-accept mode, which lets Claude execute file edits without human confirmation, further increases both tool-call count and session length.
Both Anthropic and OpenAI now offer 1M+ token context windows at standard pricing, a genuine engineering achievement. But the financial catch is that developers actually fill those windows, and a full window can cost 10–100x what a short prompt does. OpenAI also adds an explicit surcharge for long contexts on some models: GPT-5.5 prompts exceeding 272K input tokens are priced at 2x input and 1.5x output for the full session.
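That surcharge creates a cliff worth modeling before leaning on long contexts. The sketch below applies the rule exactly as stated (input beyond 272K tokens reprices the whole session at 2x input and 1.5x output); the exact billing mechanics are an assumption based on that description, so verify against the official pricing docs before budgeting.

```python
# Estimate a GPT-5.5 request cost under the long-context surcharge.
# Assumes the stated rule: input > 272K tokens reprices the entire
# session at 2x input / 1.5x output. Verify against official docs.

PRICE_IN, PRICE_OUT = 5.00, 30.00      # USD per million tokens
LONG_CONTEXT_THRESHOLD = 272_000       # input tokens

def gpt55_cost(tokens_in: int, tokens_out: int) -> float:
    over = tokens_in > LONG_CONTEXT_THRESHOLD
    in_mult, out_mult = (2.0, 1.5) if over else (1.0, 1.0)
    return (tokens_in * PRICE_IN * in_mult
            + tokens_out * PRICE_OUT * out_mult) / 1_000_000

print(gpt55_cost(272_000, 4_000))  # just under the line: ~$1.48
print(gpt55_cost(273_000, 4_000))  # just over it:        ~$2.91
```

Crossing the threshold by a sliver nearly doubles the cost of the same work, which is a strong argument for trimming context before reaching for the full window.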
Both providers have introduced new tokenizers with their latest models. Opus 4.7's tokenizer can generate up to 35% more tokens for identical input. This hits hardest on code, structured data, and non-English text.
| Cost Factor | Rate Card Effect | Actual Bill Effect |
|---|---|---|
| New tokenizer (Opus 4.7) | Unchanged | Up to +35% |
| Multi-agent sessions (3 agents) | Unchanged | ~7x token volume |
| Auto-accept mode | Unchanged | Higher tool-call count |
| 1M context window usage | Unchanged / surcharge | 10–100x vs. short prompts |
| Agentic reasoning loops | Unchanged | 3–4x tokens per task |
| GPT-5.5 per-token increase | +100% (in and out) | Directly doubled |
The gap between "we're using AI" and "we understand what AI costs us" is where most organizations lose money. The good news is that a mature set of practices and tooling has emerged over the past year.
Route all LLM calls through a single gateway or proxy. Open-source tools like LiteLLM provide a unified interface to 100+ providers while automatically tracking tokens, latency, and costs. Commercial gateways like Bifrost add adaptive load balancing and semantic caching on top.
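A minimal sketch of that pattern with LiteLLM's Python SDK is below. The model IDs are illustrative stand-ins for whatever your providers expose, and `completion_cost` reads from LiteLLM's built-in price map, which you should reconcile against your actual negotiated rates before using it for chargeback.

```python
# Route every call through one interface and record cost per request.
import litellm

def tracked_completion(model: str, messages: list[dict], **kwargs):
    response = litellm.completion(model=model, messages=messages, **kwargs)
    cost = litellm.completion_cost(completion_response=response)  # built-in price map
    usage = response.usage
    print(f"{model}: {usage.prompt_tokens} in / "
          f"{usage.completion_tokens} out -> ${cost:.6f}")
    return response

# Same call shape regardless of provider (model IDs are illustrative):
prompt = [{"role": "user", "content": "Summarize our Q3 retro notes."}]
tracked_completion("claude-opus-4-7", prompt)
tracked_completion("gpt-5.5", prompt)
```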
Tools built on OpenTelemetry — the industry standard for distributed tracing — can automatically link LLM costs to user journeys. Platforms like Langfuse (open-source) and Datadog LLM Observability (enterprise) provide per-trace cost breakdowns, custom tagging, and session-level spend analysis.
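In practice the pattern looks like the sketch below, written against Langfuse's v2-style Python decorator API (the SDK moves quickly, so treat the exact imports as an assumption and check the current docs). The structure is the point: every LLM call runs inside a trace that carries a user ID and feature tags, so spend shows up against a business dimension instead of one undifferentiated API bill.

```python
# Tie each LLM call to a user and feature so cost reports can be sliced.
# Langfuse v2-style decorators; imports may differ in newer SDK versions.
from langfuse.decorators import observe, langfuse_context

@observe()  # opens a trace; nested LLM calls are captured as spans
def draft_reply(ticket_text: str, user_id: str) -> str:
    langfuse_context.update_current_trace(
        user_id=user_id,
        tags=["support", "draft-reply"],  # illustrative tags
    )
    # ... call your gateway (e.g. the tracked_completion above) here ...
    return "drafted reply"
```

Whichever platform you choose, the capability to insist on is the same: per-trace cost, custom tags, and session-level rollups.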
| Tool | Type | Key Strength | Pricing |
|---|---|---|---|
| Langfuse | Open-source | Trace-level cost attribution | Free (self-hosted); $29+/mo |
| LiteLLM | Open-source proxy | Multi-provider gateway, budgets | Free |
| Datadog | Enterprise APM | Unified infra + LLM view | Usage-based |
| Helicone | Proxy-based | One-line integration | Free tier (10K req/mo) |
| Braintrust | Observability + evals | Eval alongside cost tracking | Free tier; $249/mo Pro |
| CloudZero | FinOps platform | Unit-economics by customer | Enterprise |
| Finout | Cloud cost mgmt | Cross-provider allocation | Enterprise |
Not every request needs a frontier model. The price difference between GPT-5.4 Nano ($0.20/MTok input) and GPT-5.4 Standard ($2.50/MTok input) is 12.5x. A cascading architecture, where requests first route through a lightweight model and escalate only when complexity demands it, can reduce costs by 60–80%.
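A minimal version of the cascade is just a confidence gate, sketched below with LiteLLM and illustrative model IDs. Self-reported confidence is one of several possible escalation signals (output validation failures or a dedicated classifier work too), and the threshold is an assumption to tune, not a reference design.

```python
# Cascading router: try the cheap model first, escalate on low confidence.
import json
import litellm

CHEAP, FRONTIER = "gpt-5.4-nano", "gpt-5.5"  # illustrative model IDs

def cascade(question: str) -> str:
    probe = litellm.completion(
        model=CHEAP,
        messages=[{
            "role": "user",
            "content": ('Answer the question, then rate your confidence 0-1. '
                        'Reply as JSON: {"answer": "...", "confidence": 0.0}\n\n'
                        + question),
        }],
        response_format={"type": "json_object"},
    )
    result = json.loads(probe.choices[0].message.content)
    if result["confidence"] >= 0.8:  # tunable escalation threshold
        return result["answer"]
    # Low confidence: pay frontier rates only for the hard cases.
    final = litellm.completion(
        model=FRONTIER,
        messages=[{"role": "user", "content": question}],
    )
    return final.choices[0].message.content
```

If most traffic is simple, the blended rate converges toward the cheap model's price while hard queries still get frontier quality.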
Prompt caching is the single largest cost lever available to API users. Both providers bill cache hits at a steep discount, as low as 10% of the standard input price on most models. Many chat applications achieve 30–50% cache-hit rates from the system prompt alone, with no architectural changes required.
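On the Anthropic API, caching is opt-in via `cache_control` markers on the stable prefix of a request. The sketch below caches a long system prompt so every subsequent call pays the discounted rate on it; the model ID is illustrative, and minimum cacheable sizes and cache TTLs vary by model, so check the caching docs for your limits.

```python
# Cache a long, stable system prompt; later calls pay the cached-input rate.
import anthropic

client = anthropic.Anthropic()
LONG_SYSTEM_PROMPT = "You are a support agent for ... (several KB of policy text)"

response = client.messages.create(
    model="claude-opus-4-7",  # illustrative model ID
    max_tokens=1024,
    system=[{
        "type": "text",
        "text": LONG_SYSTEM_PROMPT,
        "cache_control": {"type": "ephemeral"},  # mark this prefix cacheable
    }],
    messages=[{"role": "user", "content": "Where is my order?"}],
)
# Usage fields show whether the call wrote to or read from the cache:
print(response.usage.cache_creation_input_tokens,
      response.usage.cache_read_input_tokens)
```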
Both Anthropic and OpenAI offer Batch APIs at a 50% discount across all models for workloads that can tolerate up to 24 hours of latency.
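Anthropic's Message Batches API makes this nearly a drop-in change: the same request params go into a batch envelope and results come back asynchronously at half price. A minimal sketch follows (illustrative model ID; polling and result retrieval omitted for brevity).

```python
# Submit offline work at the 50% batch discount; results arrive async.
import anthropic

client = anthropic.Anthropic()

batch = client.messages.batches.create(
    requests=[
        {
            "custom_id": f"doc-{i}",  # your key for matching results later
            "params": {
                "model": "claude-sonnet-4-6",  # illustrative model ID
                "max_tokens": 512,
                "messages": [{"role": "user", "content": f"Summarize document {i}."}],
            },
        }
        for i in range(100)
    ]
)
print(batch.id, batch.processing_status)  # poll until "ended", then fetch results
```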
Set per-team or per-user budget caps with automatic alerting when thresholds are crossed. LiteLLM supports maximum budgets per API key with automatic enforcement. The goal is catching anomalies the same day — not at month-end.
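With a LiteLLM proxy in place, enforcement becomes a property of the API key itself. The sketch below provisions a team key with a hard monthly cap through the proxy's key-generation endpoint; the endpoint and fields follow LiteLLM's documented key management API, while the URL, master key, and amounts are placeholders.

```python
# Provision a team API key with a hard spend cap on a LiteLLM proxy.
import requests

PROXY_URL = "http://localhost:4000"  # wherever your proxy runs
MASTER_KEY = "sk-master-..."         # placeholder admin key

resp = requests.post(
    f"{PROXY_URL}/key/generate",
    headers={"Authorization": f"Bearer {MASTER_KEY}"},
    json={
        "team_id": "search-relevance",
        "max_budget": 500.0,       # USD; requests fail once this is spent
        "budget_duration": "30d",  # cap resets monthly
    },
    timeout=10,
)
print(resp.json()["key"])  # hand this key to the team
```

A per-key cap turns a surprise invoice into a same-day alert and a throttled key.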
| Technique | Potential Savings | Effort | Best For |
|---|---|---|---|
| Prompt caching | Up to 90% on input | Low | Repeated prompts, multi-turn chat |
| Batch API | 50% on all tokens | Low–Medium | Offline processing, evaluations |
| Model routing | 60–80% | Medium | Mixed-complexity workloads |
| Migrate off legacy | Up to 66% | Low | Teams on Opus 3/4.1 |
| Prompt optimization | 20–40% | Medium | Verbose prompts, bloated context |
| Semantic caching | Variable | Medium | FAQ bots, repeated queries |
Model API spending grew from $3.5 billion to $8.4 billion between late 2024 and mid-2025, more than doubling in under a year, and 72% of companies planned to increase their LLM spending further into 2026. The enterprise LLM market is projected to grow from $6.7 billion to $71.1 billion by 2034. These are not numbers that can be managed with spreadsheets and monthly invoice reviews.
The organizations that succeed will be the ones that treat AI cost management the way the industry learned to treat cloud cost management a decade ago: as a first-class engineering discipline, not an afterthought. That means instrumenting from day one, attributing every dollar to a business outcome, building cost awareness into architectural decisions, and continuously benchmarking whether the model and tier you're using is actually the right one for each workload.
The price of intelligence is rising. The question is whether you're paying attention before or after the bill arrives.
— Teckxx, Founder of OK ROBOT
Sign up here to receive updates on new blog posts and all things OK-ROBOT