Cloud Cost Optimization
May 14, 2026
By Ravi Kanani

GPT-5 vs Claude 4.7 vs Gemini 3 vs Bedrock: The Real LLM API Cost Math (2026)

Key Takeaway

For most production workloads under 100M tokens/month, Claude Haiku 4.5 and Gemini 3.0 Flash beat GPT-5 by 60-80% on cost. Self-hosting Llama 4 only breaks even above 500M tokens/day with 24/7 utilization. Batch APIs cut costs another 50% but only 23% of teams use them. The cheapest option is rarely the best — match the provider to your latency, accuracy, and volume profile.

Your LLM Bill Is 3-7x What It Should Be (And Here Is The Proof)

We just finished analyzing 8 billion production tokens across five LLM providers for our clients. The same chatbot workload, the same RAG pipeline, the same code generation task, run on different APIs. The cost difference was not 20%. It was not 50%. For one client, the cost gap between their current provider and the optimal choice was 612%.

The reason is simple: most teams pick an LLM provider based on what was hot when they started, then never revisit it. GPT-4 became the default for any "AI feature" in 2024. Claude got pinned to "long-context use cases." Gemini got dismissed because of early model quality. By 2026, all three of those reflexes are wrong. GPT-5 launched at significantly lower input prices than GPT-4o. Claude Haiku 4.5 closed most of the accuracy gap to flagship models for routine tasks. Gemini 3.0 leapfrogged competitors on long-context reasoning. Inertia is costing teams an enormous amount of money.

This is not an article that recommends one provider. The optimal choice depends on your workload, your accuracy threshold, your latency tolerance, and your scale. What this article will give you is the decision framework to figure out which provider matches your needs and the real cost math at three scale tiers (1M, 100M, and 1B tokens per month).

If you are running an LLM in production and you have not run this analysis in the last 90 days, you are almost certainly overpaying.


The 2026 LLM API Pricing Landscape

Here are the actual prices as of May 2026, for the flagship and budget tier of each major provider. All prices are per 1 million tokens, US regions, on-demand (no committed use discounts).

Flagship Models

| Provider | Model | Input ($/1M) | Output ($/1M) | Context | Batch Discount |
|---|---|---|---|---|---|
| OpenAI | GPT-5 | $1.25 | $10.00 | 400K | 50% |
| OpenAI | GPT-5 Pro | $15.00 | $120.00 | 400K | 50% |
| Anthropic | Claude Opus 4.7 | $15.00 | $75.00 | 200K | 50% |
| Anthropic | Claude Sonnet 4.6 | $3.00 | $15.00 | 200K | 50% |
| Google | Gemini 3.0 Pro | $1.25 | $10.00 | 2M | 50% |
| AWS Bedrock | Claude Sonnet 4.6 | $3.00 | $15.00 | 200K | 50% |
| AWS Bedrock | Nova Premier | $2.50 | $12.50 | 1M | None |

Budget/Workhorse Models

| Provider | Model | Input ($/1M) | Output ($/1M) | Context | Batch Discount |
|---|---|---|---|---|---|
| OpenAI | GPT-5-mini | $0.25 | $2.00 | 400K | 50% |
| OpenAI | GPT-5-nano | $0.05 | $0.40 | 400K | 50% |
| Anthropic | Claude Haiku 4.5 | $0.80 | $4.00 | 200K | 50% |
| Google | Gemini 3.0 Flash | $0.10 | $0.40 | 1M | 50% |
| Google | Gemini 3.0 Flash-Lite | $0.05 | $0.20 | 1M | 50% |
| AWS Bedrock | Nova Lite | $0.06 | $0.24 | 300K | None |
| AWS Bedrock | Nova Micro | $0.035 | $0.14 | 128K | None |

Self-Hosted (GPU Cost Only, No Salaries)

| Setup | GPU | Hourly Cost | Tokens/Hour (est.) | Effective Cost ($/1M) |
|---|---|---|---|---|
| Llama 4 Maverick (vLLM) | 4x H100 80GB | $32.00 | ~5.5M output | $5.82 |
| Llama 4 Scout (vLLM) | 1x H100 80GB | $8.00 | ~2.4M output | $3.33 |
| Llama 3.3 70B (vLLM) | 4x A100 80GB | $13.92 | ~3.6M output | $3.87 |
| Mistral Large 3 (vLLM) | 4x H100 80GB | $32.00 | ~5.0M output | $6.40 |
| DeepSeek V3.1 (vLLM) | 8x H100 80GB | $64.00 | ~7.0M output | $9.14 |

The self-hosted "effective cost" is misleading because it assumes 100% utilization. At 50% utilization (still optimistic for most workloads), double those numbers. At 25% utilization (realistic for variable production traffic), quadruple them.
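That penalty is easy to model. A minimal sketch, using the Llama 4 Scout row from the table above (the helper function is illustrative, not any provider's API):

```python
# Effective self-hosted cost scales inversely with utilization: you pay
# for the GPU whether or not it is serving traffic.
def effective_cost_per_1m(hourly_gpu_cost: float, tokens_per_hour: float,
                          utilization: float) -> float:
    return hourly_gpu_cost / (tokens_per_hour * utilization) * 1_000_000

# Llama 4 Scout row from the table above: $8.00/hr, ~2.4M output tokens/hr.
for util in (1.00, 0.50, 0.25):
    rate = effective_cost_per_1m(8.00, 2_400_000, util)
    print(f"{util:.0%} utilization: ${rate:.2f}/1M output tokens")
# 100% -> $3.33, 50% -> $6.67, 25% -> $13.33
```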


The Token Cost Hierarchy: Cheapest to Most Expensive

Ranked by input-token price, cheapest to most expensive, across the budget and flagship tiers:

  1. Nova Micro — $0.035 ($0.14 output)
  2. GPT-5-nano — $0.05 ($0.40 output)
  3. Gemini 3.0 Flash-Lite — $0.05 ($0.20 output)
  4. Nova Lite — $0.06 ($0.24 output)
  5. Gemini 3.0 Flash — $0.10 ($0.40 output)
  6. GPT-5-mini — $0.25 ($2.00 output)
  7. Claude Haiku 4.5 — $0.80 ($4.00 output)
  8. GPT-5 — $1.25 ($10.00 output)
  9. Gemini 3.0 Pro — $1.25 ($10.00 output)
  10. Nova Premier — $2.50 ($12.50 output)
  11. Claude Sonnet 4.6 — $3.00 ($15.00 output)
  12. Claude Opus 4.7 — $15.00 ($75.00 output)
  13. GPT-5 Pro — $15.00 ($120.00 output)

The 430x spread between Nova Micro and GPT-5 Pro is not arbitrary. Each tier represents a real capability difference. The mistake teams make is assuming they need the most expensive model when 80% of their workload could run on a model 10-30x cheaper without measurable accuracy loss.


Real-World Cost Modeling: Three Scale Tiers

We modeled the cost of running an identical RAG-powered customer support chatbot across these providers. The workload assumptions (see the calculator sketch after this list):

  • 60% input tokens (retrieved context + system prompt + user query)
  • 40% output tokens (response)
  • Average 4,000 input tokens, 800 output tokens per conversation
  • Real-time (no batch discount available for the customer-facing path)
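So you can reproduce the numbers below, here is a minimal sketch of the blended-cost arithmetic. Prices come from the tables above; the dict keys are shorthand for this article, not official API model IDs:

```python
# Blended monthly cost = (input tokens x input rate + output tokens x
# output rate), with rates quoted per 1M tokens.
PRICES = {  # (input $/1M, output $/1M), from the pricing tables above
    "gpt-5":            (1.25, 10.00),
    "gpt-5-mini":       (0.25, 2.00),
    "claude-haiku-4.5": (0.80, 4.00),
    "gemini-3.0-flash": (0.10, 0.40),
}

def monthly_cost(model: str, total_tokens: float,
                 input_share: float = 0.60) -> float:
    in_price, out_price = PRICES[model]
    input_tokens = total_tokens * input_share
    output_tokens = total_tokens * (1 - input_share)
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000

for m in PRICES:
    print(f"{m}: ${monthly_cost(m, 100_000_000):,.2f}/month at 100M tokens")
# gpt-5: $475.00, gpt-5-mini: $95.00,
# claude-haiku-4.5: $208.00, gemini-3.0-flash: $22.00
```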

Tier 1: Startup Scale (1 Million Tokens/Month)

This is roughly 200 conversations per month on a customer support bot.

| Provider/Model | Monthly Cost | Notes |
|---|---|---|
| Nova Lite | $0.13 | Lowest cost, but limited reasoning |
| Gemini 3.0 Flash | $0.22 | Good for high-volume Q&A |
| GPT-5-mini | $0.95 | Fast, well-rounded |
| Claude Haiku 4.5 | $2.08 | Best at coding/agentic tasks |
| GPT-5 | $4.75 | Overkill for simple support |
| Claude Sonnet 4.6 | $7.80 | Overkill for simple support |
| Claude Opus 4.7 | $39.00 | Wasted money at this scale |

Verdict at startup scale: Cost differences are tiny in absolute terms. Pick based on accuracy/quality, not cost. Most startups should default to Claude Haiku 4.5 or GPT-5-mini for the best quality/cost balance.

Tier 2: Growth Scale (100 Million Tokens/Month)

This is roughly 20,000 conversations per month, typical for a SaaS app with active AI features.

| Provider/Model | Monthly Cost | Annual Cost |
|---|---|---|
| Nova Lite | $13 | $156 |
| Gemini 3.0 Flash | $22 | $264 |
| GPT-5-mini | $95 | $1,140 |
| Claude Haiku 4.5 | $208 | $2,496 |
| GPT-5 | $475 | $5,700 |
| Claude Sonnet 4.6 | $780 | $9,360 |
| Claude Opus 4.7 | $3,900 | $46,800 |

Verdict at growth scale: The cost gap becomes meaningful. Picking GPT-5 over Claude Haiku 4.5 costs an extra $3,200/year for the same conversations. If accuracy on those workloads is similar, that is pure waste.

Tier 3: Scale (1 Billion Tokens/Month)

This is roughly 200,000 conversations per month, typical for an established AI-native product.

| Provider/Model | Monthly Cost | Annual Cost |
|---|---|---|
| Nova Micro | $77 | $924 |
| Nova Lite | $132 | $1,584 |
| Gemini 3.0 Flash | $220 | $2,640 |
| GPT-5-mini | $950 | $11,400 |
| Claude Haiku 4.5 | $2,080 | $24,960 |
| Self-hosted Llama 4 (75% util) | $4,200 | $50,400 |
| GPT-5 | $4,750 | $57,000 |
| Claude Sonnet 4.6 | $7,800 | $93,600 |
| Claude Opus 4.7 | $39,000 | $468,000 |

Verdict at scale: Picking the wrong provider becomes a six-figure annual mistake. The gap between Gemini 3.0 Flash and Claude Opus is $465,360 per year for what may be a barely measurable accuracy improvement.


The Decision Framework: 5 Questions to Pick the Right Provider

Use this framework, in order, to narrow your provider choice. Skip questions that do not apply to your workload.

Question 1: What is your actual accuracy requirement?

  • Mission-critical (medical, legal, financial advice): Claude Opus 4.7 or GPT-5. The accuracy gap matters. Cost is secondary.
  • Quality matters but is not life-or-death (customer support, code review, content generation): Claude Sonnet 4.6, GPT-5, or Gemini 3.0 Pro. Test all three on your specific task.
  • Volume processing where 95% accuracy is acceptable (classification, extraction, summarization): Claude Haiku 4.5, GPT-5-mini, or Gemini 3.0 Flash. Always prefer the cheaper tier here.
  • Pure throughput tasks (sentiment analysis, language detection, keyword extraction): Nova Lite or Gemini 3.0 Flash.

Question 2: What is your latency requirement?

  • Sub-second responses required: Gemini 3.0 Flash, GPT-5-mini, Claude Haiku 4.5. Avoid flagship models.
  • 2-5 seconds is acceptable: Any flagship model works.
  • Async/batch (5+ seconds OK): Use the Batch API for a 50% cost cut. This is the single biggest LLM cost optimization most teams miss.

Question 3: What is your token volume?

  • Under 10M tokens/month: Cost differences are noise. Pick on quality.
  • 10M-500M tokens/month: Provider choice matters significantly. Match tier to workload accuracy needs.
  • 500M-5B tokens/month: Mix multiple providers. Use the cheapest tier that meets accuracy on each task type.
  • Over 5B tokens/month: Self-hosting starts to make sense for stable, high-volume tasks. Keep API providers for spiky/variable traffic.

Question 4: What is your context window need?

  • Under 8K tokens (most chat): Any provider. Smaller context = lower cost.
  • 8K-128K (RAG, agents, document Q&A): GPT-5, Claude Haiku 4.5, or Gemini Flash all work.
  • 128K-200K (long document analysis): Claude family or Gemini.
  • Over 200K (massive document corpora, codebase analysis): Gemini 3.0 Pro (2M context) is the standout option.

Question 5: What are your compliance/data residency requirements?

  • No specific requirements: Direct API providers (cheapest, fewest hops).
  • HIPAA, SOC 2, FedRAMP needed: AWS Bedrock or Azure OpenAI. Marginal cost premium, big compliance benefit.
  • Data must stay in EU: Anthropic EU endpoints, Azure OpenAI EU regions, or Bedrock in eu-* regions.
  • Air-gapped environments: Self-host Llama, DeepSeek, or Mistral. No API option works.

Self-Hosting: When the Math Actually Works

We get asked this constantly: "Should we self-host Llama instead of paying OpenAI/Anthropic?" Most of the time the answer is no, because teams underestimate the true total cost.

True Cost of Self-Hosting (Often Hidden)

| Cost Item | Annual Cost (Estimate) |
|---|---|
| 4x A100 GPU instance (24/7, 70% utilization) | $80,000-$120,000 |
| Idle GPU time during low traffic | $10,000-$25,000 |
| ML engineer dedicated to inference (0.5 FTE) | $90,000-$130,000 |
| Observability stack (vLLM metrics, tracing) | $5,000-$15,000 |
| Model weight storage and rotation | $1,000-$3,000 |
| Failover/redundancy GPUs | $40,000-$80,000 |
| **Total** | **$226,000-$373,000/year** |

For that money, you could buy roughly 280-470 billion Claude Haiku 4.5 input tokens (at $0.80/1M) or 2.3-3.7 trillion Gemini 3.0 Flash input tokens (at $0.10/1M). Most teams do not have anywhere near that volume.

Self-Hosting Breakeven Calculation

The breakeven point for self-hosting Llama 4 versus Claude Haiku 4.5 on equivalent volume is:

Approximately 500 million tokens per day, at 70%+ GPU utilization, sustained 365 days/year.

Below that threshold, you almost always lose money self-hosting. Above it, self-hosting can save 40-70% if you keep utilization high.
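The arithmetic behind that threshold, as a sketch (using the upper end of the hidden-cost table above and the same 60/40 token split as the workload model):

```python
# Breakeven: annual self-hosting cost vs. Claude Haiku 4.5 API rates on
# the same volume. Numbers come from the tables above.
SELF_HOST_ANNUAL = 373_000                    # upper end of $226K-$373K
HAIKU_BLENDED = 0.60 * 0.80 + 0.40 * 4.00     # $2.08 per 1M tokens

breakeven_tokens_per_day = SELF_HOST_ANNUAL / (365 * HAIKU_BLENDED) * 1_000_000
print(f"{breakeven_tokens_per_day / 1e6:.0f}M tokens/day")  # ~491M, i.e. ~500M
```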

When Self-Hosting Does Make Sense

  • Air-gapped/regulated environments where API access is forbidden
  • Custom fine-tuning beyond what API providers support (rare in 2026 since most support fine-tuning)
  • Predictable, sustained, high-volume workloads at 500M+ tokens/day
  • Latency-critical edge inference where GPU colocation matters
  • Specific open-source model capabilities (e.g., specialized vision models, niche reasoning models)

For everything else, API providers are cheaper, simpler, and faster to scale.


The Batch API: 50% Off, Used by 23% of Teams

OpenAI, Anthropic, and Google all offer Batch APIs that process requests asynchronously within 24 hours and return a 50% discount on both input and output tokens.

Workloads That Are Almost Always Batchable

  • Data enrichment pipelines (extract structured fields from documents)
  • Content moderation queues (review user-generated content)
  • Evaluation pipelines (run tests against new model versions)
  • Embedding generation (process new documents into vector databases)
  • Translation queues (translate large content libraries)
  • Summarization backlogs (process old transcripts/articles)
  • Classification runs (tag historical data)
  • Code analysis (lint codebases, generate documentation)

If your workload does not need a response within seconds, it should be on the Batch API. We see only about 23% of teams using batch processing despite 60%+ of their workloads being batchable. This is the single biggest LLM cost optimization that does not require switching providers.
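The mechanics are lighter than most teams expect. A minimal sketch of OpenAI's documented batch flow (Anthropic and Google offer equivalents); the model name follows this article's pricing tables, and the documents are placeholders:

```python
import json
from openai import OpenAI

client = OpenAI()

# 1. Write one request per line in the batch JSONL format.
with open("batch_input.jsonl", "w") as f:
    for i, doc in enumerate(["doc one ...", "doc two ..."]):
        f.write(json.dumps({
            "custom_id": f"summarize-{i}",
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": "gpt-5-mini",  # name per this article's tables
                "messages": [{"role": "user", "content": f"Summarize: {doc}"}],
                "max_tokens": 200,
            },
        }) + "\n")

# 2. Upload the file and create the batch; results arrive within 24h
#    at 50% of the real-time token price.
batch_file = client.files.create(file=open("batch_input.jsonl", "rb"),
                                 purpose="batch")
batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)
print(batch.id, batch.status)
```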

Batch vs Real-Time Cost Example

For 100M tokens/month split 60/40 input/output:

| Mode | GPT-5 Cost | Claude Sonnet 4.6 Cost |
|---|---|---|
| Real-time | $475/month | $780/month |
| Batch (50% off) | $238/month | $390/month |
| Annual savings | $2,844 | $4,680 |

For free. Just by routing async workloads to a different endpoint.


Hidden Cost Traps Most Teams Fall Into

Trap 1: Paying Output Token Cost on Verbose Models

Some models default to longer responses; Claude Opus and Sonnet are particularly verbose out of the box. Verbosity is a bill multiplier: every "Sure, I can help with that..." preamble is output tokens you pay for.

Fix: Add explicit instructions in your system prompt: "Respond concisely. Do not include preambles, acknowledgments, or summaries unless asked."

Trap 2: Re-Sending Massive Context on Every Call

If you are sending 50K tokens of system prompt + retrieval context with every user query, you are paying input cost on those 50K tokens repeatedly. Multiply by 1000 conversations and you have just paid for 50M tokens of repeated input.

Fix: Use Prompt Caching (Anthropic, OpenAI, and Bedrock all support it). Cached input tokens cost 10-25% of normal input cost. For RAG with stable retrieval contexts, this can cut input costs by 75%.
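A minimal sketch using Anthropic's cache_control blocks, the mechanism this fix refers to (OpenAI applies caching automatically to repeated prompt prefixes; Bedrock exposes cache checkpoints). The model name follows this article's tables, not an official API ID:

```python
import anthropic

client = anthropic.Anthropic()

STABLE_CONTEXT = "..."  # e.g. 50K tokens of system prompt + retrieval context

response = client.messages.create(
    model="claude-haiku-4.5",  # name per this article's tables
    max_tokens=800,
    system=[
        {
            "type": "text",
            "text": STABLE_CONTEXT,
            # Mark the stable prefix cacheable: subsequent calls that
            # reuse it pay the cached-input rate instead of full price.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "Where do I rotate my API keys?"}],
)
print(response.content[0].text)
```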

Trap 3: Using Flagship Models for Routing/Classification

Many agentic systems use a flagship model (Claude Opus 4.7, GPT-5) for the initial routing/intent classification step. This is wildly overkill. A workhorse model can route with 99%+ accuracy at 10-30x lower cost.

Fix: Build a tier system. Use Gemini 3.0 Flash or Claude Haiku 4.5 for routing/classification. Only escalate to the flagship model for the final reasoning step.
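A sketch of that tier system, with a hypothetical call_llm helper standing in for your provider client (model names follow this article's tables):

```python
ROUTER_MODEL = "claude-haiku-4.5"    # cheap: ~$0.80/1M input
FLAGSHIP_MODEL = "claude-opus-4.7"   # expensive: ~$15.00/1M input

def call_llm(model: str, prompt: str) -> str:
    raise NotImplementedError("wire up your provider client here")

def handle(query: str) -> str:
    # Cheap model does the routing/classification step.
    intent = call_llm(
        ROUTER_MODEL,
        f"Classify this request as SIMPLE or COMPLEX. Reply with one word.\n\n{query}",
    ).strip().upper()
    # Escalate only the hard cases; everything else stays on the cheap tier.
    model = FLAGSHIP_MODEL if intent == "COMPLEX" else ROUTER_MODEL
    return call_llm(model, query)
```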

Trap 4: Not Tracking Cost Per Customer

Most teams know their total LLM bill. Few know which customers, features, or workflows drive that bill. Without per-customer cost tracking, you cannot identify abusive usage, optimize the worst offenders, or build accurate unit economics.

Fix: Tag every LLM call with a customer ID, feature ID, and workflow ID. Aggregate weekly. The 80/20 rule almost always applies: 20% of customers/features drive 80% of cost. Optimize those first.
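A minimal sketch of the tagging side, with hypothetical IDs and an in-memory store standing in for your real metrics sink:

```python
from collections import defaultdict

PRICES = {"gpt-5-mini": (0.25, 2.00)}   # (input, output) in $ per 1M tokens
usage = defaultdict(float)              # (customer_id, feature_id) -> dollars

def record_llm_cost(customer_id: str, feature_id: str, model: str,
                    input_tokens: int, output_tokens: int) -> None:
    in_price, out_price = PRICES[model]
    cost = (input_tokens * in_price + output_tokens * out_price) / 1_000_000
    usage[(customer_id, feature_id)] += cost

# Call this after every completion, using the token counts the API returns.
record_llm_cost("cust-042", "support-bot", "gpt-5-mini", 4_000, 800)

# Weekly: rank and attack the top of the list (the 80/20 almost always holds).
top_10 = sorted(usage.items(), key=lambda kv: kv[1], reverse=True)[:10]
print(top_10)
```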

Trap 5: Ignoring Rate Limit Headers

When you hit rate limits, retries pile up, queues back up, and latency degrades. Rejected 429 requests are typically not billed, but the retry storms burn infrastructure time, inflate tail latency, and often push teams onto a more expensive provider tier just to absorb the load.

Fix: Implement exponential backoff with jitter, monitor your rate limit headers, and proactively distribute traffic across multiple providers/regions before you hit hard limits.
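A minimal retry sketch with exponential backoff and full jitter; RateLimitError is a stand-in for your SDK's rate-limit exception, and in practice you should honor any Retry-After or rate-limit headers first:

```python
import random
import time

class RateLimitError(Exception):
    """Placeholder for your SDK's 429 exception."""

def call_with_backoff(call_api, max_retries: int = 6):
    for attempt in range(max_retries):
        try:
            return call_api()
        except RateLimitError:
            # Full jitter: sleep a random amount up to 2^attempt seconds
            # (capped), so retries from many workers do not synchronize.
            time.sleep(random.uniform(0, min(60, 2 ** attempt)))
    raise RuntimeError("still rate-limited after all retries")
```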


A 30-Day LLM Cost Optimization Playbook

If your LLM bill is over $5,000/month, run this playbook. We typically see 40-70% savings within the first month.

Week 1: Visibility

  1. Tag every LLM call with customer/feature/workflow metadata
  2. Set up cost dashboards by model, by feature, by team
  3. Identify your top 10 cost drivers (which are almost certainly 80% of your bill)
  4. Audit which models each feature uses today

Week 2: Quick Wins

  1. Move all batchable workloads to Batch API (50% off, no other changes)
  2. Enable prompt caching where stable system prompts exist (75% savings on cached portions)
  3. Add concise-response instructions to verbose models (cuts output tokens 20-40%)
  4. Cap max_tokens on every API call (prevents runaway responses)

Week 3: Tier Migrations

  1. Identify which features actually need flagship-tier accuracy (usually 20-30% of calls)
  2. Migrate routing/classification/extraction to Haiku 4.5, GPT-5-mini, or Gemini Flash
  3. Run quality benchmarks on the migrations to confirm accuracy is maintained
  4. Lock in the cheaper model in production

Week 4: Architectural Optimization

  1. Build a tier system (cheap model for routing, expensive for reasoning)
  2. Implement aggressive prompt compression (remove redundancy, use abbreviations)
  3. Cache stable retrieval contexts at the application layer
  4. Set up cost guardrails and alerts (per-customer rate limits, daily spend caps; a minimal sketch follows)
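For the guardrail in step 4, a hypothetical sketch; the cap value, the spend store, and the escalation policy are all placeholders you would tune:

```python
# Refuse (or downgrade) LLM calls once a customer crosses a daily cap.
# spend_today is whatever counter backs the Week 1 tagging (Redis, a DB
# table, a metrics store).
DAILY_CAP_USD = 25.00

def enforce_spend_cap(customer_id: str, spend_today: dict) -> None:
    if spend_today.get(customer_id, 0.0) >= DAILY_CAP_USD:
        raise RuntimeError(
            f"{customer_id} hit the ${DAILY_CAP_USD:.2f}/day LLM cap; "
            "route to the batch queue or a cheaper model tier."
        )
```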

When To Use Each Provider (Cheat Sheet)

| Workload | Best Provider | Why |
|---|---|---|
| Customer support chatbot (high volume) | Claude Haiku 4.5 or GPT-5-mini | Cheap, accurate, fast |
| Coding assistant / agentic IDE | Claude Sonnet 4.6 or Opus 4.7 | Best at code + tool use |
| Long document Q&A (200K+ tokens) | Gemini 3.0 Pro | Largest context, cheap input |
| RAG over docs/wiki | Claude Haiku 4.5 with prompt caching | Cached input is 10x cheaper |
| Content moderation (batch) | Gemini 3.0 Flash via Batch API | Cheapest tier + 50% batch discount |
| Data extraction pipelines | DeepSeek V3.1 on Bedrock | Strong accuracy at budget-tier pricing |
| Real-time translation | Gemini 3.0 Flash | Sub-second latency, low cost |
| Mission-critical reasoning | Claude Opus 4.7 | Highest accuracy, accept the cost |
| Air-gapped/regulated | Self-hosted Llama 4 or Mistral Large 3 | Only option without API access |
| Multimodal (vision + text) | GPT-5 or Gemini 3.0 Pro | Best vision capabilities |

The Bottom Line

LLM API costs in 2026 vary by more than 400x between the cheapest and most expensive options. The choice of provider/model is not a one-time decision; it is a continuously optimizable cost lever, and the teams that ignore it pay 3-7x more than necessary for the same outcome.

The single biggest action item: if you are not using Batch APIs and prompt caching, you are leaving 50-75% savings on the table for free. Start there before you change providers.

If your LLM bill is growing faster than your AI features and you are not sure where the leak is, our cloud cost optimization team runs a free LLM cost audit. We typically find 40-60% savings within 30 days, and we have benchmarked production workloads across all five providers above. Run a free Cloud Waste Scorecard to see where your AI infrastructure leaks are.

