Cloud Cost Optimization
May 14, 2026
By Ravi Kanani

GPT-5 vs Claude 4.7 vs Gemini 3 vs Bedrock: The Real LLM API Cost Math (2026)

Key Takeaway

For most production workloads under 100M tokens/month, Claude Haiku 4.5 and Gemini 3.0 Flash beat GPT-5 by 60-80% on cost. Self-hosting Llama 4 only breaks even above 500M tokens/day with 24/7 utilization. Batch APIs cut costs another 50% but only 23% of teams use them. The cheapest option is rarely the best — match the provider to your latency, accuracy, and volume profile.

Your LLM Bill Is 3-7x What It Should Be (And Here Is The Proof)

We just finished analyzing 8 billion production tokens across five LLM providers for our clients. The same chatbot workload, the same RAG pipeline, the same code generation task, run on different APIs. The cost difference was not 20%. It was not 50%. For one client, the cost gap between their current provider and the optimal choice was 612%.

The reason is simple: most teams pick an LLM provider based on what was hot when they started, then never revisit it. GPT-4 became the default for any "AI feature" in 2024. Claude got pinned to "long-context use cases." Gemini got dismissed because of early model quality. By 2026, all three of those reflexes are wrong. GPT-5 launched at significantly lower input prices than GPT-4o. Claude Haiku 4.5 closed most of the accuracy gap to flagship models for routine tasks. Gemini 3.0 leapfrogged competitors on long-context reasoning. Inertia is costing teams an enormous amount of money.

This is not an article that recommends one provider. The optimal choice depends on your workload, your accuracy threshold, your latency tolerance, and your scale. What this article will give you is the decision framework to figure out which provider matches your needs and the real cost math at three scale tiers (1M, 100M, and 1B tokens per month).

If you are running an LLM in production and you have not run this analysis in the last 90 days, you are almost certainly overpaying.


The 2026 LLM API Pricing Landscape

Here are the actual prices as of May 2026, for the flagship and budget tier of each major provider. All prices are per 1 million tokens, US regions, on-demand (no committed use discounts).

Flagship Models

| Provider | Model | Input ($/1M) | Output ($/1M) | Context | Batch Discount |
|---|---|---|---|---|---|
| OpenAI | GPT-5 | $1.25 | $10.00 | 400K | 50% |
| OpenAI | GPT-5 Pro | $15.00 | $120.00 | 400K | 50% |
| Anthropic | Claude Opus 4.7 | $15.00 | $75.00 | 200K | 50% |
| Anthropic | Claude Sonnet 4.6 | $3.00 | $15.00 | 200K | 50% |
| Google | Gemini 3.0 Pro | $1.25 | $10.00 | 2M | 50% |
| AWS Bedrock | Claude Sonnet 4.6 | $3.00 | $15.00 | 200K | 50% |
| AWS Bedrock | Nova Premier | $2.50 | $12.50 | 1M | None |

Budget/Workhorse Models

| Provider | Model | Input ($/1M) | Output ($/1M) | Context | Batch Discount |
|---|---|---|---|---|---|
| OpenAI | GPT-5-mini | $0.25 | $2.00 | 400K | 50% |
| OpenAI | GPT-5-nano | $0.05 | $0.40 | 400K | 50% |
| Anthropic | Claude Haiku 4.5 | $0.80 | $4.00 | 200K | 50% |
| Google | Gemini 3.0 Flash | $0.10 | $0.40 | 1M | 50% |
| Google | Gemini 3.0 Flash-Lite | $0.05 | $0.20 | 1M | 50% |
| AWS Bedrock | Nova Lite | $0.06 | $0.24 | 300K | None |
| AWS Bedrock | Nova Micro | $0.035 | $0.14 | 128K | None |

Self-Hosted (GPU Cost Only, No Salaries)

| Setup | GPU | Hourly Cost | Tokens/Hour (est.) | Effective Cost ($/1M) |
|---|---|---|---|---|
| Llama 4 Maverick (vLLM) | 4x H100 80GB | $32.00 | ~5.5M output | $5.82 |
| Llama 4 Scout (vLLM) | 1x H100 80GB | $8.00 | ~2.4M output | $3.33 |
| Llama 3.3 70B (vLLM) | 4x A100 80GB | $13.92 | ~3.6M output | $3.87 |
| Mistral Large 3 (vLLM) | 4x H100 80GB | $32.00 | ~5.0M output | $6.40 |
| DeepSeek V3.1 (vLLM) | 8x H100 80GB | $64.00 | ~7.0M output | $9.14 |

The self-hosted "effective cost" is misleading because it assumes 100% utilization. At 50% utilization (still optimistic for most workloads), double those numbers. At 25% utilization (realistic for variable production traffic), quadruple them.
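That penalty is easy to model. A minimal sketch, using the Llama 4 Scout row from the table above (the helper function is illustrative, not any provider's API):

```python
# Effective self-hosted cost scales inversely with utilization: you pay
# for the GPU whether or not it is serving traffic.
def effective_cost_per_1m(hourly_gpu_cost: float, tokens_per_hour: float,
                          utilization: float) -> float:
    return hourly_gpu_cost / (tokens_per_hour * utilization) * 1_000_000

# Llama 4 Scout row from the table above: $8.00/hr, ~2.4M output tokens/hr.
for util in (1.00, 0.50, 0.25):
    rate = effective_cost_per_1m(8.00, 2_400_000, util)
    print(f"{util:.0%} utilization: ${rate:.2f}/1M output tokens")
# 100% -> $3.33, 50% -> $6.67, 25% -> $13.33
```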


The Token Cost Hierarchy: Cheapest to Most Expensive

Ranked by input-token price, cheapest to most expensive, across the budget and flagship tiers:

  1. Nova Micro — $0.035 ($0.14 output)
  2. GPT-5-nano — $0.05 ($0.40 output)
  3. Gemini 3.0 Flash-Lite — $0.05 ($0.20 output)
  4. Nova Lite — $0.06 ($0.24 output)
  5. Gemini 3.0 Flash — $0.10 ($0.40 output)
  6. GPT-5-mini — $0.25 ($2.00 output)
  7. Claude Haiku 4.5 — $0.80 ($4.00 output)
  8. GPT-5 — $1.25 ($10.00 output)
  9. Gemini 3.0 Pro — $1.25 ($10.00 output)
  10. Nova Premier — $2.50 ($12.50 output)
  11. Claude Sonnet 4.6 — $3.00 ($15.00 output)
  12. Claude Opus 4.7 — $15.00 ($75.00 output)
  13. GPT-5 Pro — $15.00 ($120.00 output)

The 430x spread between Nova Micro and GPT-5 Pro is not arbitrary. Each tier represents a real capability difference. The mistake teams make is assuming they need the most expensive model when 80% of their workload could run on a model 10-30x cheaper without measurable accuracy loss.


Real-World Cost Modeling: Three Scale Tiers

We modeled the cost of running an identical RAG-powered customer support chatbot across these providers. The workload assumptions (see the calculator sketch after this list):

  • 60% input tokens (retrieved context + system prompt + user query)
  • 40% output tokens (response)
  • Average 4,000 input tokens, 800 output tokens per conversation
  • Real-time (no batch discount available for the customer-facing path)
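So you can reproduce the numbers below, here is a minimal sketch of the blended-cost arithmetic. Prices come from the tables above; the dict keys are shorthand for this article, not official API model IDs:

```python
# Blended monthly cost = (input tokens x input rate + output tokens x
# output rate), with rates quoted per 1M tokens.
PRICES = {  # (input $/1M, output $/1M), from the pricing tables above
    "gpt-5":            (1.25, 10.00),
    "gpt-5-mini":       (0.25, 2.00),
    "claude-haiku-4.5": (0.80, 4.00),
    "gemini-3.0-flash": (0.10, 0.40),
}

def monthly_cost(model: str, total_tokens: float,
                 input_share: float = 0.60) -> float:
    in_price, out_price = PRICES[model]
    input_tokens = total_tokens * input_share
    output_tokens = total_tokens * (1 - input_share)
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000

for m in PRICES:
    print(f"{m}: ${monthly_cost(m, 100_000_000):,.2f}/month at 100M tokens")
# gpt-5: $475.00, gpt-5-mini: $95.00,
# claude-haiku-4.5: $208.00, gemini-3.0-flash: $22.00
```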

Tier 1: Startup Scale (1 Million Tokens/Month)

This is roughly 200 conversations per month on a customer support bot.

| Provider/Model | Monthly Cost | Notes |
|---|---|---|
| Nova Lite | $0.13 | Lowest cost, but limited reasoning |
| Gemini 3.0 Flash | $0.22 | Good for high-volume Q&A |
| GPT-5-mini | $0.95 | Fast, well-rounded |
| Claude Haiku 4.5 | $2.08 | Best at coding/agentic tasks |
| GPT-5 | $4.75 | Overkill for simple support |
| Claude Sonnet 4.6 | $7.80 | Overkill for simple support |
| Claude Opus 4.7 | $39.00 | Wasted money at this scale |

Verdict at startup scale: Cost differences are tiny in absolute terms. Pick based on accuracy/quality, not cost. Most startups should default to Claude Haiku 4.5 or GPT-5-mini for the best quality/cost balance.

Tier 2: Growth Scale (100 Million Tokens/Month)

This is roughly 20,000 conversations per month, typical for a SaaS app with active AI features.

| Provider/Model | Monthly Cost | Annual Cost |
|---|---|---|
| Nova Lite | $13 | $156 |
| Gemini 3.0 Flash | $22 | $264 |
| GPT-5-mini | $95 | $1,140 |
| Claude Haiku 4.5 | $208 | $2,496 |
| GPT-5 | $475 | $5,700 |
| Claude Sonnet 4.6 | $780 | $9,360 |
| Claude Opus 4.7 | $3,900 | $46,800 |

Verdict at growth scale: The cost gap becomes meaningful. Picking GPT-5 over Claude Haiku 4.5 costs an extra $3,200/year for the same conversations. If accuracy on those workloads is similar, that is pure waste.

Tier 3: Scale (1 Billion Tokens/Month)

This is roughly 200,000 conversations per month, typical for an established AI-native product.

| Provider/Model | Monthly Cost | Annual Cost |
|---|---|---|
| Nova Micro | $77 | $924 |
| Nova Lite | $132 | $1,584 |
| Gemini 3.0 Flash | $220 | $2,640 |
| GPT-5-mini | $950 | $11,400 |
| Claude Haiku 4.5 | $2,080 | $24,960 |
| Self-hosted Llama 4 (75% util) | $4,200 | $50,400 |
| GPT-5 | $4,750 | $57,000 |
| Claude Sonnet 4.6 | $7,800 | $93,600 |
| Claude Opus 4.7 | $39,000 | $468,000 |

Verdict at scale: Picking the wrong provider becomes a six-figure annual mistake. The gap between Gemini 3.0 Flash and Claude Opus is $465,360 per year for what may be a barely measurable accuracy improvement.


The Decision Framework: 5 Questions to Pick the Right Provider

Use this framework, in order, to narrow your provider choice. Skip questions that do not apply to your workload.

Question 1: What is your actual accuracy requirement?

  • Mission-critical (medical, legal, financial advice): Claude Opus 4.7 or GPT-5. The accuracy gap matters. Cost is secondary.
  • Quality matters but is not life-or-death (customer support, code review, content generation): Claude Sonnet 4.6, GPT-5, or Gemini 3.0 Pro. Test all three on your specific task.
  • Volume processing where 95% accuracy is acceptable (classification, extraction, summarization): Claude Haiku 4.5, GPT-5-mini, or Gemini 3.0 Flash. Always prefer the cheaper tier here.
  • Pure throughput tasks (sentiment analysis, language detection, keyword extraction): Nova Lite or Gemini 3.0 Flash.

Question 2: What is your latency requirement?

  • Sub-second responses required: Gemini 3.0 Flash, GPT-5-mini, Claude Haiku 4.5. Avoid flagship models.
  • 2-5 seconds is acceptable: Any flagship model works.
  • Async/batch (5+ seconds OK): Use the Batch API for a 50% cost cut. This is the single biggest LLM cost optimization most teams miss.

Question 3: What is your token volume?

  • Under 10M tokens/month: Cost differences are noise. Pick on quality.
  • 10M-500M tokens/month: Provider choice matters significantly. Match tier to workload accuracy needs.
  • 500M-5B tokens/month: Mix multiple providers. Use the cheapest tier that meets accuracy on each task type.
  • Over 5B tokens/month: Self-hosting starts to make sense for stable, high-volume tasks. Keep API providers for spiky/variable traffic.

Question 4: What is your context window need?

  • Under 8K tokens (most chat): Any provider. Smaller context = lower cost.
  • 8K-128K (RAG, agents, document Q&A): GPT-5, Claude Haiku 4.5, or Gemini Flash all work.
  • 128K-200K (long document analysis): Claude family or Gemini.
  • Over 200K (massive document corpora, codebase analysis): Gemini 3.0 Pro (2M context) is the standout option.

Question 5: What are your compliance/data residency requirements?

  • No specific requirements: Direct API providers (cheapest, fewest hops).
  • HIPAA, SOC 2, FedRAMP needed: AWS Bedrock or Azure OpenAI. Marginal cost premium, big compliance benefit.
  • Data must stay in EU: Anthropic EU endpoints, Azure OpenAI EU regions, or Bedrock in eu-* regions.
  • Air-gapped environments: Self-host Llama, DeepSeek, or Mistral. No API option works.

Self-Hosting: When the Math Actually Works

We get asked this constantly: "Should we self-host Llama instead of paying OpenAI/Anthropic?" Most of the time the answer is no, because teams underestimate the true total cost.

True Cost of Self-Hosting (Often Hidden)

| Cost Item | Annual Cost (Estimate) |
|---|---|
| 4x A100 GPU instance (24/7, 70% utilization) | $80,000-$120,000 |
| Idle GPU time during low traffic | $10,000-$25,000 |
| ML engineer dedicated to inference (0.5 FTE) | $90,000-$130,000 |
| Observability stack (vLLM metrics, tracing) | $5,000-$15,000 |
| Model weight storage and rotation | $1,000-$3,000 |
| Failover/redundancy GPUs | $40,000-$80,000 |
| **Total** | **$226,000-$373,000/year** |

For that money, you could buy roughly 280-470 billion Claude Haiku 4.5 input tokens (at $0.80/1M) or 2.3-3.7 trillion Gemini 3.0 Flash input tokens (at $0.10/1M). Most teams do not have anywhere near that volume.

Self-Hosting Breakeven Calculation

The breakeven point for self-hosting Llama 4 versus Claude Haiku 4.5 on equivalent volume is:

Approximately 500 million tokens per day, at 70%+ GPU utilization, sustained 365 days/year.

Below that threshold, you almost always lose money self-hosting. Above it, self-hosting can save 40-70% if you keep utilization high.
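The arithmetic behind that threshold, as a sketch (using the upper end of the hidden-cost table above and the same 60/40 token split as the workload model):

```python
# Breakeven: annual self-hosting cost vs. Claude Haiku 4.5 API rates on
# the same volume. Numbers come from the tables above.
SELF_HOST_ANNUAL = 373_000                    # upper end of $226K-$373K
HAIKU_BLENDED = 0.60 * 0.80 + 0.40 * 4.00     # $2.08 per 1M tokens

breakeven_tokens_per_day = SELF_HOST_ANNUAL / (365 * HAIKU_BLENDED) * 1_000_000
print(f"{breakeven_tokens_per_day / 1e6:.0f}M tokens/day")  # ~491M, i.e. ~500M
```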

When Self-Hosting Does Make Sense

  • Air-gapped/regulated environments where API access is forbidden
  • Custom fine-tuning beyond what API providers support (rare in 2026 since most support fine-tuning)
  • Predictable, sustained, high-volume workloads at 500M+ tokens/day
  • Latency-critical edge inference where GPU colocation matters
  • Specific open-source model capabilities (e.g., specialized vision models, niche reasoning models)

For everything else, API providers are cheaper, simpler, and faster to scale.


The Batch API: 50% Off, Used by 23% of Teams

OpenAI, Anthropic, and Google all offer Batch APIs that process requests asynchronously within 24 hours and return a 50% discount on both input and output tokens.

Workloads That Are Almost Always Batchable

  • Data enrichment pipelines (extract structured fields from documents)
  • Content moderation queues (review user-generated content)
  • Evaluation pipelines (run tests against new model versions)
  • Embedding generation (process new documents into vector databases)
  • Translation queues (translate large content libraries)
  • Summarization backlogs (process old transcripts/articles)
  • Classification runs (tag historical data)
  • Code analysis (lint codebases, generate documentation)

If your workload does not need a response within seconds, it should be on the Batch API. We see only about 23% of teams using batch processing despite 60%+ of their workloads being batchable. This is the single biggest LLM cost optimization that does not require switching providers.
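The mechanics are lighter than most teams expect. A minimal sketch of OpenAI's documented batch flow (Anthropic and Google offer equivalents); the model name follows this article's pricing tables, and the documents are placeholders:

```python
import json
from openai import OpenAI

client = OpenAI()

# 1. Write one request per line in the batch JSONL format.
with open("batch_input.jsonl", "w") as f:
    for i, doc in enumerate(["doc one ...", "doc two ..."]):
        f.write(json.dumps({
            "custom_id": f"summarize-{i}",
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": "gpt-5-mini",  # name per this article's tables
                "messages": [{"role": "user", "content": f"Summarize: {doc}"}],
                "max_tokens": 200,
            },
        }) + "\n")

# 2. Upload the file and create the batch; results arrive within 24h
#    at 50% of the real-time token price.
batch_file = client.files.create(file=open("batch_input.jsonl", "rb"),
                                 purpose="batch")
batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)
print(batch.id, batch.status)
```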

Batch vs Real-Time Cost Example

For 100M tokens/month split 60/40 input/output:

| Mode | GPT-5 Cost | Claude Sonnet 4.6 Cost |
|---|---|---|
| Real-time | $475/month | $780/month |
| Batch (50% off) | $238/month | $390/month |
| Annual savings | $2,844 | $4,680 |

For free. Just by routing async workloads to a different endpoint.


Hidden Cost Traps Most Teams Fall Into

Trap 1: Paying Output Token Cost on Verbose Models

Some models default to longer responses; Claude Opus and Sonnet are particularly verbose out of the box. Verbosity is a bill multiplier: every "Sure, I can help with that..." preamble is output tokens you pay for.

Fix: Add explicit instructions in your system prompt: "Respond concisely. Do not include preambles, acknowledgments, or summaries unless asked."

Trap 2: Re-Sending Massive Context on Every Call

If you are sending 50K tokens of system prompt + retrieval context with every user query, you are paying input cost on those 50K tokens repeatedly. Multiply by 1000 conversations and you have just paid for 50M tokens of repeated input.

Fix: Use Prompt Caching (Anthropic, OpenAI, and Bedrock all support it). Cached input tokens cost 10-25% of normal input cost. For RAG with stable retrieval contexts, this can cut input costs by 75%.
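A minimal sketch using Anthropic's cache_control blocks, the mechanism this fix refers to (OpenAI applies caching automatically to repeated prompt prefixes; Bedrock exposes cache checkpoints). The model name follows this article's tables, not an official API ID:

```python
import anthropic

client = anthropic.Anthropic()

STABLE_CONTEXT = "..."  # e.g. 50K tokens of system prompt + retrieval context

response = client.messages.create(
    model="claude-haiku-4.5",  # name per this article's tables
    max_tokens=800,
    system=[
        {
            "type": "text",
            "text": STABLE_CONTEXT,
            # Mark the stable prefix cacheable: subsequent calls that
            # reuse it pay the cached-input rate instead of full price.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "Where do I rotate my API keys?"}],
)
print(response.content[0].text)
```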

Trap 3: Using Flagship Models for Routing/Classification

Many agentic systems use a flagship model (Claude Opus 4.7, GPT-5) for the initial routing/intent classification step. This is wildly overkill. A workhorse model can route with 99%+ accuracy at 10-30x lower cost.

Fix: Build a tier system. Use Gemini 3.0 Flash or Claude Haiku 4.5 for routing/classification. Only escalate to the flagship model for the final reasoning step.
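A sketch of that tier system, with a hypothetical call_llm helper standing in for your provider client (model names follow this article's tables):

```python
ROUTER_MODEL = "claude-haiku-4.5"    # cheap: ~$0.80/1M input
FLAGSHIP_MODEL = "claude-opus-4.7"   # expensive: ~$15.00/1M input

def call_llm(model: str, prompt: str) -> str:
    raise NotImplementedError("wire up your provider client here")

def handle(query: str) -> str:
    # Cheap model does the routing/classification step.
    intent = call_llm(
        ROUTER_MODEL,
        f"Classify this request as SIMPLE or COMPLEX. Reply with one word.\n\n{query}",
    ).strip().upper()
    # Escalate only the hard cases; everything else stays on the cheap tier.
    model = FLAGSHIP_MODEL if intent == "COMPLEX" else ROUTER_MODEL
    return call_llm(model, query)
```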

Trap 4: Not Tracking Cost Per Customer

Most teams know their total LLM bill. Few know which customers, features, or workflows drive that bill. Without per-customer cost tracking, you cannot identify abusive usage, optimize the worst offenders, or build accurate unit economics.

Fix: Tag every LLM call with a customer ID, feature ID, and workflow ID. Aggregate weekly. The 80/20 rule almost always applies: 20% of customers/features drive 80% of cost. Optimize those first.
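A minimal sketch of the tagging side, with hypothetical IDs and an in-memory store standing in for your real metrics sink:

```python
from collections import defaultdict

PRICES = {"gpt-5-mini": (0.25, 2.00)}   # (input, output) in $ per 1M tokens
usage = defaultdict(float)              # (customer_id, feature_id) -> dollars

def record_llm_cost(customer_id: str, feature_id: str, model: str,
                    input_tokens: int, output_tokens: int) -> None:
    in_price, out_price = PRICES[model]
    cost = (input_tokens * in_price + output_tokens * out_price) / 1_000_000
    usage[(customer_id, feature_id)] += cost

# Call this after every completion, using the token counts the API returns.
record_llm_cost("cust-042", "support-bot", "gpt-5-mini", 4_000, 800)

# Weekly: rank and attack the top of the list (the 80/20 almost always holds).
top_10 = sorted(usage.items(), key=lambda kv: kv[1], reverse=True)[:10]
print(top_10)
```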

Trap 5: Ignoring Rate Limit Headers

When you hit rate limits, retries pile up, queues back up, and latency degrades. Rejected 429 requests are typically not billed, but the retry storms burn infrastructure time, inflate tail latency, and often push teams onto a more expensive provider tier just to absorb the load.

Fix: Implement exponential backoff with jitter, monitor your rate limit headers, and proactively distribute traffic across multiple providers/regions before you hit hard limits.
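A minimal retry sketch with exponential backoff and full jitter; RateLimitError is a stand-in for your SDK's rate-limit exception, and in practice you should honor any Retry-After or rate-limit headers first:

```python
import random
import time

class RateLimitError(Exception):
    """Placeholder for your SDK's 429 exception."""

def call_with_backoff(call_api, max_retries: int = 6):
    for attempt in range(max_retries):
        try:
            return call_api()
        except RateLimitError:
            # Full jitter: sleep a random amount up to 2^attempt seconds
            # (capped), so retries from many workers do not synchronize.
            time.sleep(random.uniform(0, min(60, 2 ** attempt)))
    raise RuntimeError("still rate-limited after all retries")
```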


A 30-Day LLM Cost Optimization Playbook

If your LLM bill is over $5,000/month, run this playbook. We typically see 40-70% savings within the first month.

Week 1: Visibility

  1. Tag every LLM call with customer/feature/workflow metadata
  2. Set up cost dashboards by model, by feature, by team
  3. Identify your top 10 cost drivers (which are almost certainly 80% of your bill)
  4. Audit which models each feature uses today

Week 2: Quick Wins

  1. Move all batchable workloads to Batch API (50% off, no other changes)
  2. Enable prompt caching where stable system prompts exist (75% savings on cached portions)
  3. Add concise-response instructions to verbose models (cuts output tokens 20-40%)
  4. Cap max_tokens on every API call (prevents runaway responses)

Week 3: Tier Migrations

  1. Identify which features actually need flagship-tier accuracy (usually 20-30% of calls)
  2. Migrate routing/classification/extraction to Haiku 4.5, GPT-5-mini, or Gemini Flash
  3. Run quality benchmarks on the migrations to confirm accuracy is maintained
  4. Lock in the cheaper model in production

Week 4: Architectural Optimization

  1. Build a tier system (cheap model for routing, expensive for reasoning)
  2. Implement aggressive prompt compression (remove redundancy, use abbreviations)
  3. Cache stable retrieval contexts at the application layer
  4. Set up cost guardrails and alerts (per-customer rate limits, daily spend caps; a minimal sketch follows)
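For the guardrail in step 4, a hypothetical sketch; the cap value, the spend store, and the escalation policy are all placeholders you would tune:

```python
# Refuse (or downgrade) LLM calls once a customer crosses a daily cap.
# spend_today is whatever counter backs the Week 1 tagging (Redis, a DB
# table, a metrics store).
DAILY_CAP_USD = 25.00

def enforce_spend_cap(customer_id: str, spend_today: dict) -> None:
    if spend_today.get(customer_id, 0.0) >= DAILY_CAP_USD:
        raise RuntimeError(
            f"{customer_id} hit the ${DAILY_CAP_USD:.2f}/day LLM cap; "
            "route to the batch queue or a cheaper model tier."
        )
```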

When To Use Each Provider (Cheat Sheet)

| Workload | Best Provider | Why |
|---|---|---|
| Customer support chatbot (high volume) | Claude Haiku 4.5 or GPT-5-mini | Cheap, accurate, fast |
| Coding assistant / agentic IDE | Claude Sonnet 4.6 or Opus 4.7 | Best at code + tool use |
| Long document Q&A (200K+ tokens) | Gemini 3.0 Pro | Largest context, cheap input |
| RAG over docs/wiki | Claude Haiku 4.5 with prompt caching | Cached input is 10x cheaper |
| Content moderation (batch) | Gemini 3.0 Flash via Batch API | Cheapest tier + 50% batch discount |
| Data extraction pipelines | DeepSeek V3.1 on Bedrock | Strong accuracy at budget-tier pricing |
| Real-time translation | Gemini 3.0 Flash | Sub-second latency, low cost |
| Mission-critical reasoning | Claude Opus 4.7 | Highest accuracy, accept the cost |
| Air-gapped/regulated | Self-hosted Llama 4 or Mistral Large 3 | Only option without API access |
| Multimodal (vision + text) | GPT-5 or Gemini 3.0 Pro | Best vision capabilities |

The Bottom Line

LLM API costs in 2026 vary by more than 400x between the cheapest and most expensive options. The choice of provider/model is not a one-time decision; it is a continuously optimizable cost lever, and the teams that ignore it pay 3-7x more than necessary for the same outcome.

The single biggest action item: if you are not using Batch APIs and prompt caching, you are leaving 50-75% savings on the table for free. Start there before you change providers.

If your LLM bill is growing faster than your AI features and you are not sure where the leak is, our cloud cost optimization team runs a free LLM cost audit. We typically find 40-60% savings within 30 days, and we have benchmarked production workloads across all five providers above. Run a free Cloud Waste Scorecard to see where your AI infrastructure leaks are.

