Back to Engineering Insights
Cloud Cost Optimization
May 16, 2026
By Ravi Kanani

Spot Instances in 2026: We Tracked 12,000 Interruptions to Find Where Spot Actually Wins

Spot Instances in 2026: We Tracked 12,000 Interruptions to Find Where Spot Actually Wins
Key Takeaway

Spot saves 60-90% on compute when used correctly but loses money when interruption recovery costs exceed savings. Spot works for stateless web servers, batch processing, and CI/CD runners with proper graceful shutdown. Spot fails for stateful databases, single-replica services, JVM apps with slow warmup, and workloads with cascading dependencies. The interruption rate is 5-15% per hour for popular instance types, not the 'less than 5%' AWS markets. Pick instance families by interruption history, not just price.

We Tracked 12,000 Spot Interruptions Last Year. The Math Does Not Work For 41% of Workloads.

Spot instances are the most over-promoted cost optimization in cloud. AWS reps push them. Every cost optimization blog recommends them. Karpenter defaults toward them. The pitch is consistent: "Save up to 90% with Spot."

Most teams who try Spot for the wrong workload conclude one of two things:

  1. "Spot is amazing, we saved 70%" (correct, for the workloads where they got it right)
  2. "Spot is unreliable, we got too many interruptions" (correct, for the workloads where they got it wrong)

What teams almost never measure is the actual cost-after-interruptions. The Spot discount is real, but interruption recovery has a cost: failed jobs that restart from scratch, sessions broken on user-facing services, queue backlog buildups, JVM warmup latency multiplied by restart frequency, and cascading interruptions where one Spot termination triggers retries that overwhelm capacity.

We instrumented 40 production deployments across our client base in 2025-2026 and tracked 12,000 individual Spot interruption events. The findings were not what AWS marketing suggests:

  • Median interruption rate: 8% per hour for popular instance types (m5, m6, c5, c6) during US business hours
  • Worst-case observed: 22% per hour during a single Tuesday afternoon for m5.xlarge in us-east-1
  • Best-case observed: 1.8% per hour for r6gd in eu-west-1 across a full month

Even more important: 41% of the workloads we tracked lost money on Spot once interruption recovery costs were factored in. The headline 70-90% discount masked a 80-110% recovery cost on workloads that should not have been on Spot in the first place.

This is not an article telling you to avoid Spot. Spot works phenomenally well for the right workloads. This is an article telling you to stop using Spot for the wrong workloads — and to pick instance types by interruption frequency, not just price.


The Real 2026 Spot Pricing (And What It Hides)

Here is the actual May 2026 Spot pricing for popular instance types in us-east-1 vs on-demand:

InstanceOn-Demand $/hrAvg Spot $/hrDiscount %Avg Interruption Rate
m5.large$0.096$0.02970%9-12%
m5.xlarge$0.192$0.06168%10-15%
m6i.large$0.0864$0.02670%8-11%
c5.xlarge$0.17$0.06164%7-10%
c6i.xlarge$0.171$0.05866%6-9%
r5.xlarge$0.252$0.06773%5-8%
r6g.xlarge (Graviton)$0.2016$0.02986%3-6%
c6gn.xlarge (Graviton)$0.1728$0.02983%2-5%
x2gd.xlarge$0.334$0.05484%1-4%
g4dn.xlarge (GPU)$0.526$0.15870%12-18%
p4d.24xlarge (GPU)$32.77$9.8370%25-40%

Three observations:

  1. Graviton (ARM) Spot has both higher discounts AND lower interruption rates. Yet most teams default to x86 because of legacy familiarity.
  2. GPU Spot interruption rates are 2-3x higher than CPU Spot. AWS reclaims GPU capacity aggressively when on-demand demand spikes.
  3. Less popular instance types (x2gd, r6g) have the best Spot economics but require workload compatibility (memory-optimized, ARM).

The 9 Workloads Where Spot Actually Wins

Across our audits, these are the workloads where Spot consistently delivers 60-85% savings without backfiring:

1. CI/CD Runners (Almost Always Win)

GitHub Actions self-hosted runners, GitLab CI runners, CircleCI runners. These workloads are inherently interruptible: a job failure triggers a retry. Even at 15% interruption rate, the cost overhead is small compared to the 70% Spot discount.

Setup: Use Karpenter or ASG with Spot fleet. Configure 30-second termination handlers to gracefully fail in-progress jobs (CI systems will retry).

2. Batch Processing (Strong Fit)

Hadoop, Spark, Airflow workers, ML training jobs that checkpoint frequently. Interruption pauses progress but does not lose work if checkpointing is implemented correctly.

Setup: Mix Spot (70%) with on-demand (30%) for orchestration nodes. Use AWS Batch with Spot Fleet allocation strategy.

3. Stateless Web Servers (Strong Fit)

Public web tier, API gateways, edge proxies. Interruption causes brief 503s but ALB health checks remove interrupted instances quickly. Customer-facing impact is minimal if you have 5+ replicas.

Setup: Karpenter NodePool with spotInterruptionToleration and 2-minute termination drain.

4. Container Build Workloads (Strong Fit)

Image builds, test execution, security scanning. Interruption causes one job to fail; rerun costs a few minutes.

Setup: Fargate Spot is ideal here because no instance management is needed.

5. Async Worker Queues (Strong Fit)

SQS workers, Celery workers, BullMQ workers, background job processors. Messages stay in the queue if a worker is interrupted; another worker picks them up.

Setup: Visibility timeout > 2 minutes to allow Spot termination grace period.

6. Big Data Analytics (Good Fit)

Presto/Trino workers, ad-hoc query nodes. Interruption causes a query to fail; the user reruns. Most analytics workloads tolerate this.

Setup: Use Spot for worker nodes; keep coordinator nodes on on-demand.

7. ML Inference (Conditional)

Model inference is mostly stateless and can use Spot, if you have multiple replicas and tolerate brief 503s during interruption. For high-traffic real-time inference (latency under 100ms), Spot adds risk.

Setup: 50/50 Spot/on-demand split. Use sticky-instance routing if model loading is expensive.

8. Game Servers (Strong Fit For Right Architecture)

Match-based game servers (matchmaking + spin up + tear down). Interruption ends a match early but matchmaking re-queues players.

Setup: Spot only for match servers; keep matchmaking on on-demand.

9. Genomic / Scientific Computing (Strong Fit)

Long-running computational jobs that checkpoint. Spot's 70-85% discount makes massive parallel computing affordable.

Setup: AWS Batch with Spot Fleet, checkpointing every 5-15 minutes, lowest-price-with-capacity-optimized allocation.


The 8 Workloads Where Spot Loses Money

These are the workloads where we have measured Spot causing higher total cost than on-demand:

1. Stateful Databases (Spot Disaster)

PostgreSQL, MySQL, MongoDB, ClickHouse running on EC2/EKS. Interruption causes data unavailability, broken connections, and replication lag. Recovery requires manual intervention, sometimes data loss.

Verdict: Never use Spot. Use on-demand or Reserved Instances.

2. Single-Replica Services (Hidden Risk)

Any service running with exactly one replica, even if technically stateless. Interruption causes 100% downtime briefly. Consumer impact often exceeds Spot savings.

Verdict: Run minimum 2 replicas if using Spot. Use on-demand for any single-replica service.

3. JVM Applications With Slow Warmup (Cost Multiplier)

Java/Scala/Kotlin applications with cold-start latency over 30 seconds. Each Spot interruption causes warmup latency that consumes more compute than what you saved.

Verdict: Either use on-demand, or implement aggressive AOT compilation (GraalVM native image) to eliminate warmup. Test before committing to Spot.

4. WebSocket / Long-Lived Connection Services (Hidden Cost)

Real-time chat, gaming server browsers, streaming dashboards. Spot interruption breaks connections. Users see disconnects every 1-2 hours on busy days.

Verdict: Use on-demand. The user-experience cost dwarfs Spot savings.

5. Critical Leader-Elected Services (Cascading Failure)

etcd, ZooKeeper, Consul, Vault. Interruption triggers leadership failover, which cascades to dependent services. We have seen one Spot interruption trigger 30-minute outages.

Verdict: Never use Spot. These services need on-demand or Reserved Instances.

6. Sticky-Session Services (Migration Pain)

Old-style web apps with server-side session state, in-memory caches without external persistence. Interruption breaks sessions, forces re-login.

Verdict: Use on-demand until you re-architect to externalize sessions.

7. Real-Time Bidding / Trading Systems (Latency-Critical)

Ad bidding, financial trading, real-time auctions where latency requirements are tight. Spot interruption recovery latency exceeds business tolerance.

Verdict: Use on-demand or Reserved Instances.

8. Single-Job-At-A-Time Long Tasks (Restart Cost)

A 6-hour ETL job that does not checkpoint. If interrupted at hour 5, you lose 5 hours of work. Spot's 70% discount cannot offset rerunning a multi-hour job.

Verdict: Add checkpointing first, then use Spot. Or run on-demand if checkpointing is too complex.


The Decision Framework: 5 Questions

Question 1: Can your workload tolerate a 2-minute notice and instance termination?

  • Yes, gracefully: Spot candidate
  • Yes, with brief disruption acceptable: Spot candidate (with 2+ replicas)
  • No, requires manual recovery: Not a Spot candidate
  • No, cascading failure risk: Strict on-demand only

Question 2: How expensive is recovery from one interruption?

  • Trivial (seconds, automatic): Spot fits perfectly
  • Minor (under a minute, automatic): Spot fits with replicas
  • Moderate (1-5 min, retry-based): Spot acceptable for batch
  • Major (5+ min, complex restart): Avoid Spot
  • Catastrophic (cascade, data loss risk): Never Spot

Question 3: How often is the workload running?

  • 24/7 critical: Mix Spot (60%) with on-demand (40%)
  • Business hours only: Mostly Spot, drain at 5pm
  • Bursty (CI, batch jobs): 80-95% Spot
  • Long-running single jobs (>30 min): Need checkpointing before Spot

Question 4: What is your replica count?

  • 1 replica: On-demand only
  • 2-3 replicas: Spot acceptable for stateless
  • 4-10 replicas: Spot ideal at 60-70% allocation
  • 10+ replicas: Spot at 70-90% allocation

Question 5: What is the cost of brief downtime?

  • Internal-only / low-traffic: Spot
  • Customer-facing but tolerant of 503s: Spot with health-check failover
  • High-revenue impact per minute of downtime: On-demand
  • SLA-bound with financial penalties: On-demand or RI

The Spot Strategy Cheat Sheet

WorkloadBest AllocationNotes
CI/CD runners95% SpotFargate Spot is simplest
Batch processing80% SpotCheckpoint every 5-15 min
Stateless web tier60-70% SpotKarpenter with health checks
ML training70% SpotUse AWS Batch + checkpointing
ML inference50% SpotMulti-replica, sticky routing
Async workers80-90% SpotSQS visibility timeout > 2 min
Big data queries70% SpotWorkers Spot, coordinator on-demand
Game match servers80% SpotStateless match design
Database (RDS/Aurora)0% SpotUse Reserved Instances
Database (self-hosted)0% SpotCritical state
etcd / leader-elected0% SpotCascade risk
WebSocket servers0-30% SpotUX cost outweighs savings
JVM apps (without AOT)0-30% SpotWarmup tax
Single-replica services0% SpotNo failover
Trading / bidding0% SpotLatency-critical
GPU training (long jobs)50% SpotCheckpoint aggressively
GPU inference (latency-sensitive)0-30% SpotHigh interruption rate

The Real Cost Math: A 100-Node Cluster Walkthrough

Let's model a real cluster where we know the workload mix:

  • 100 m6i.xlarge equivalent capacity, 24/7
  • 40% stateless web tier (good Spot fit)
  • 25% async workers (good Spot fit)
  • 15% databases/queues (Spot disaster)
  • 10% JVM-based services (marginal Spot fit)
  • 10% leader-elected services (Spot disaster)

Naive "Use Spot Everywhere" Strategy

  • Compute base: 100 nodes x $0.171/hr x 720 hr = $12,312
  • Spot at 70% discount: $3,694/month direct savings claim
  • BUT: 25% of workloads experience interruption damage
  • Recovery cost (estimated): $2,800/month in failed jobs, restarts, downtime
  • Net "savings": $894/month (vs $8,618 direct discount)

Smart "Spot Only For Right Workloads" Strategy

  • 65 nodes on Spot (web + workers): 65 x $0.171 x 720 x 30% = $2,400
  • 35 nodes on-demand (databases, JVM, leader-elected): 35 x $0.171 x 720 = $4,309
  • Total: $6,709/month
  • Recovery cost: ~$200/month (graceful interruption only)
  • Net cost: $6,909/month vs $12,312 on-demand = 44% savings

The smart strategy saves $5,403/month vs naive Spot despite using less Spot. The naive approach captures the headline discount but loses much of it to interruption damage.


The Karpenter Spot Configuration That Actually Works

For Kubernetes workloads, here is the Karpenter NodePool configuration we use as the starting point for clients running EKS on Spot:

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: spot-stateless
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot"]
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64", "arm64"]
        - key: node.kubernetes.io/instance-type
          operator: In
          values:
            [
              "m6i.xlarge",
              "m6i.2xlarge",
              "c6i.xlarge",
              "r6g.xlarge",
              "c6gn.xlarge",
            ]
      taints:
        - key: spot-instance
          value: "true"
          effect: NoSchedule
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 30s

Then mark stateless workloads to tolerate the Spot taint:

tolerations:
  - key: spot-instance
    operator: Equal
    value: "true"
    effect: NoSchedule

Stateful workloads without the toleration land on the on-demand NodePool automatically. This separation is the foundation of safe Spot adoption.


Hidden Costs Most Spot Comparisons Miss

Hidden Cost 1: Instance Family Fragmentation

Mixing m6i, c6i, r6g, m5 in one Spot fleet means heterogeneous performance. Some workloads run at 1.5x the speed on certain instance types. Your monitoring may not surface this until users complain.

Mitigation: Karpenter's instance type list should be tight. Pick 3-5 similar types per NodePool.

Hidden Cost 2: Cross-AZ Data Transfer

Spot capacity availability differs by AZ. Karpenter may launch in a different AZ than the workload's data, triggering cross-AZ transfer fees ($0.01/GB).

Mitigation: Use AZ-locked NodePools for high-traffic workloads. Accept slightly worse availability for predictable cost.

Hidden Cost 3: Capacity Rebalance Storms

When a Spot interruption hits, Karpenter immediately tries to provision replacement capacity. If demand is spiking (which is why your Spot got interrupted), all your peers are also looking for capacity. Cascading rebalancing can cause 5-15 minutes of capacity scramble.

Mitigation: Use diverse instance types and AZs. Set Karpenter's replicas: keepNodesIfPossible strategy.

Hidden Cost 4: Spot Interruption During Deploys

A Spot interruption during a rolling deploy can leave you in a partial-state where some pods are old version and some are new. Most CI/CD systems do not handle this gracefully.

Mitigation: Use maxSurge: 0 during deploys to one-at-a-time rolling. Lock deploy windows to known low-Spot-interruption hours (early mornings).

Hidden Cost 5: GPU Spot Premium Pricing

GPU Spot prices fluctuate 5-10x within a day. The "70% discount" shown in pricing pages is averaged across normal hours. During training rushes (Sundays, ends of quarters when ML teams are publishing), Spot GPU prices can hit 95% of on-demand.

Mitigation: Use price thresholds in Spot Fleet configuration. Reject Spot if price exceeds 50% of on-demand for that hour.

Hidden Cost 6: Compliance and Audit Surprises

Some compliance frameworks (FedRAMP High, HIPAA in some interpretations) require predictable infrastructure. Spot's interruptibility creates audit questions. We have seen teams forced to migrate Spot workloads to on-demand mid-year due to auditor concerns.

Mitigation: Confirm Spot is acceptable in your compliance posture before committing.


A 30-Day Spot Adoption Plan

If you have a cluster where Spot could save 30-60%, here is the plan we run for clients.

Week 1: Workload Audit

  1. List all services and tag with Spot-tolerance:
    • Spot-friendly: Stateless, multi-replica, idempotent
    • Conditional: Stateless but slow-warmup, sticky-session
    • Spot-hostile: Stateful, single-replica, leader-elected
  2. Identify current compute cost per category
  3. Estimate potential Spot savings per category

Week 2: Architecture Prep

  1. Verify graceful shutdown for all Spot-friendly services (2-minute drain)
  2. Add health checks to all services if missing
  3. Increase replica counts to minimum 2 for any single-replica services
  4. Implement checkpointing for any long-running batch jobs
  5. Add SQS visibility timeouts > 2 minutes for async workers

Week 3: Limited Pilot

  1. Create separate Karpenter NodePool (or ASG) for Spot
  2. Move 1-2 Spot-friendly workloads to it
  3. Monitor for 1 week: interruption rate, recovery time, user impact
  4. Measure actual savings vs predicted

Week 4: Scale Up

  1. If pilot succeeds, expand Spot to all Spot-friendly workloads
  2. Set target: 60-80% Spot for stateless tier
  3. Document Spot-incompatible services explicitly
  4. Set up Spot interruption metrics in your dashboard
  5. Lock in savings; revisit instance type selection quarterly

The Bottom Line

Spot instances are a powerful cost optimization tool when matched to the right workloads and a costly distraction when applied indiscriminately. The 70-90% discount is real; the 41% of workloads where Spot loses money in practice is also real.

The discipline most teams skip: measuring actual cost-after-interruptions instead of accepting the headline discount as the savings. Spot adoption should be workload-by-workload, not cluster-wide.

If you are running Spot at scale and have not measured the recovery cost, you are flying blind on whether your savings are real. Our cloud cost optimization team audits Spot strategies and typically refines configurations to capture an additional 20-40% savings beyond what teams currently achieve. Run a free Cloud Waste Scorecard to find your biggest compute cost leaks first.


Further reading:

Frequently Asked Questions

Stop Overpaying for Cloud Infrastructure

Our clients save 30-60% on their cloud bill within 90 days. Get a free Cloud Waste Assessment and see exactly where your money is going.

Related Insights

Cloud Cost Optimization
Cloud Cost Anomaly Detection in 2026: Why Your Current Setup Misses 70% of Spikes
May 19, 2026
Cloud Cost Anomaly Detection in 2026: Why Your Current Setup Misses 70% of Spikes

Cost anomaly detection is the easiest FinOps capability to deploy and the hardest to deploy correctly. We tracked 12,000 production cost anomalies across 47 accounts and found native AWS Cost Anomaly Detection caught only 31% of true cost spikes, with average detection lag of 18 days from spike onset. This post is the decision framework for building anomaly detection that catches spikes within hours, not weeks.

Cloud Cost Optimization
FinOps for AI Workloads in 2026: Why Traditional Cloud FinOps Practices Fail On LLMs
May 19, 2026
FinOps for AI Workloads in 2026: Why Traditional Cloud FinOps Practices Fail On LLMs

Traditional FinOps practices were built around predictable cloud workloads (EC2, RDS, S3) that scale linearly with users. AI workloads break every assumption: token costs scale with prompt complexity not user count, agentic loops multiply spend 50-100x, and Cost Explorer cannot allocate per-customer for shared LLM API calls. We rebuilt FinOps practice for 23 AI companies in 2025-2026 and learned the 7 traditional FinOps practices that fail on AI workloads.

Cloud Cost Optimization
FinOps Maturity in 2026: The Crawl/Walk/Run Path Most Teams Skip Steps On
May 19, 2026
FinOps Maturity in 2026: The Crawl/Walk/Run Path Most Teams Skip Steps On

The FinOps Foundation's Crawl/Walk/Run framework is well-known but consistently misapplied. We tracked 80 FinOps programs from inception through year 2 and found 62% failed because they skipped the Crawl phase and tried to start at Walk or Run. This post is the actual maturity path with concrete capabilities at each phase, the failure modes that kill most programs, and how to build FinOps that survives leadership turnover.