Is AWS Cost Anomaly Detection free?

Yes, AWS Cost Anomaly Detection is completely free with no usage limits. You can create unlimited anomaly monitors and subscribe to alerts via email or SNS. The catch is detection delay: AWS Cost Anomaly Detection runs against billing data that updates 1-3 times per day, so anomalies are typically detected 6-24 hours after they start. For most teams, that's much better than discovering spikes on the monthly invoice but not fast enough for high-velocity workloads where 6 hours of runaway spend can cost thousands of dollars.

How fast can I detect cloud cost spikes in 2026?

Detection speed depends on which signal you monitor. Cloud billing data updates 1-3 times per day, so AWS Cost Anomaly Detection typically flags anomalies 6-24 hours after they start. CloudWatch metrics for compute (EC2 hours, Lambda invocations, request counts) update in minutes, allowing near-real-time inference of cost trends. Application-layer instrumentation (logging every LLM API call and every Lambda invocation with its cost) updates in seconds. The fastest setups combine all three: CloudWatch metrics for compute trends, application instrumentation for AI/external API spend, and native cost anomaly detection as the slow-but-comprehensive backstop.

Why does AWS Cost Anomaly Detection miss most spikes?

AWS Cost Anomaly Detection works at the service level (e.g., 'EC2 cost is unusually high') and uses 30-90 days of historical data to establish a baseline. It misses three common patterns: (1) new services that have no baseline yet, (2) spikes caused by combinations of services rather than one service spiking alone, and (3) spend on external SaaS APIs and AI providers that doesn't appear in AWS billing. Across 47 accounts we analyzed, AWS native detection caught 31% of true cost-impact anomalies. The other 69% required custom or third-party detection.

What is the difference between AWS Cost Anomaly Detection and Cost Categories?

AWS Cost Anomaly Detection identifies unusual spending patterns automatically using machine learning on your billing data. AWS Cost Categories let you create logical groupings of cost (e.g., 'Customer A includes these accounts and tags') for analysis and chargeback. They serve different purposes: Anomaly Detection finds spikes, Cost Categories organize attribution. Use them together: Anomaly Detection alerts on what's unusual, Cost Categories tell you which team or customer to charge.

Should I build custom cost alerts or use a tool like Vantage?

For monitoring critical service costs (anything where 6 hours of runaway would matter), custom CloudWatch metric alarms or Slack webhook alerts are cheap and fast (deploy in hours, monitor in minutes). For broad anomaly detection across hundreds of services and dimensions, third-party tools like Vantage ($5K-$60K/year) or Anodot ($25K+/year) provide ML-based detection that catches patterns custom alerts miss. The right architecture combines both: custom alerts for known-critical services, third-party tools for everything else. Most teams under $200K/month cloud spend can do well with native + custom Slack alerts.


Cloud Cost Optimization

May 19, 2026

By Ravi Kanani

Cloud Cost Anomaly Detection in 2026: Why Your Current Setup Misses 70% of Spikes

Key Takeaway

AWS Cost Anomaly Detection is free but slow (typically a 6-24 hour delay). GCP and Azure native alerts are even slower. Real-time anomaly detection requires custom alerting on the cloud billing API combined with application-layer instrumentation for AI/SaaS spend that doesn't appear in cloud bills. The right architecture uses native tools as the floor (cheap, broad coverage), specialty tools (Vantage, Finout, Anodot) for sophistication, and custom Slack/PagerDuty alerts for sub-hour visibility on critical workloads. Most teams stop at native, miss 70% of spikes, and discover them on monthly invoices.

We Tracked 12,000 Cost Anomalies. AWS's Native Tool Caught 31% of Them.

A growth-stage AI company we worked with in early 2026 had AWS Cost Anomaly Detection enabled. They had spent two days configuring it, set up email subscriptions, and felt confident they had cost monitoring "covered." The CTO had told the board they had real-time visibility into cloud spend.

Then their AWS bill jumped from $48,000 in February to $112,000 in March. AWS Cost Anomaly Detection had not sent a single alert. When we investigated, the spike was caused by a combination of: (1) increased Lambda invocations from a new feature, (2) higher S3 PUT operations from those Lambdas, (3) cross-region data transfer from poorly placed services, and (4) a separate $34K spike on Anthropic API that wasn't in AWS billing at all. None of the individual services had spiked enough to trigger AWS's threshold. The aggregate was a disaster.

This pattern is consistent across the 47 production AWS accounts we audited in 2025-2026: AWS Cost Anomaly Detection caught only 31% of true cost-impact anomalies. Average detection lag for the anomalies it did catch: 18 hours from spike onset. By the time the alert arrived, much of the financial damage was done.

GCP and Azure native anomaly tools fare similarly poorly. Third-party tools (Vantage, Anodot, Finout) catch more but cost anywhere from $5K to $200K+/year depending on the vendor. Custom alerting on CloudWatch metrics catches different patterns. Application-layer instrumentation catches AI/SaaS spend that none of the cloud-layer tools see.

There is no single cost anomaly detection tool that catches everything. The right architecture combines multiple layers, each catching different anomaly classes. This post is the decision framework: which anomaly detection to use for which spike type, how to architect detection that catches 90%+ of impactful spikes within hours, and how to avoid the alert fatigue that kills most monitoring programs.

If your cost anomaly detection is only AWS Cost Anomaly Detection, you are catching about a third of what's hitting your bill.

The 5 Anomaly Detection Layers (And What Each Catches)

Cost anomaly detection has 5 distinct layers. Each catches different anomaly classes. Most teams deploy 1-2 layers and miss everything else.

Layer	Latency	Detection Power	Cost	Catches
L1: Native cloud anomaly detection	6-24 hours	Service-level spikes only	Free	Service-level spikes with strong baselines
L2: Custom CloudWatch alarms	Minutes	Specific metrics you define	Free or near-free	Known-risk patterns, threshold breaches
L3: Application-layer instrumentation	Seconds	LLM/SaaS/external API spend	Build cost only	AI cost runaways, agent loops, prompt regressions
L4: Third-party FinOps platform	1-6 hours	Multi-dimensional ML anomalies	$5K-$60K/yr	Combinations of services, customer-level patterns
L5: Real-time billing API polling	5-15 min	Custom workflow patterns	Build cost	Tier crossings, commitment exhaustion

Layered defense is the goal. No single layer catches all classes of anomaly.

Layer 1: Native Cloud Anomaly Detection (The Free Floor)

AWS Cost Anomaly Detection

What it does: Uses ML to detect unusual spend patterns at service level. Free.

What it catches well:

Service costs that spike 30%+ above 90-day baseline
Sustained anomalies lasting 6+ hours
Single-service spikes (one service tripled overnight)

What it misses:

New services with no baseline (cold-start gap)
Combinations of services each spiking 15%
Spikes that stay below the configured alert thresholds (default $40/day or 40% above expected)
Anything outside AWS billing (LLM APIs, SaaS providers)
Short transient spikes (resolved before next billing data update)
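The combination gap is easy to see with a toy example. The sketch below assumes a 30% per-service trigger (borrowed from the figure above) and invented daily costs; `per_service_alerts` and `aggregate_delta` are hypothetical helper names, not part of any AWS API.

```python
# Toy illustration: four services each rise 15%, so no per-service
# percentage threshold fires, yet the combined dollar increase is large.
# All cost figures below are invented for the example.

PER_SERVICE_THRESHOLD = 0.30  # assumed 30% per-service trigger

baseline = {"Lambda": 4000.0, "S3": 3000.0, "DataTransfer": 2000.0, "EC2": 6000.0}
observed = {name: cost * 1.15 for name, cost in baseline.items()}


def per_service_alerts(baseline, observed, threshold):
    """Return services whose relative increase exceeds the threshold."""
    return [
        name
        for name, base in baseline.items()
        if base > 0 and (observed[name] - base) / base > threshold
    ]


def aggregate_delta(baseline, observed):
    """Return the total dollar increase across all services."""
    return sum(observed[name] - base for name, base in baseline.items())


alerts = per_service_alerts(baseline, observed, PER_SERVICE_THRESHOLD)
delta = aggregate_delta(baseline, observed)
print(alerts)           # services that crossed the per-service threshold
print(round(delta, 2))  # combined daily increase in dollars
```

No individual service alerts, while the aggregate daily increase is in the thousands of dollars. An aggregate dollar check alongside per-service checks is one way to close that gap.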

Setup tips:

Create monitors for each major service category (compute, storage, network, database)
Tune sensitivity per monitor (default thresholds are too lax for most teams)
Subscribe via SNS to PagerDuty/Slack, not just email
Monitor specific cost centers/tags as separate monitors

Real cost math (deploy time): 2-4 hours initial setup + 1 hour/month tuning. Total cost: zero dollars, roughly 14-16 hours of engineering time in the first year.
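As a rough sketch of the setup tips, the payloads below describe a service-level monitor and an SNS subscription for the Cost Explorer `create_anomaly_monitor` and `create_anomaly_subscription` calls in boto3. The monitor name, subscription name, and the $100 impact floor are placeholder assumptions, and field names should be checked against the current API reference before use.

```python
# Sketch: request payloads for a service-level anomaly monitor and an
# SNS subscription. Names and the $100 impact floor are hypothetical.

def build_monitor(name="service-monitor"):
    """Payload for ce.create_anomaly_monitor(AnomalyMonitor=...)."""
    return {
        "MonitorName": name,
        "MonitorType": "DIMENSIONAL",
        "MonitorDimension": "SERVICE",
    }


def build_subscription(monitor_arn, sns_topic_arn, min_impact_usd=100):
    """Payload for ce.create_anomaly_subscription(AnomalySubscription=...)."""
    return {
        "SubscriptionName": "cost-anomalies-to-sns",
        "MonitorArnList": [monitor_arn],
        "Subscribers": [{"Type": "SNS", "Address": sns_topic_arn}],
        "Frequency": "IMMEDIATE",  # SNS subscribers use immediate delivery
        "ThresholdExpression": {
            "Dimensions": {
                "Key": "ANOMALY_TOTAL_IMPACT_ABSOLUTE",
                "MatchOptions": ["GREATER_THAN_OR_EQUAL"],
                "Values": [str(min_impact_usd)],
            }
        },
    }


def create(sns_topic_arn):
    """Create both resources; needs boto3 and AWS credentials."""
    import boto3  # imported lazily so the builders work without it

    ce = boto3.client("ce")
    arn = ce.create_anomaly_monitor(AnomalyMonitor=build_monitor())["MonitorArn"]
    ce.create_anomaly_subscription(
        AnomalySubscription=build_subscription(arn, sns_topic_arn)
    )
    return arn
```

Routing the subscription to an SNS topic rather than email is what lets the same alert fan out to PagerDuty or Slack.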

GCP Recommender + Cost Anomaly

What it does: GCP's cost anomaly detection (in alpha as of 2026) plus the Recommender service.

What it catches well:

BigQuery query cost anomalies
Compute Engine sudden growth
Cloud Storage class drift

What it misses:

Cross-product anomalies
Speed: typically 12-48 hour lag

Setup tips:

Enable Cloud Billing Budget alerts for floor protection
Use Recommender API to surface cost recommendations
Combine with custom monitoring for production-critical workloads

Azure Cost Management

What it does: Anomaly detection on Azure billing data with email notifications.

What it catches well:

Resource group level spikes
Subscription-level anomalies

What it misses:

Speed: 24-72 hour lag
Granular service-within-subscription anomalies

Layer 2: Custom CloudWatch Alarms (The Fast Tactical Layer)

CloudWatch metrics update in minutes. By alerting on leading indicators of cost (not the cost itself), you can catch spikes before they appear on bills.

Critical Metrics To Alarm On

EC2/Compute:

CPU utilization >80% for 30+ minutes (indicates need to scale, may trigger Auto Scaling cost increase)
Number of running instances changed by >20% in 1 hour
Spot instance interruption rate

Lambda:

Invocation count delta >50% week-over-week per function
Concurrent executions hitting account limit (causes throttling and retries)
Duration p95 increased >50%

S3:

Number of objects increased >5% in 24 hours (often from runaway uploads)
Bucket size growth rate change
Number of GET/PUT requests above per-bucket baseline

Network:

NAT Gateway data processed/hour >2x baseline
Cross-region data transfer >baseline
CloudFront request rate changes

RDS/Aurora:

Storage size growth rate
IOPS sustained >baseline
Connection count near max

Karpenter/EKS:

Node count changed >30% in 1 hour
Pod count change rate

Setup Pattern

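# Schematic alarm definition (pseudo-config, not literal CloudFormation)
CloudWatchAlarm:
  Namespace: AWS/NATGateway
  Metric: BytesOutToDestination    # per-NAT-gateway traffic metric
  Statistic: Sum
  Period: 5min
  Threshold: 2x_baseline_per_hour  # scale to the 5-minute period when applied
  Action:
    - SNS → PagerDuty/Slack
    - Lambda → automated investigation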
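One way to turn that pattern into a concrete alarm is to build the parameters for CloudWatch's `put_metric_alarm` call. This is a minimal sketch: the gateway ID, SNS ARN, 12 GiB/hour baseline, and the choice of three evaluation periods are all placeholder assumptions.

```python
# Sketch: translate "2x hourly baseline, 5-minute period" into
# CloudWatch put_metric_alarm parameters. IDs and ARNs are placeholders.

def nat_alarm_params(nat_gateway_id, baseline_bytes_per_hour, sns_topic_arn):
    period_seconds = 300
    periods_per_hour = 3600 // period_seconds
    # 2x the hourly baseline, expressed per 5-minute period.
    threshold = 2 * baseline_bytes_per_hour / periods_per_hour
    return {
        "AlarmName": f"nat-bytes-2x-baseline-{nat_gateway_id}",
        "Namespace": "AWS/NATGateway",
        "MetricName": "BytesOutToDestination",
        "Dimensions": [{"Name": "NatGatewayId", "Value": nat_gateway_id}],
        "Statistic": "Sum",
        "Period": period_seconds,
        "EvaluationPeriods": 3,  # assumed: 3 consecutive breaches (15 min)
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",
        "AlarmActions": [sns_topic_arn],
    }


params = nat_alarm_params(
    "nat-0123456789abcdef0",
    baseline_bytes_per_hour=12 * 1024**3,  # hypothetical 12 GiB/hour
    sns_topic_arn="arn:aws:sns:us-east-1:123456789012:cost-alerts",
)
print(params["Threshold"])  # bytes per 5-minute period
# To create the alarm: boto3.client("cloudwatch").put_metric_alarm(**params)
```

Requiring several consecutive breaching periods trades a few minutes of latency for fewer pages on short bursts.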

The Lambda action is the killer feature. When an alarm fires, an automated Lambda can:

Pull the last 24 hours of CloudWatch metrics for related services
Pull recent VPC Flow Logs (filtered to NAT Gateway IPs)
Identify the top destinations and top source services
Post structured analysis to Slack with the most likely culprit

Total deploy time: 1-2 days. Catches anomalies in 5-15 minutes vs 6-24 hours for native cost anomaly detection.
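The "top destinations" step of that investigation Lambda can be sketched in a few lines. The example assumes flow log records in the default space-separated format (destination address in the fifth field, bytes in the tenth, action in the thirteenth); `top_destinations` and the sample records are illustrative, not taken from a real account.

```python
from collections import Counter


def top_destinations(flow_log_lines, n=3):
    """Sum bytes per destination address for ACCEPT records."""
    totals = Counter()
    for line in flow_log_lines:
        fields = line.split()
        # Skip header rows and malformed lines.
        if len(fields) < 14 or fields[0] == "version":
            continue
        dstaddr, num_bytes, action = fields[4], fields[9], fields[12]
        # NODATA/SKIPDATA records carry "-" instead of numbers.
        if action != "ACCEPT" or not num_bytes.isdigit():
            continue
        totals[dstaddr] += int(num_bytes)
    return totals.most_common(n)


sample = [
    "2 123456789012 eni-0a1b2c3d 10.0.1.5 203.0.113.10 44321 443 6 120 900000 1700000000 1700000060 ACCEPT OK",
    "2 123456789012 eni-0a1b2c3d 10.0.1.9 198.51.100.7 51000 443 6 40 250000 1700000000 1700000060 ACCEPT OK",
    "2 123456789012 eni-0a1b2c3d 10.0.1.5 203.0.113.10 44322 443 6 90 600000 1700000060 1700000120 ACCEPT OK",
    "2 123456789012 eni-0a1b2c3d 10.0.1.7 198.51.100.7 51001 22 6 10 50000 1700000060 1700000120 REJECT OK",
    "2 123456789012 eni-0a1b2c3d - - - - - - - 1700000120 1700000180 - NODATA",
]
result = top_destinations(sample)
print(result)
```

The same aggregation keyed on the source address field gives the top source services.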

Layer 3: Application-Layer Instrumentation (The AI Cost Layer)

This layer is mandatory for any team running AI/LLM workloads or SaaS API integrations. Cloud-layer detection cannot see into these.

What To Instrument

For every LLM API call (OpenAI, Anthropic, Google, Bedrock):

Customer ID, feature ID, workflow ID
Model used, input tokens, output tokens
Calculated cost in USD

For every external SaaS API call (Stripe, Twilio, SendGrid, Algolia):

Service name, endpoint, customer attribution
Calculated cost based on per-call pricing
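A minimal sketch of the per-call record for LLM usage might look like the following. The model names and per-million-token prices are invented placeholders, and `record_llm_call` is a hypothetical helper you would call from your own API wrapper.

```python
from dataclasses import dataclass

# Hypothetical USD prices per 1M tokens; replace with your provider's rates.
PRICES_PER_MILLION = {
    "example-model-small": {"input": 1.00, "output": 4.00},
    "example-model-large": {"input": 10.00, "output": 40.00},
}


@dataclass(frozen=True)
class LLMCallRecord:
    customer_id: str
    feature_id: str
    workflow_id: str
    model: str
    input_tokens: int
    output_tokens: int
    cost_usd: float


def record_llm_call(customer_id, feature_id, workflow_id, model,
                    input_tokens, output_tokens):
    """Build one cost event from the token counts the provider returns."""
    price = PRICES_PER_MILLION[model]
    cost = (input_tokens * price["input"]
            + output_tokens * price["output"]) / 1_000_000
    return LLMCallRecord(customer_id, feature_id, workflow_id, model,
                         input_tokens, output_tokens, round(cost, 6))


rec = record_llm_call("cust_a", "chat", "wf_42", "example-model-large",
                      input_tokens=12_000, output_tokens=3_000)
print(rec.cost_usd)
```

Computing the cost at write time keeps later dashboards simple, at the price of having to backfill if the rate table was wrong.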

Storage Pattern

Stream events to a cost-tracking database (PostgreSQL, BigQuery, ClickHouse). Build dashboards on top:

Cost per customer (trailing 7-day, 30-day)
Cost per feature with deltas
Top 20 expensive sessions
Anomalous patterns (specific user 10x average)
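To make the storage pattern concrete, here is a small sketch using SQLite from the Python standard library as a stand-in for PostgreSQL, BigQuery, or ClickHouse. The `cost_events` table, its columns, and the sample rows are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE cost_events (
           ts TEXT, customer_id TEXT, feature_id TEXT, cost_usd REAL)"""
)
# Invented sample events; timestamps use 'YYYY-MM-DD HH:MM:SS' so that
# string comparison matches chronological order.
rows = [
    ("2026-05-18 10:00:00", "cust_a", "chat", 12.50),
    ("2026-05-17 09:00:00", "cust_a", "search", 3.25),
    ("2026-05-16 14:00:00", "cust_b", "chat", 40.00),
    ("2026-05-01 08:00:00", "cust_b", "chat", 99.00),  # outside the 7-day window
]
conn.executemany("INSERT INTO cost_events VALUES (?, ?, ?, ?)", rows)


def cost_per_customer(conn, as_of, days):
    """Total cost per customer over the trailing window ending at as_of."""
    query = """
        SELECT customer_id, ROUND(SUM(cost_usd), 2) AS total
        FROM cost_events
        WHERE ts > datetime(?, ?) AND ts <= ?
        GROUP BY customer_id
        ORDER BY total DESC
    """
    return conn.execute(query, (as_of, f"-{days} days", as_of)).fetchall()


week = cost_per_customer(conn, "2026-05-19 00:00:00", 7)
month = cost_per_customer(conn, "2026-05-19 00:00:00", 30)
print(week)
print(month)
```

The same query grouped by `feature_id` gives cost per feature, and comparing two windows gives the deltas.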

Alert Patterns

Critical alerts:

Single session over $20 in token cost (catches agent runaways)
Customer cost increased >3x week-over-week (catches abuse or feature changes)
Feature cost-per-user increased >50% (catches prompt regressions)
Agent loop length p99 increased >50% (catches infinite loops)
Daily company-wide AI spend over budget

These cannot be implemented via cloud tools. They live in your application instrumentation.
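Two of those rules, sketched as plain functions over aggregated cost data. The $20 session limit and the 3x multiplier come from the list above; the input shapes (dicts keyed by session or customer ID) and function names are assumptions.

```python
SESSION_LIMIT_USD = 20.0       # "single session over $20"
CUSTOMER_WOW_MULTIPLIER = 3.0  # "increased >3x week-over-week"


def session_alerts(session_costs):
    """Return session IDs whose token cost exceeds the session limit."""
    return sorted(
        sid for sid, cost in session_costs.items() if cost > SESSION_LIMIT_USD
    )


def customer_wow_alerts(last_week, this_week):
    """Return customers whose weekly cost more than tripled.

    Customers with no spend last week are skipped here; a real system
    would likely handle brand-new customers with a separate rule.
    """
    alerts = []
    for customer, current in this_week.items():
        previous = last_week.get(customer, 0.0)
        if previous > 0 and current > CUSTOMER_WOW_MULTIPLIER * previous:
            alerts.append(customer)
    return sorted(alerts)


print(session_alerts({"s1": 4.10, "s2": 37.50, "s3": 20.0}))
print(customer_wow_alerts({"a": 100.0, "b": 50.0},
                          {"a": 350.0, "b": 120.0, "c": 900.0}))
```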

For more detail, see our FinOps for AI Workloads post.

Layer 4: Third-Party FinOps Platforms (The ML Anomaly Layer)

Third-party platforms use more sophisticated ML than native AWS detection. They catch patterns the floor and tactical layers miss.

Vantage

Strengths: Multi-cloud anomaly detection across AWS, GCP, Azure. Cost grouping by team/service via tags. Engineering-friendly UX.

Catches: Multi-service combination anomalies. Slow-burn anomalies that grow gradually. Tag-based cost group anomalies.

Pricing: Free tier covers basics. Paid plans $5K-$60K/year based on spend volume.

Best for: $100K-$1M/month spend organizations wanting unified visibility.

Finout

Strengths: Multi-cloud + Kubernetes + SaaS anomalies in one view. Strong K8s allocation. BU-level rollups.

Catches: K8s namespace anomalies. Multi-tenant SaaS cost spikes. Cross-product cost increases.

Pricing: $25K-$200K/year.

Best for: Mid-market with complex K8s and multi-cloud needs.

Anodot

Strengths: Heaviest ML approach. Best at catching subtle multi-dimensional anomalies. Rich pattern detection.

Catches: Multi-variate anomalies (region × service × time). Slowly building anomalies that traditional thresholds miss.

Pricing: $25K-$200K+/year (enterprise pricing).

Best for: $5M+/year spend organizations where 1% improvement = $50K+/year.

CloudHealth (VMware)

Strengths: Enterprise governance with anomaly detection bolted on. Strong at policy-based alerts.

Catches: Compliance/policy violations that drive cost. Enterprise commit utilization anomalies.

Pricing: $50K-$500K+/year.

Best for: Enterprise organizations already on CloudHealth for governance.

When To Add A Third-Party Platform

Cloud spend exceeds $100K/month AND native + custom isn't catching enough
Multi-cloud deployment requiring unified anomaly view
Engineering team has spent 4+ hours on a single anomaly investigation that better tooling would have surfaced immediately
Compliance/audit requires documented anomaly response (third-party tools log better)

Don't add a third-party platform if:

Cloud spend under $50K/month (license fees exceed catch value)
Native + custom alerts are catching 80%+ of impact
No team capacity to act on additional alerts

Layer 5: Real-Time Billing API Polling (The Custom Layer)

For specific high-risk workflows, build custom alerting that polls the cloud billing API at higher frequency than native tools.

When This Layer Is Worth Building

Workloads where 6 hours of runaway costs >$10K (e.g., heavy LLM API consumers, large data processing)
Critical commitments where utilization tracking matters by the hour
Tier crossings (e.g., crossing into a higher pricing tier on a SaaS bill)

Implementation Pattern

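# Sketch: poll AWS Cost Explorer every 15 min (needs boto3; hourly
# granularity must be enabled, and billing data can lag real time)
from datetime import datetime, timedelta, timezone
import boto3

def check_hourly_cost(threshold, alert_slack):  # alert_slack: callable that posts to Slack
    now = datetime.now(timezone.utc).replace(minute=0, second=0, microsecond=0)
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    resp = boto3.client("ce").get_cost_and_usage(
        TimePeriod={"Start": (now - timedelta(hours=1)).strftime(fmt), "End": now.strftime(fmt)},
        Granularity="HOURLY",
        Metrics=["UnblendedCost"],
    )
    cost = sum(float(r["Total"]["UnblendedCost"]["Amount"]) for r in resp["ResultsByTime"])
    if cost > threshold:
        alert_slack(f"Hourly cost ${cost:.2f} exceeds ${threshold:.2f}")

# Run on Lambda triggered by EventBridge every 15 min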

This isn't free: the Cost Explorer API charges per request. At $0.01 per request, though, polling every 15 minutes costs about $29/month. Cheap insurance for high-velocity workloads.

What This Layer Catches

Spikes within the same day (before native detection sees them)
Hourly tier crossings (e.g., crossing data transfer pricing tiers)
Commitment burn rate anomalies

The Decision Framework: 5 Questions

Question 1: What is your cloud spend?

Under $50K/month: L1 (native) + L2 (custom CloudWatch alarms) is enough
$50K-$500K/month: Add L3 (app instrumentation if AI workloads) and consider L4 (Vantage)
$500K-$5M/month: L1+L2+L3+L4 mandatory; consider L5 for critical paths
Over $5M/month: All 5 layers, plus dedicated FinOps team for response
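The spend tiers above can be encoded as a small lookup, which is handy if you want to apply the framework across many accounts. This is a sketch: the function name is hypothetical, and how exact boundary values ($50K, $500K, $5M) are assigned is an assumption.

```python
def recommended_layers(monthly_spend_usd, has_ai_workloads):
    """Map monthly cloud spend to required and optional detection layers."""
    if monthly_spend_usd < 50_000:
        return {"required": ["L1", "L2"], "consider": []}
    if monthly_spend_usd < 500_000:
        required = ["L1", "L2"] + (["L3"] if has_ai_workloads else [])
        return {"required": required, "consider": ["L4"]}
    if monthly_spend_usd <= 5_000_000:
        return {"required": ["L1", "L2", "L3", "L4"], "consider": ["L5"]}
    return {"required": ["L1", "L2", "L3", "L4", "L5"], "consider": []}


print(recommended_layers(30_000, has_ai_workloads=False))
print(recommended_layers(200_000, has_ai_workloads=True))
print(recommended_layers(2_000_000, has_ai_workloads=True))
```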

Question 2: What is your tolerance for spike duration?

24+ hours acceptable: L1 alone is fine
6-24 hours acceptable: L1 + L4
1-6 hours required: L1 + L2 + L4
Sub-hour required: L2 + L3 + L5 (keep L1 only as a backstop)

Question 3: What workload types do you run?

Cloud-only (no AI/SaaS): Skip L3
Heavy AI workloads: L3 mandatory; L1 alone misses most AI spikes
Heavy SaaS integration: L3 for SaaS API calls
Microservices with chatty inter-service calls: L2 for cross-AZ traffic monitoring

Question 4: How fast can you respond to alerts?

24+ hour response time: L1 alone delivers value matched to your response speed
Sub-hour response capability: Invest in L2+L3 for fast detection
Always-on team: All layers; even small anomalies get triaged

Question 5: What is your alert fatigue tolerance?

Low (don't want to be paged at night): Configure L4 with high thresholds; route most to email digest
Medium: Critical L2 alerts to Slack/PagerDuty, broader L4 alerts to email
High (we want to know everything): All layers feeding to investigation Slack channel

Avoiding Alert Fatigue (The Subtle Killer)

The most common failure mode after deploying multiple anomaly layers: alert fatigue. When everything alerts, nothing gets investigated.

Tier Your Alerts

P1 (page someone): True spike that requires immediate action (>$1K/hour or >$10K/day)
P2 (Slack channel): Anomaly worth investigating but not urgent
P3 (daily digest): Background anomalies for trend analysis
P4 (weekly review): Slow-burn patterns

Set Thresholds Based On Impact, Not Percentage

A 100% spike on a $5/day service is not interesting. A 20% spike on a $5,000/day service is. Threshold by absolute dollar impact, not percentage.
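The two examples above, worked through in code. The $500 minimum impact is a hypothetical setting chosen for illustration; pick a floor that matches your own budget.

```python
def dollar_impact(baseline_daily_usd, observed_daily_usd):
    """Absolute daily increase over baseline, in dollars."""
    return observed_daily_usd - baseline_daily_usd


def should_alert(baseline_daily_usd, observed_daily_usd, min_impact_usd=500.0):
    """Alert on absolute dollar impact, ignoring the percentage change."""
    return dollar_impact(baseline_daily_usd, observed_daily_usd) >= min_impact_usd


# 100% spike on a $5/day service: only $5 of impact.
print(should_alert(5.0, 10.0))
# 20% spike on a $5,000/day service: $1,000 of impact.
print(should_alert(5000.0, 6000.0))
```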

Implement Mute Periods

After investigating an anomaly, mute that specific signal for 24 hours unless it spikes again. This prevents alarm spam while you fix the underlying issue.
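A mute period can be as simple as an in-memory registry keyed by signal name. In this sketch, "spikes again" is interpreted as the value rising 50% above the level at mute time; that factor, the class name, and the signal name are all assumptions.

```python
from datetime import datetime, timedelta

MUTE_WINDOW = timedelta(hours=24)
RESPIKE_FACTOR = 1.5  # assumed: "spikes again" = 50% above the muted level


class MuteRegistry:
    def __init__(self):
        self._muted = {}  # signal -> (muted_at, value_at_mute)

    def mute(self, signal, value, now):
        """Record that this signal was investigated at the given level."""
        self._muted[signal] = (now, value)

    def should_notify(self, signal, value, now):
        """Return True if an alert for this signal should be sent."""
        entry = self._muted.get(signal)
        if entry is None:
            return True
        muted_at, muted_value = entry
        if now - muted_at >= MUTE_WINDOW:
            del self._muted[signal]  # mute expired
            return True
        return value > muted_value * RESPIKE_FACTOR


t0 = datetime(2026, 5, 19, 12, 0)
registry = MuteRegistry()
registry.mute("nat-bytes", 100.0, t0)
print(registry.should_notify("nat-bytes", 120.0, t0 + timedelta(hours=2)))
print(registry.should_notify("nat-bytes", 200.0, t0 + timedelta(hours=3)))
```

In production the registry would need shared storage (a small table or cache) so that every alert handler sees the same mute state.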

Track Alert ROI

Every quarter, review:

What % of alerts led to action?
What % were false positives?
What real spikes did we miss?

Use this to tune thresholds and decommission noisy alerts.
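The quarterly review reduces to a few ratios over an alert log. The sketch assumes each alert is tagged with an `outcome` of `action`, `false_positive`, or `ignored`; that labelling scheme and the function name are hypothetical.

```python
def alert_roi(alerts, missed_spikes):
    """Summarize how useful a quarter's alerts were."""
    total = len(alerts)
    if total == 0:
        return {"action_rate": 0.0, "false_positive_rate": 0.0,
                "missed_spikes": missed_spikes}
    acted = sum(1 for a in alerts if a["outcome"] == "action")
    false_positives = sum(1 for a in alerts if a["outcome"] == "false_positive")
    return {
        "action_rate": acted / total,
        "false_positive_rate": false_positives / total,
        "missed_spikes": missed_spikes,
    }


quarter = [
    {"signal": "nat-bytes", "outcome": "action"},
    {"signal": "lambda-invocations", "outcome": "false_positive"},
    {"signal": "session-cost", "outcome": "action"},
    {"signal": "s3-puts", "outcome": "ignored"},
]
print(alert_roi(quarter, missed_spikes=1))
```

Grouping the same ratios by signal shows which individual alerts are candidates for decommissioning.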

The 30-Day Anomaly Detection Setup Plan

For a team starting from scratch (no anomaly detection at all):

Week 1: Native Floor

Enable AWS Cost Anomaly Detection on all relevant services
Configure GCP/Azure equivalents if multi-cloud
Subscribe to PagerDuty or Slack (not just email)
Tune thresholds: lower for new services, higher for noisy ones

Week 2: Custom CloudWatch Alarms

Identify top 10 cost-driving services
Build CloudWatch alarms on leading indicators (CPU, Lambda invocations, S3 PUT count, NAT bytes)
Route to Slack via SNS → Lambda → webhook
Test by inducing controlled spikes
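The SNS → Lambda → webhook hop in Week 2 can be sketched with the standard library alone. The handler assumes the alarm notification arrives as a JSON string in the SNS message and that a `SLACK_WEBHOOK_URL` environment variable holds your incoming-webhook URL; both the variable name and the message wording are assumptions.

```python
import json
import os
import urllib.request


def format_slack_message(event):
    """Turn an SNS-wrapped CloudWatch alarm notification into a Slack payload."""
    message = json.loads(event["Records"][0]["Sns"]["Message"])
    text = (
        f"{message['AlarmName']} is {message['NewStateValue']}: "
        f"{message['NewStateReason']}"
    )
    return {"text": text}


def lambda_handler(event, context):
    """Lambda entry point: post the formatted alarm to the Slack webhook."""
    payload = json.dumps(format_slack_message(event)).encode("utf-8")
    request = urllib.request.Request(
        os.environ["SLACK_WEBHOOK_URL"],
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=5) as response:
        return {"status": response.status}


# Hand-built sample event for local testing of the formatter.
sample_event = {
    "Records": [{
        "Sns": {
            "Message": json.dumps({
                "AlarmName": "nat-bytes-2x-baseline",
                "NewStateValue": "ALARM",
                "NewStateReason": "Threshold crossed",
            })
        }
    }]
}
print(format_slack_message(sample_event))
```

Keeping the formatter separate from the HTTP call makes the controlled-spike test easy to run locally without posting to Slack.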

Week 3: Application Instrumentation (If AI/SaaS)

Wrap LLM API calls with cost tracking
Wrap SaaS API calls (Stripe, Twilio, SendGrid)
Stand up a basic dashboard showing per-customer/per-feature spend
Add alerts for sessions over $20

Week 4: Third-Party Tool (Optional)

Evaluate Vantage / Finout if cloud spend exceeds $100K/month
Pilot for 2 weeks with sample workload
Compare anomalies caught vs L1+L2+L3 alone
Decide on production deployment

After 30 days, you have a multi-layer anomaly detection system that catches 90%+ of impactful spikes within hours.

Real Detection Performance: A Comparison

We deployed each layer in parallel for 90 days on a $200K/month AWS workload to measure detection rates.

Layer	Anomalies Caught	Avg Detection Time	False Positive Rate
L1 (AWS Native)	31%	18 hours	4%
L2 (Custom CloudWatch)	47%	12 minutes	11%
L3 (App Instrumentation)	28% (AI-only)	1 minute	2%
L4 (Vantage)	64%	2 hours	7%
L5 (Custom Polling)	22% (specific patterns)	15 minutes	8%
Combined L1+L2+L3+L4	94%	8 minutes (median)	9%

The combined system catches 94% of impactful anomalies with an 8-minute median detection time. No single layer catches more than 64% on its own; the layers complement each other.

Hidden Costs Of Anomaly Detection Most Teams Miss

Hidden Cost 1: Investigation Time

Every alert costs investigation time. A team with too-sensitive alerts can spend 10+ hours/week investigating false positives. Tune thresholds aggressively.

Hidden Cost 2: Cost Explorer API Charges

If you build L5 (real-time polling), Cost Explorer API charges $0.01/query. Polling every 5 minutes 24/7 = $86/month. Polling every 15 minutes = $29/month. Choose your frequency based on response value.
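The arithmetic behind those figures, as a quick check. It assumes a 30-day month, one request per poll, and the $0.01 per-request price quoted above.

```python
PRICE_PER_REQUEST_USD = 0.01  # per-request price quoted in this post


def monthly_polling_cost(interval_minutes, days=30):
    """Cost of polling around the clock at a fixed interval."""
    requests_per_day = 24 * 60 // interval_minutes
    return round(requests_per_day * days * PRICE_PER_REQUEST_USD, 2)


print(monthly_polling_cost(5))   # 288 requests/day
print(monthly_polling_cost(15))  # 96 requests/day
```

If each poll issues several queries (for example, one per service), multiply accordingly.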

Hidden Cost 3: Third-Party Tool Configuration

Vantage / Finout / Anodot require initial setup time (typically 40-80 engineering hours) plus ongoing maintenance. Factor this into ROI calculations.

Hidden Cost 4: Alert Pipeline Maintenance

CloudWatch alarms break when CloudWatch metric names change. Lambda functions need updates when SDKs change. Plan 1-2 hours/month for maintenance per major alert pipeline.

Hidden Cost 5: Documentation Debt

Anomaly response runbooks become stale. Alerts route to people who left the company. Without ownership rotation, the system degrades. Schedule quarterly runbook reviews.

When To Stop Adding Detection

It's tempting to keep adding layers. Stop when:

Marginal detection rate improvement is under 5%
Alert fatigue is causing real anomalies to be ignored
Engineering hours spent on detection exceed savings from caught anomalies
Third-party tool costs exceed value delivered

For most teams under $100K/month spend, three layers (L1 + L2 + L3 if AI) is the right ceiling. Adding L4 below that scale rarely pays off.

The Bottom Line

Cost anomaly detection in 2026 is a layered architecture, not a single tool. AWS Cost Anomaly Detection alone catches 31% of impactful spikes. Adding custom CloudWatch alarms gets you to ~60%. Adding application-layer instrumentation for AI/SaaS spend gets you to ~85%. A third-party tool at $100K+/month spend gets you to roughly 94%.

The discipline most teams skip: treating anomaly detection as a configuration task instead of an architecture decision. The right setup combines free native tools, custom tactical alarms, application instrumentation, and (above a certain scale) third-party ML detection. Each layer catches different anomaly classes.

If your cost anomaly detection is only AWS Cost Anomaly Detection, you are catching about a third of what's hitting your bill. Our cloud cost optimization team builds layered anomaly detection systems that typically catch 90%+ of impactful anomalies within hours. Run a free Cloud Waste Scorecard to identify your biggest cost monitoring gaps first.
