Someone on Your Team Just Burned $12,000 in AWS and Has No Idea
You open your email Monday morning. There is an AWS billing alert. Your weekly spend jumped $12,000 above normal. Finance is already asking questions. Your CTO wants answers by noon.
You log into Cost Explorer. You see the spike. It started Wednesday afternoon. But Cost Explorer shows you totals by service, not causes. EC2 is up. Data transfer is up. S3 is up. Everything is up. You start clicking through accounts, regions, and services, trying to piece together what happened.
Three hours later, you have narrowed it down to "something in us-east-1 related to EC2 and NAT gateway." That is not an answer. That is a geography lesson.
This is the reality for most AWS teams. They can see that costs spiked. They cannot quickly trace the spike to a specific resource, deployment, or team. And while they are investigating, the spike is still running, burning another $1,700 per day.
I have spent years helping teams build systems that prevent this exact scenario. This post gives you the 7 tactics that let you trace any AWS cost spike to its root cause in under 30 minutes and, more importantly, prevent most spikes from happening at all.
The 5 AWS Services That Cause 80% of Cost Spikes
Before we get into detection and prevention, you need to know where to look first. In our experience, five AWS services are responsible for roughly 80% of all unexpected cost increases. When your bill spikes, check these first and in this order:
1. EC2 (Including Auto Scaling Groups)
EC2 is the number one cause of cost spikes for one simple reason: auto scaling groups can launch instances faster than humans can notice.
A misconfigured scaling policy that sets the minimum capacity too high, responds to the wrong metric, or fails to scale down after a traffic spike can leave dozens of expensive instances running indefinitely. One team we know had a scaling policy triggered by CPU utilization on a single instance. When that instance had a temporary CPU spike from a cron job, the ASG launched 40 additional instances. Nobody noticed for 5 days. Cost: $8,400.
Where to look: Check Auto Scaling Group activity history. Look for scaling events that were not matched by corresponding scale-down events. Check for instances in "running" state that were launched by ASGs but are not receiving traffic.
2. NAT Gateway
NAT Gateway is the stealth killer of AWS budgets. It charges $0.045/hour per gateway (about $32/month just for existing) plus $0.045 for every gigabyte of data processed through it.
Most teams have no idea how much data flows through their NAT gateways. A single misconfigured application pulling data from the internet through NAT instead of using a VPC endpoint can process terabytes of data per month. The NAT Gateway processing charge alone on 10TB is $450.
Where to look: In Cost Explorer, filter by "NAT Gateway" under EC2-Other. If this number is surprising, use VPC Flow Logs to identify which instances are sending traffic through the NAT gateway and how much.
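Before digging into Flow Logs, you can get a quick cost estimate from CloudWatch. A sketch (assuming the us-east-1 processing price; the function names are mine) that sums a gateway's outbound bytes for the last day and converts them to dollars:

```python
from datetime import datetime, timedelta, timezone

NAT_PROCESSING_PER_GB = 0.045  # us-east-1 price; check your region


def nat_processing_cost(bytes_processed: float) -> float:
    """Estimated NAT Gateway data-processing charge for a given byte count."""
    return round(bytes_processed / 1024**3 * NAT_PROCESSING_PER_GB, 2)


def daily_nat_bytes(nat_gateway_id: str, region: str = "us-east-1") -> float:
    """Sum of bytes the NAT gateway sent out over the last 24 hours."""
    import boto3  # local import keeps the cost helper testable without AWS creds

    cw = boto3.client("cloudwatch", region_name=region)
    now = datetime.now(timezone.utc)
    resp = cw.get_metric_statistics(
        Namespace="AWS/NATGateway",
        MetricName="BytesOutToDestination",
        Dimensions=[{"Name": "NatGatewayId", "Value": nat_gateway_id}],
        StartTime=now - timedelta(days=1),
        EndTime=now,
        Period=86400,
        Statistics=["Sum"],
    )
    return sum(p["Sum"] for p in resp["Datapoints"])
```

Multiply the daily figure by 30 and you know whether this gateway is worth a VPC endpoint project before you ever open a Flow Log.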
For a deep dive into this specific problem, read our guide on the hidden AWS bill: NAT gateways and AI workloads.
3. Data Transfer
AWS data transfer pricing is genuinely confusing, and that confusion costs teams thousands.
Here are the rules most people get wrong:
- Data transfer into AWS is free
- Data transfer between Availability Zones in the same region costs $0.01/GB in each direction, so each gigabyte crossing an AZ boundary effectively costs $0.02
- Data transfer between regions costs $0.02/GB
- Data transfer to the internet costs $0.09/GB (first 10TB), then $0.085/GB
- Data transfer between services in the same AZ using private IPs is free, but using public IPs costs $0.01/GB
That cross-AZ charge is the one that kills microservices architectures. If you have 20 services communicating across AZs and each sends 1GB/hour to the others, that is roughly 400GB/hour x $0.02 = $8/hour, or about $5,760/month just in inter-service communication.
Where to look: In Cost Explorer, group by "Usage Type" and filter for anything containing "DataTransfer." Sort by cost. The top entries will tell you exactly where your transfer costs are concentrated.
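The same grouping works through the Cost Explorer API if you prefer a script to the console. A sketch (helper names are mine; boto3's `get_cost_and_usage` is the real call) that ranks DataTransfer usage types over the last week:

```python
def rank_transfer_usage(results_by_time: list[dict]) -> list[tuple[str, float]]:
    """Aggregate Cost Explorer groups, keep DataTransfer usage types, costliest first."""
    totals: dict[str, float] = {}
    for day in results_by_time:
        for group in day.get("Groups", []):
            usage_type = group["Keys"][0]
            if "DataTransfer" in usage_type:
                amount = float(group["Metrics"]["UnblendedCost"]["Amount"])
                totals[usage_type] = totals.get(usage_type, 0.0) + amount
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)


def transfer_costs_last_week() -> list[tuple[str, float]]:
    import boto3  # local import: the Cost Explorer API is served from us-east-1
    from datetime import date, timedelta

    ce = boto3.client("ce", region_name="us-east-1")
    end = date.today()
    resp = ce.get_cost_and_usage(
        TimePeriod={"Start": str(end - timedelta(days=7)), "End": str(end)},
        Granularity="DAILY",
        Metrics=["UnblendedCost"],
        GroupBy=[{"Type": "DIMENSION", "Key": "USAGE_TYPE"}],
    )
    return rank_transfer_usage(resp["ResultsByTime"])
```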
Our detailed breakdown of data transfer and egress costs covers every scenario.
4. S3
S3 cost spikes usually come from one of three places: request charges, storage class mismatches, or accidental versioning accumulation.
Request charges hit teams that make millions of small API calls. GET requests cost $0.0004 per 1,000. Sounds cheap until your application makes 500 million GET requests per month. That is $200/month just in request fees.
Storage class mismatches happen when lifecycle policies are not set up. Every object stays in S3 Standard ($0.023/GB) even though 90% of it has not been accessed in months and should be in Intelligent-Tiering or Infrequent Access ($0.0125/GB).
Versioning accumulation is the silent one. S3 versioning keeps every version of every overwritten object. If your application overwrites objects frequently (think log aggregation or cache storage), versions accumulate silently. Your bucket appears to hold 500GB but actually stores 5TB when you include all versions.
Where to look: Enable S3 Storage Lens for a dashboard that shows storage class distribution, request patterns, and cost efficiency metrics across all your buckets.
5. RDS and Database Services
RDS cost spikes happen when Multi-AZ failovers create unexpected charges, when automated backups accumulate beyond the free tier, or when dev/test databases run on production-sized instances.
The most common pattern: someone creates an RDS instance for testing, chooses db.r6g.2xlarge "because the production config uses it," enables Multi-AZ "just in case," and forgets about it. Cost: $1,460/month for a database nobody queries.
Where to look: List all RDS instances and check their CloudWatch metrics. Any instance with average CPU below 10% and fewer than 100 connections per day is a candidate for downsizing or termination.
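That listing exercise is easy to automate. A hedged sketch (function names are mine, and the CPU rule is the rough heuristic above, not an AWS recommendation) that flags RDS instances whose 7-day average CPU is under 10%:

```python
def is_idle(avg_cpu_percent: float) -> bool:
    """Rule of thumb from above: average CPU under 10% flags a downsizing candidate."""
    return avg_cpu_percent < 10.0


def idle_rds_instances(region: str = "us-east-1") -> list[str]:
    """Instance identifiers whose 7-day average CPU marks them as idle candidates."""
    import boto3  # local import keeps is_idle testable without AWS credentials
    from datetime import datetime, timedelta, timezone

    rds = boto3.client("rds", region_name=region)
    cw = boto3.client("cloudwatch", region_name=region)
    now = datetime.now(timezone.utc)
    candidates = []
    for db in rds.describe_db_instances()["DBInstances"]:
        name = db["DBInstanceIdentifier"]
        points = cw.get_metric_statistics(
            Namespace="AWS/RDS",
            MetricName="CPUUtilization",
            Dimensions=[{"Name": "DBInstanceIdentifier", "Value": name}],
            StartTime=now - timedelta(days=7),
            EndTime=now,
            Period=86400,
            Statistics=["Average"],
        )["Datapoints"]
        if points and is_idle(sum(p["Average"] for p in points) / len(points)):
            candidates.append(name)  # cross-check DatabaseConnections before acting
    return candidates
```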
Tactic 1: Set Up AWS Cost Anomaly Detection in 15 Minutes
This is the single most important thing you can do, and it is completely free.
AWS Cost Anomaly Detection uses machine learning to identify unusual spend patterns in your account. It runs continuously and alerts you when something looks off.
Here is exactly how to set it up:
- Go to AWS Cost Management in the console
- Click Cost Anomaly Detection in the left sidebar
- Click Create monitor
- Choose AWS services as the monitor type (this covers all services)
- Set the alert threshold to $100 daily impact (adjust based on your spend)
- Create an SNS topic and subscribe your Slack webhook or team email
- Click Create
That is it. In 15 minutes, you have machine learning monitoring your entire AWS bill and alerting you when something abnormal happens. The model learns your spending patterns over 2 weeks and then starts flagging anomalies.
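If you manage many accounts, the same setup can be done through the API. A sketch of the console steps above (helper names and the $100 default are mine; note that newer SDK versions also accept a `ThresholdExpression` in place of `Threshold`):

```python
def alert_subscription(monitor_arn: str, sns_topic_arn: str, threshold_usd: float) -> dict:
    """Subscription payload: alert immediately via SNS above the dollar threshold."""
    return {
        "SubscriptionName": "cost-spike-alerts",
        "MonitorArnList": [monitor_arn],
        "Subscribers": [{"Type": "SNS", "Address": sns_topic_arn}],
        "Frequency": "IMMEDIATE",
        "Threshold": threshold_usd,
    }


def create_monitor_and_alerts(sns_topic_arn: str, threshold_usd: float = 100.0) -> str:
    """Create an all-services anomaly monitor and wire its alerts to SNS."""
    import boto3  # Cost Explorer APIs are served from us-east-1

    ce = boto3.client("ce", region_name="us-east-1")
    arn = ce.create_anomaly_monitor(
        AnomalyMonitor={
            "MonitorName": "all-services",
            "MonitorType": "DIMENSIONAL",
            "MonitorDimension": "SERVICE",
        }
    )["MonitorArn"]
    ce.create_anomaly_subscription(
        AnomalySubscription=alert_subscription(arn, sns_topic_arn, threshold_usd)
    )
    return arn
```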
Pro tip: Create a second monitor with a lower threshold ($50) for your non-production accounts. Cost spikes in dev and staging often indicate misconfigurations that will eventually hit production.
Tactic 2: Build a 30-Minute Spike Investigation Playbook
When a cost spike alert arrives, you need a repeatable process that gets you to the root cause fast. Not a vague "investigate the spike." A step-by-step playbook that anyone on the team can follow.
Here is the playbook we use:
Minutes 0-5: Scope the Spike
Open Cost Explorer. Set the time range to the last 7 days with daily granularity. Group by Service. Identify which service or services spiked.
Then switch to Linked Account grouping if you use AWS Organizations. This tells you which account the spike occurred in.
Minutes 5-10: Narrow to Region and Usage Type
Within the spiking service, filter by Region. Then group by Usage Type. Usage types are far more specific than service names. For EC2, you will see entries like USW2-BoxUsage:c5.4xlarge, which tells you the exact instance type and region.
Minutes 10-20: Identify the Resource
For EC2 spikes: Go to the EC2 console in the identified region. Sort instances by launch time. Look for instances launched around when the spike started. Check their tags to identify the owner and purpose.
For data transfer spikes: Use AWS Cost and Usage Reports (CUR) with Athena to query transfer details. Filter by line_item_usage_type containing "DataTransfer" and sort by cost.
For S3 spikes: Check S3 Storage Lens or CloudWatch metrics for the buckets in the affected region. Look for spikes in request count or storage size.
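The Athena route in the data transfer step is the least point-and-click of the three, so here is a sketch. It assumes your CUR is already exported and registered as an Athena table with the standard CUR column names; the database, table, and output bucket are placeholders:

```python
def cur_transfer_query(database: str, table: str, start: str, end: str) -> str:
    """CUR SQL: data-transfer line items by resource, costliest first."""
    return f"""
        SELECT line_item_resource_id,
               line_item_usage_type,
               SUM(line_item_unblended_cost) AS cost
        FROM {database}.{table}
        WHERE line_item_usage_type LIKE '%DataTransfer%'
          AND line_item_usage_start_date
              BETWEEN TIMESTAMP '{start}' AND TIMESTAMP '{end}'
        GROUP BY 1, 2
        ORDER BY cost DESC
        LIMIT 25
    """


def run_query(sql: str, output_s3: str) -> str:
    """Kick off the Athena query; returns the execution id to poll for results."""
    import boto3  # local import keeps the SQL builder testable offline

    athena = boto3.client("athena", region_name="us-east-1")
    return athena.start_query_execution(
        QueryString=sql,
        ResultConfiguration={"OutputLocation": output_s3},
    )["QueryExecutionId"]
```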
Minutes 20-30: Confirm Root Cause and Remediate
By now you should know the specific resource, region, and usage type causing the spike. Confirm by checking:
- Recent deployments or Terraform applies in that account
- Auto Scaling Group activity logs
- CloudTrail for API calls that created or modified resources
- CI/CD pipeline runs that might have triggered workloads
Then take immediate action: terminate, downsize, or reconfigure the responsible resource.
The move: Write this playbook into a Confluence page or runbook. Share it with your entire engineering team. When the next spike hits, anyone can investigate it, not just the one person who "knows where to look."
Tactic 3: Implement Granular Resource Tagging (For Real This Time)
You have heard "tag your resources" a thousand times. Here is why most teams still fail at it and what to do differently.
The problem is not that teams do not know about tagging. The problem is that tagging is optional by default. When it is optional, it does not happen consistently. And inconsistent tagging is almost as useless as no tagging.
Make Tagging Mandatory With AWS Service Control Policies
AWS Organizations Service Control Policies (SCPs) can require specific tags on resource creation. If a resource does not have the required tags, the API call fails and the resource is not created.
Here is an example SCP that requires team, environment, and cost-center tags on all EC2 instances:
```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "RequireTagsOnEC2",
      "Effect": "Deny",
      "Action": ["ec2:RunInstances"],
      "Resource": ["arn:aws:ec2:*:*:instance/*"],
      "Condition": {
        "Null": {
          "aws:RequestTag/team": "true",
          "aws:RequestTag/environment": "true",
          "aws:RequestTag/cost-center": "true"
        }
      }
    }
  ]
}
```
Apply this to your development and staging OUs first. Let teams adapt. Then roll it out to production.
The Minimum Viable Tag Set
You do not need 20 tags. You need 5 that are always present:
| Tag Key | Purpose | Example |
|---|---|---|
| team | Who owns this resource | platform, ml-team, backend |
| environment | Lifecycle stage | prod, staging, dev, sandbox |
| cost-center | Financial attribution | eng-platform, eng-ml |
| service | Application or microservice | payment-api, recommendation-engine |
| managed-by | How it was created | terraform, cdk, manual |
The managed-by tag is the one most teams skip, and it is incredibly valuable during spike investigation. If a resource was created manually (not through IaC), it is far more likely to be misconfigured or forgotten.
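An SCP prevents new untagged resources, but it does nothing for resources that already exist. A sketch (helper names are mine) that audits the existing estate against the five-tag set using the Resource Groups Tagging API:

```python
REQUIRED_TAGS = {"team", "environment", "cost-center", "service", "managed-by"}


def missing_tags(tags: dict[str, str]) -> set[str]:
    """Which of the five required tag keys a resource lacks."""
    return REQUIRED_TAGS - set(tags)


def untagged_resources(region: str = "us-east-1") -> dict[str, set[str]]:
    """Map of resource ARN -> missing tag keys for the region."""
    import boto3  # local import so missing_tags stays testable without AWS creds

    tagging = boto3.client("resourcegroupstaggingapi", region_name=region)
    report = {}
    for page in tagging.get_paginator("get_resources").paginate():
        for res in page["ResourceTagMappingList"]:
            tags = {t["Key"]: t["Value"] for t in res.get("Tags", [])}
            gaps = missing_tags(tags)
            if gaps:
                report[res["ResourceARN"]] = gaps
    return report
```

Run it weekly and post the report to the owning teams; the backlog shrinks fast once it is visible.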
Tactic 4: Set Service-Level Budget Alarms (Not Just Account-Level)
Most teams set one budget alarm for their entire AWS account. "Alert me if total spend exceeds $50,000/month." This is nearly useless for catching spikes because your total spend has to increase dramatically before the alarm triggers.
Service-level budgets are far more sensitive. If your EC2 spend is normally $15,000/month and suddenly hits $18,000 by day 15, that 20% increase is a clear signal. But if your total account spend is $80,000 and increases to $83,000, that 3.75% change might not even trigger your account-level alarm.
How to Set It Up
Create AWS Budgets for your top 5 services by cost:
- EC2 (including EBS and data transfer)
- RDS (including backup storage)
- S3 (including request fees)
- Lambda (including provisioned concurrency)
- Data Transfer (cross-AZ, cross-region, internet)
For each budget, set three alert thresholds:
- 50% of monthly budget reached by the 15th of the month (early warning)
- 75% of monthly budget reached by the 22nd (escalation)
- 100% of monthly budget reached at any point (immediate action)
Also set daily budgets for services with high spike risk. A daily EC2 budget of $600 that triggers at 120% ($720) will catch a scaling misconfiguration on the same day it happens.
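The daily EC2 budget above can be created programmatically. A sketch using the AWS Budgets API (the budget name, email address, and $600 limit are illustrative):

```python
def daily_budget(name: str, limit_usd: str, service_filter: str) -> dict:
    """AWS Budgets payload for a daily cost budget scoped to one service."""
    return {
        "BudgetName": name,
        "BudgetLimit": {"Amount": limit_usd, "Unit": "USD"},
        "TimeUnit": "DAILY",
        "BudgetType": "COST",
        "CostFilters": {"Service": [service_filter]},
    }


def create_daily_ec2_budget(account_id: str, email: str) -> None:
    """Daily EC2 budget that alerts at 120% of the limit ($720 on a $600 budget)."""
    import boto3  # the Budgets API endpoint lives in us-east-1

    budgets = boto3.client("budgets", region_name="us-east-1")
    budgets.create_budget(
        AccountId=account_id,
        Budget=daily_budget("ec2-daily", "600", "Amazon Elastic Compute Cloud - Compute"),
        NotificationsWithSubscribers=[{
            "Notification": {
                "NotificationType": "ACTUAL",
                "ComparisonOperator": "GREATER_THAN",
                "Threshold": 120,  # percent of the budget limit
                "ThresholdType": "PERCENTAGE",
            },
            "Subscribers": [{"SubscriptionType": "EMAIL", "Address": email}],
        }],
    )
```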
Tactic 5: Use Cost Allocation Reports to Connect Spikes to Deployments
Here is a tactic that most teams overlook entirely, and it is one of the most powerful for preventing recurring spikes.
AWS Cost and Usage Reports (CUR) contain line-item detail for every charge in your account. When you load CUR data into Athena or a data warehouse, you can query it with SQL to answer specific questions like:
- "Which EC2 instances launched between March 10 and March 12 cost more than $100?"
- "What was the total data transfer cost from the payment-api service last week?"
- "Which S3 buckets had more than 10 million requests this month?"
But the real power comes when you correlate CUR data with your deployment history.
The Deployment-Cost Correlation
Export your CI/CD deployment timestamps (from GitHub Actions, Jenkins, or whatever you use). Join them with CUR data by time range and tagged service name. This gives you a view that shows exactly which deployments caused cost increases.
When you can say "the deployment of payment-api v2.3.1 on March 11 at 2:14pm increased daily EC2 costs by $340 due to a new autoscaling policy," you have transformed cost management from reactive to preventive. The team that deployed that change can review the scaling configuration and fix it before the next release.
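The correlation itself is just a time-window join. A minimal pandas sketch (column names and the 3-day windows are my assumptions, not a fixed schema) that compares each service's average daily cost before and after a deployment:

```python
import pandas as pd


def correlate(deploys: pd.DataFrame, costs: pd.DataFrame) -> pd.DataFrame:
    """Average daily cost in the 3 days after each deploy vs the 3 days before.

    deploys: columns [service, deployed_at]; costs: columns [service, date, cost].
    """
    rows = []
    for _, d in deploys.iterrows():
        svc = costs[costs["service"] == d["service"]]
        before = svc[(svc["date"] < d["deployed_at"]) &
                     (svc["date"] >= d["deployed_at"] - pd.Timedelta(days=3))]["cost"].mean()
        after = svc[(svc["date"] >= d["deployed_at"]) &
                    (svc["date"] < d["deployed_at"] + pd.Timedelta(days=3))]["cost"].mean()
        rows.append({
            "service": d["service"],
            "deployed_at": d["deployed_at"],
            "daily_cost_before": before,
            "daily_cost_after": after,
            "daily_delta": after - before,
        })
    return pd.DataFrame(rows)
```

Feed it deployment timestamps from your CI/CD system and per-service daily costs from CUR, and the output is exactly the "this deploy added $340/day" view described above.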
The move: Set up CUR export to S3 and create an Athena table. Even if you do not build the full correlation pipeline immediately, having the data available means you can query it ad-hoc during your next spike investigation.
Tactic 6: Automate the Fixes That Humans Keep Forgetting
Every spike investigation ends with someone saying "we should set up automation for that." And then they do not, because there is always something more urgent. Until the same spike happens again next month.
Here are the four automations that prevent the most common AWS cost spikes:
Auto-Stop Dev Environments
Use AWS Instance Scheduler to stop all dev and staging EC2 and RDS instances outside business hours. This prevents the "dev instance running all weekend" spike that hits almost every team.
Estimated savings: 65% of non-production compute costs.
Auto-Delete Orphaned Resources
Write a Lambda function triggered by a daily EventBridge schedule that finds and deletes:
- Unattached EBS volumes older than 7 days
- Unused Elastic IPs
- Empty load balancers with zero healthy targets
- Snapshots older than 90 days without a do-not-delete tag
Estimated savings: 5% to 10% of total spend from resource cleanup alone.
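The unattached-volume case is the simplest to automate. A minimal Lambda sketch (the handler and helper names are mine; the 7-day grace period matches the list above):

```python
from datetime import datetime, timedelta, timezone


def is_orphaned(volume: dict, now: datetime, min_age_days: int = 7) -> bool:
    """Unattached (status 'available') and older than the grace period."""
    return (volume["State"] == "available"
            and now - volume["CreateTime"] > timedelta(days=min_age_days))


def handler(event, context):
    """Daily EventBridge-triggered cleanup of unattached EBS volumes."""
    import boto3  # local import keeps is_orphaned testable without AWS creds

    ec2 = boto3.client("ec2")
    now = datetime.now(timezone.utc)
    vols = ec2.describe_volumes(
        Filters=[{"Name": "status", "Values": ["available"]}]
    )["Volumes"]
    for v in vols:
        if is_orphaned(v, now):
            ec2.delete_volume(VolumeId=v["VolumeId"])
```

Consider logging or snapshotting each volume before deletion during the first few weeks, until you trust the filter.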
Auto-Enforce S3 Intelligent-Tiering
Set a default bucket policy that applies S3 Intelligent-Tiering to all new objects. This moves data between access tiers automatically based on usage patterns, eliminating storage class mismatches without any manual intervention.
Auto-Alert on NAT Gateway Traffic Spikes
Create a CloudWatch alarm on the BytesOutToDestination metric for each NAT Gateway. Set the threshold at 150% of your 7-day average. When triggered, the alarm posts to Slack with the NAT Gateway ID and the estimated daily cost, giving your team immediate visibility into data transfer spikes.
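A sketch of that alarm in boto3 (the alarm name pattern and helper names are mine; the Slack delivery is assumed to be an SNS topic with a Slack-bound subscription):

```python
def spike_threshold(seven_day_total_bytes: float) -> float:
    """150% of the average hourly bytes over the trailing 7 days."""
    return seven_day_total_bytes / (7 * 24) * 1.5


def create_nat_alarm(nat_gateway_id: str, threshold_bytes: float, sns_topic_arn: str) -> None:
    """Hourly alarm on a NAT gateway's outbound bytes, notifying via SNS."""
    import boto3  # local import keeps spike_threshold testable without AWS creds

    cw = boto3.client("cloudwatch")
    cw.put_metric_alarm(
        AlarmName=f"nat-spike-{nat_gateway_id}",
        Namespace="AWS/NATGateway",
        MetricName="BytesOutToDestination",
        Dimensions=[{"Name": "NatGatewayId", "Value": nat_gateway_id}],
        Statistic="Sum",
        Period=3600,  # hourly sums
        EvaluationPeriods=1,
        Threshold=threshold_bytes,
        ComparisonOperator="GreaterThanThreshold",
        AlarmActions=[sns_topic_arn],  # SNS topic wired to Slack
    )
```

Recompute the threshold weekly (a scheduled Lambda works) so the alarm tracks your actual baseline instead of a stale one.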
For more on building automated cloud operations, explore our Automated Cloud Operations service.
Tactic 7: Run a Monthly "Cost Spike Retrospective"
Prevention is better than detection. And the best way to prevent future spikes is to learn from past ones.
At the end of each month, run a 30-minute retrospective that reviews every cost anomaly from the previous month. For each anomaly, document:
- What spiked: Service, region, account, and approximate cost impact
- When it started: Date and time the anomaly began
- How long it ran: Hours or days before detection and remediation
- Root cause: What specifically caused the cost increase
- Detection method: How it was discovered (alert, manual review, finance report)
- Time to resolution: How long from detection to remediation
- Prevention action: What automation, policy, or process change will prevent recurrence
After three months of retrospectives, you will have a clear picture of your most common spike patterns. Most teams discover that 80% of their spikes come from just 2 to 3 root causes. Fix those root causes, and your monthly bill becomes dramatically more predictable.
The move: Schedule the first retrospective for the end of this month. Invite engineering, ops, and one finance partner. Make it a standing meeting. The cost insights compound over time.
For expert help building your entire FinOps practice, explore our Cloud Cost Optimization and FinOps service.
The Quick Reference: AWS Cost Spike Investigation Cheat Sheet
Bookmark this for your next spike:
| Symptom | First Place to Look | Common Cause | Quick Fix |
|---|---|---|---|
| EC2 cost spike | ASG activity history | Scaling misconfiguration | Adjust min/max capacity, fix scaling metric |
| Data transfer spike | VPC Flow Logs, NAT Gateway metrics | Cross-AZ traffic, missing VPC endpoints | Add VPC endpoints, optimize service placement |
| S3 cost spike | S3 Storage Lens, CloudWatch requests metric | Request volume, versioning bloat | Enable Intelligent-Tiering, set version lifecycle |
| RDS cost spike | RDS instance list, CloudWatch CPU | Oversized dev/test instances | Downsize or terminate idle instances |
| Lambda cost spike | Lambda metrics dashboard | Recursive invocations, high memory config | Fix recursion, right-size memory allocation |
| NAT Gateway spike | NAT Gateway CloudWatch metrics | Application routing through NAT unnecessarily | Create VPC endpoints for S3, DynamoDB, etc. |
Your AWS Bill Does Not Have to Be a Surprise
Here is the thing about AWS cost spikes. They feel random but they are not. Every spike has a cause. Every cause has a pattern. And every pattern has a prevention.
The teams that get control of their AWS costs are not the ones with the biggest budgets or the most sophisticated tools. They are the ones who built a system: anomaly detection that catches problems in hours, a playbook that traces root causes in minutes, tagging that makes attribution instant, and automation that prevents the most common mistakes from recurring.
You can build that system in a month. Start with anomaly detection this week. Add the investigation playbook next week. Enforce tagging the week after. Set up one automation the week after that. Four weeks from now, your AWS bill will be the most predictable number in your entire business.
Want to find all the waste hiding in your AWS account right now? Take our free Cloud Waste and Risk Scorecard for a personalized assessment in under 5 minutes.
Related reading:
- Why Is My AWS Bill Suddenly So High? The Complete Technical Playbook
- The Hidden AWS Bill: NAT Gateways and AI Workloads
- The Real Cost of Data Transfer: NAT Gateways, Egress Fees, and Hidden Bill Killers
- Real-Time Cloud Cost Optimization: 7 Strategies to Prevent Spend Spikes
- AWS Sprawl Is Silently Eating 30-40% of Your Cloud Budget