Someone on Your Team Just Burned $12,000 in AWS and Has No Idea
You open your email Monday morning. There is an AWS billing alert. Your weekly spend jumped $12,000 above normal. Finance is already asking questions. Your CTO wants answers by noon.
You log into Cost Explorer. You see the spike. It started Wednesday afternoon. But Cost Explorer shows you totals by service, not causes. EC2 is up. Data transfer is up. S3 is up. Everything is up. You start clicking through accounts, regions, and services, trying to piece together what happened.
Three hours later, you have narrowed it down to "something in us-east-1 related to EC2 and NAT gateway." That is not an answer. That is a geography lesson.
This is the reality for most AWS teams. They can see that costs spiked. They cannot quickly trace the spike to a specific resource, deployment, or team. And while they are investigating, the spike is still running, burning another $1,700 per day.
I have spent years helping teams build systems that prevent this exact scenario. This post gives you the 7 tactics that let you trace any AWS cost spike to its root cause in under 30 minutes and, more importantly, prevent most spikes from happening at all.
The 5 AWS Services That Cause 80% of Cost Spikes
Before we get into detection and prevention, you need to know where to look first. In our experience, five AWS services are responsible for roughly 80% of all unexpected cost increases. When your bill spikes, check these first and in this order:
1. EC2 (Including Auto Scaling Groups)
EC2 is the number one cause of cost spikes for one simple reason: auto scaling groups can launch instances faster than humans can notice.
A misconfigured scaling policy that sets the minimum capacity too high, responds to the wrong metric, or fails to scale down after a traffic spike can leave dozens of expensive instances running indefinitely. One team we know had a scaling policy triggered by CPU utilization on a single instance. When that instance had a temporary CPU spike from a cron job, the ASG launched 40 additional instances. Nobody noticed for 5 days. Cost: $8,400.
Where to look: Check Auto Scaling Group activity history. Look for scaling events that were not matched by corresponding scale-down events. Check for instances in "running" state that were launched by ASGs but are not receiving traffic.
2. NAT Gateway
NAT Gateway is the stealth killer of AWS budgets. It charges $0.045/hour per gateway (about $32/month just for existing) plus $0.045 for every gigabyte of data processed through it.
Most teams have no idea how much data flows through their NAT gateways. A single misconfigured application pulling data from the internet through NAT instead of using a VPC endpoint can process terabytes of data per month. The NAT Gateway processing charge alone on 10TB is $450.
Where to look: In Cost Explorer, filter by "NAT Gateway" under EC2-Other. If this number is surprising, use VPC Flow Logs to identify which instances are sending traffic through the NAT gateway and how much.
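Before digging into Flow Logs, you can get a quick cost estimate from CloudWatch. A sketch (assuming the us-east-1 processing price; the function names are mine) that sums a gateway's outbound bytes for the last day and converts them to dollars:

```python
from datetime import datetime, timedelta, timezone

NAT_PROCESSING_PER_GB = 0.045  # us-east-1 price; check your region


def nat_processing_cost(bytes_processed: float) -> float:
    """Estimated NAT Gateway data-processing charge for a given byte count."""
    return round(bytes_processed / 1024**3 * NAT_PROCESSING_PER_GB, 2)


def daily_nat_bytes(nat_gateway_id: str, region: str = "us-east-1") -> float:
    """Sum of bytes the NAT gateway sent out over the last 24 hours."""
    import boto3  # local import keeps the cost helper testable without AWS creds

    cw = boto3.client("cloudwatch", region_name=region)
    now = datetime.now(timezone.utc)
    resp = cw.get_metric_statistics(
        Namespace="AWS/NATGateway",
        MetricName="BytesOutToDestination",
        Dimensions=[{"Name": "NatGatewayId", "Value": nat_gateway_id}],
        StartTime=now - timedelta(days=1),
        EndTime=now,
        Period=86400,
        Statistics=["Sum"],
    )
    return sum(p["Sum"] for p in resp["Datapoints"])
```

Multiply the daily figure by 30 and you know whether this gateway is worth a VPC endpoint project before you ever open a Flow Log.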
For a deep dive into this specific problem, read our guide on the hidden AWS bill: NAT gateways and AI workloads.
3. Data Transfer
AWS data transfer pricing is genuinely confusing, and that confusion costs teams thousands.
Here are the rules most people get wrong:
- Data transfer into AWS is free
- Data transfer between Availability Zones in the same region costs $0.01/GB in each direction, so each gigabyte crossing an AZ boundary effectively costs $0.02
- Data transfer between regions costs $0.02/GB
- Data transfer to the internet costs $0.09/GB (first 10TB), then $0.085/GB
- Data transfer between services in the same AZ using private IPs is free, but using public IPs costs $0.01/GB
That cross-AZ charge is the one that kills microservices architectures. If you have 20 services communicating across AZs and each sends 1GB/hour to the others, that is roughly 400GB/hour x $0.02 = $8/hour, or about $5,760/month just in inter-service communication.
Where to look: In Cost Explorer, group by "Usage Type" and filter for anything containing "DataTransfer." Sort by cost. The top entries will tell you exactly where your transfer costs are concentrated.
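The same grouping works through the Cost Explorer API if you prefer a script to the console. A sketch (helper names are mine; boto3's `get_cost_and_usage` is the real call) that ranks DataTransfer usage types over the last week:

```python
def rank_transfer_usage(results_by_time: list[dict]) -> list[tuple[str, float]]:
    """Aggregate Cost Explorer groups, keep DataTransfer usage types, costliest first."""
    totals: dict[str, float] = {}
    for day in results_by_time:
        for group in day.get("Groups", []):
            usage_type = group["Keys"][0]
            if "DataTransfer" in usage_type:
                amount = float(group["Metrics"]["UnblendedCost"]["Amount"])
                totals[usage_type] = totals.get(usage_type, 0.0) + amount
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)


def transfer_costs_last_week() -> list[tuple[str, float]]:
    import boto3  # local import: the Cost Explorer API is served from us-east-1
    from datetime import date, timedelta

    ce = boto3.client("ce", region_name="us-east-1")
    end = date.today()
    resp = ce.get_cost_and_usage(
        TimePeriod={"Start": str(end - timedelta(days=7)), "End": str(end)},
        Granularity="DAILY",
        Metrics=["UnblendedCost"],
        GroupBy=[{"Type": "DIMENSION", "Key": "USAGE_TYPE"}],
    )
    return rank_transfer_usage(resp["ResultsByTime"])
```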
Our detailed breakdown of data transfer and egress costs covers every scenario.
4. S3
S3 cost spikes usually come from one of three places: request charges, storage class mismatches, or accidental versioning accumulation.
Request charges hit teams that make millions of small API calls. GET requests cost $0.0004 per 1,000. Sounds cheap until your application makes 500 million GET requests per month. That is $200/month just in request fees.
Storage class mismatches happen when lifecycle policies are not set up. Every object stays in S3 Standard ($0.023/GB) even though 90% of it has not been accessed in months and should be in Intelligent-Tiering or Infrequent Access ($0.0125/GB).
Versioning accumulation is the silent one. S3 versioning keeps every version of every overwritten object. If your application overwrites objects frequently (think log aggregation or cache storage), versions accumulate silently. Your bucket appears to hold 500GB but actually stores 5TB when you include all versions.
Where to look: Enable S3 Storage Lens for a dashboard that shows storage class distribution, request patterns, and cost efficiency metrics across all your buckets.
5. RDS and Database Services
RDS cost spikes happen when Multi-AZ failovers create unexpected charges, when automated backups accumulate beyond the free tier, or when dev/test databases run on production-sized instances.
The most common pattern: someone creates an RDS instance for testing, chooses db.r6g.2xlarge "because the production config uses it," enables Multi-AZ "just in case," and forgets about it. Cost: $1,460/month for a database nobody queries.
Where to look: List all RDS instances and check their CloudWatch metrics. Any instance with average CPU below 10% and fewer than 100 connections per day is a candidate for downsizing or termination.
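That listing exercise is easy to automate. A hedged sketch (function names are mine, and the CPU rule is the rough heuristic above, not an AWS recommendation) that flags RDS instances whose 7-day average CPU is under 10%:

```python
def is_idle(avg_cpu_percent: float) -> bool:
    """Rule of thumb from above: average CPU under 10% flags a downsizing candidate."""
    return avg_cpu_percent < 10.0


def idle_rds_instances(region: str = "us-east-1") -> list[str]:
    """Instance identifiers whose 7-day average CPU marks them as idle candidates."""
    import boto3  # local import keeps is_idle testable without AWS credentials
    from datetime import datetime, timedelta, timezone

    rds = boto3.client("rds", region_name=region)
    cw = boto3.client("cloudwatch", region_name=region)
    now = datetime.now(timezone.utc)
    candidates = []
    for db in rds.describe_db_instances()["DBInstances"]:
        name = db["DBInstanceIdentifier"]
        points = cw.get_metric_statistics(
            Namespace="AWS/RDS",
            MetricName="CPUUtilization",
            Dimensions=[{"Name": "DBInstanceIdentifier", "Value": name}],
            StartTime=now - timedelta(days=7),
            EndTime=now,
            Period=86400,
            Statistics=["Average"],
        )["Datapoints"]
        if points and is_idle(sum(p["Average"] for p in points) / len(points)):
            candidates.append(name)  # cross-check DatabaseConnections before acting
    return candidates
```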
Tactic 1: Set Up AWS Cost Anomaly Detection in 15 Minutes
This is the single most important thing you can do, and it is completely free.
AWS Cost Anomaly Detection uses machine learning to identify unusual spend patterns in your account. It runs continuously and alerts you when something looks off.
Here is exactly how to set it up:
- Go to AWS Cost Management in the console
- Click Cost Anomaly Detection in the left sidebar
- Click Create monitor
- Choose AWS services as the monitor type (this covers all services)
- Set the alert threshold to $100 daily impact (adjust based on your spend)
- Create an SNS topic and subscribe your Slack webhook or team email
- Click Create
That is it. In 15 minutes, you have machine learning monitoring your entire AWS bill and alerting you when something abnormal happens. The model learns your spending patterns over 2 weeks and then starts flagging anomalies.
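If you manage many accounts, the same setup can be done through the API. A sketch of the console steps above (helper names and the $100 default are mine; note that newer SDK versions also accept a `ThresholdExpression` in place of `Threshold`):

```python
def alert_subscription(monitor_arn: str, sns_topic_arn: str, threshold_usd: float) -> dict:
    """Subscription payload: alert immediately via SNS above the dollar threshold."""
    return {
        "SubscriptionName": "cost-spike-alerts",
        "MonitorArnList": [monitor_arn],
        "Subscribers": [{"Type": "SNS", "Address": sns_topic_arn}],
        "Frequency": "IMMEDIATE",
        "Threshold": threshold_usd,
    }


def create_monitor_and_alerts(sns_topic_arn: str, threshold_usd: float = 100.0) -> str:
    """Create an all-services anomaly monitor and wire its alerts to SNS."""
    import boto3  # Cost Explorer APIs are served from us-east-1

    ce = boto3.client("ce", region_name="us-east-1")
    arn = ce.create_anomaly_monitor(
        AnomalyMonitor={
            "MonitorName": "all-services",
            "MonitorType": "DIMENSIONAL",
            "MonitorDimension": "SERVICE",
        }
    )["MonitorArn"]
    ce.create_anomaly_subscription(
        AnomalySubscription=alert_subscription(arn, sns_topic_arn, threshold_usd)
    )
    return arn
```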
Pro tip: Create a second monitor with a lower threshold ($50) for your non-production accounts. Cost spikes in dev and staging often indicate misconfigurations that will eventually hit production.
Tactic 2: Build a 30-Minute Spike Investigation Playbook
When a cost spike alert arrives, you need a repeatable process that gets you to the root cause fast. Not a vague "investigate the spike." A step-by-step playbook that anyone on the team can follow.
Here is the playbook we use:
Minutes 0-5: Scope the Spike
Open Cost Explorer. Set the time range to the last 7 days with daily granularity. Group by Service. Identify which service or services spiked.
Then switch to Linked Account grouping if you use AWS Organizations. This tells you which account the spike occurred in.
Minutes 5-10: Narrow to Region and Usage Type
Within the spiking service, filter by Region. Then group by Usage Type. Usage types are far more specific than service names. For EC2, you will see entries like USW2-BoxUsage:c5.4xlarge, which tells you the exact instance type and region.
Minutes 10-20: Identify the Resource
For EC2 spikes: Go to the EC2 console in the identified region. Sort instances by launch time. Look for instances launched around when the spike started. Check their tags to identify the owner and purpose.
For data transfer spikes: Use AWS Cost and Usage Reports (CUR) with Athena to query transfer details. Filter by line_item_usage_type containing "DataTransfer" and sort by cost.
For S3 spikes: Check S3 Storage Lens or CloudWatch metrics for the buckets in the affected region. Look for spikes in request count or storage size.
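The Athena route in the data transfer step is the least point-and-click of the three, so here is a sketch. It assumes your CUR is already exported and registered as an Athena table with the standard CUR column names; the database, table, and output bucket are placeholders:

```python
def cur_transfer_query(database: str, table: str, start: str, end: str) -> str:
    """CUR SQL: data-transfer line items by resource, costliest first."""
    return f"""
        SELECT line_item_resource_id,
               line_item_usage_type,
               SUM(line_item_unblended_cost) AS cost
        FROM {database}.{table}
        WHERE line_item_usage_type LIKE '%DataTransfer%'
          AND line_item_usage_start_date
              BETWEEN TIMESTAMP '{start}' AND TIMESTAMP '{end}'
        GROUP BY 1, 2
        ORDER BY cost DESC
        LIMIT 25
    """


def run_query(sql: str, output_s3: str) -> str:
    """Kick off the Athena query; returns the execution id to poll for results."""
    import boto3  # local import keeps the SQL builder testable offline

    athena = boto3.client("athena", region_name="us-east-1")
    return athena.start_query_execution(
        QueryString=sql,
        ResultConfiguration={"OutputLocation": output_s3},
    )["QueryExecutionId"]
```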
Minutes 20-30: Confirm Root Cause and Remediate
By now you should know the specific resource, region, and usage type causing the spike. Confirm by checking:
- Recent deployments or Terraform applies in that account
- Auto Scaling Group activity logs
- CloudTrail for API calls that created or modified resources
- CI/CD pipeline runs that might have triggered workloads
Then take immediate action: terminate, downsize, or reconfigure the responsible resource.
The move: Write this playbook into a Confluence page or runbook. Share it with your entire engineering team. When the next spike hits, anyone can investigate it, not just the one person who "knows where to look."
Tactic 3: Implement Granular Resource Tagging (For Real This Time)
You have heard "tag your resources" a thousand times. Here is why most teams still fail at it and what to do differently.
The problem is not that teams do not know about tagging. The problem is that tagging is optional by default. When it is optional, it does not happen consistently. And inconsistent tagging is almost as useless as no tagging.
Make Tagging Mandatory With AWS Service Control Policies
AWS Organizations Service Control Policies (SCPs) can require specific tags on resource creation. If a resource does not have the required tags, the API call fails and the resource is not created.
Here is an example SCP that requires team, environment, and cost-center tags on all EC2 instances:
```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "RequireTagsOnEC2",
      "Effect": "Deny",
      "Action": ["ec2:RunInstances"],
      "Resource": ["arn:aws:ec2:*:*:instance/*"],
      "Condition": {
        "Null": {
          "aws:RequestTag/team": "true",
          "aws:RequestTag/environment": "true",
          "aws:RequestTag/cost-center": "true"
        }
      }
    }
  ]
}
```
Apply this to your development and staging OUs first. Let teams adapt. Then roll it out to production.
The Minimum Viable Tag Set
You do not need 20 tags. You need 5 that are always present:
| Tag Key | Purpose | Example |
|---|---|---|
| team | Who owns this resource | platform, ml-team, backend |
| environment | Lifecycle stage | prod, staging, dev, sandbox |
| cost-center | Financial attribution | eng-platform, eng-ml |
| service | Application or microservice | payment-api, recommendation-engine |
| managed-by | How it was created | terraform, cdk, manual |
The managed-by tag is the one most teams skip, and it is incredibly valuable during spike investigation. If a resource was created manually (not through IaC), it is far more likely to be misconfigured or forgotten.
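An SCP prevents new untagged resources, but it does nothing for resources that already exist. A sketch (helper names are mine) that audits the existing estate against the five-tag set using the Resource Groups Tagging API:

```python
REQUIRED_TAGS = {"team", "environment", "cost-center", "service", "managed-by"}


def missing_tags(tags: dict[str, str]) -> set[str]:
    """Which of the five required tag keys a resource lacks."""
    return REQUIRED_TAGS - set(tags)


def untagged_resources(region: str = "us-east-1") -> dict[str, set[str]]:
    """Map of resource ARN -> missing tag keys for the region."""
    import boto3  # local import so missing_tags stays testable without AWS creds

    tagging = boto3.client("resourcegroupstaggingapi", region_name=region)
    report = {}
    for page in tagging.get_paginator("get_resources").paginate():
        for res in page["ResourceTagMappingList"]:
            tags = {t["Key"]: t["Value"] for t in res.get("Tags", [])}
            gaps = missing_tags(tags)
            if gaps:
                report[res["ResourceARN"]] = gaps
    return report
```

Run it weekly and post the report to the owning teams; the backlog shrinks fast once it is visible.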
Tactic 4: Set Service-Level Budget Alarms (Not Just Account-Level)
Most teams set one budget alarm for their entire AWS account. "Alert me if total spend exceeds $50,000/month." This is nearly useless for catching spikes because your total spend has to increase dramatically before the alarm triggers.
Service-level budgets are far more sensitive. If your EC2 spend is normally $15,000/month and suddenly hits $18,000 by day 15, that 20% increase is a clear signal. But if your total account spend is $80,000 and increases to $83,000, that 3.75% change might not even trigger your account-level alarm.
How to Set It Up
Create AWS Budgets for your top 5 services by cost:
- EC2 (including EBS and data transfer)
- RDS (including backup storage)
- S3 (including request fees)
- Lambda (including provisioned concurrency)
- Data Transfer (cross-AZ, cross-region, internet)
For each budget, set three alert thresholds:
- 50% of monthly budget reached by the 15th of the month (early warning)
- 75% of monthly budget reached by the 22nd (escalation)
- 100% of monthly budget reached at any point (immediate action)
Also set daily budgets for services with high spike risk. A daily EC2 budget of $600 that triggers at 120% ($720) will catch a scaling misconfiguration on the same day it happens.
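The daily EC2 budget above can be created programmatically. A sketch using the AWS Budgets API (the budget name, email address, and $600 limit are illustrative):

```python
def daily_budget(name: str, limit_usd: str, service_filter: str) -> dict:
    """AWS Budgets payload for a daily cost budget scoped to one service."""
    return {
        "BudgetName": name,
        "BudgetLimit": {"Amount": limit_usd, "Unit": "USD"},
        "TimeUnit": "DAILY",
        "BudgetType": "COST",
        "CostFilters": {"Service": [service_filter]},
    }


def create_daily_ec2_budget(account_id: str, email: str) -> None:
    """Daily EC2 budget that alerts at 120% of the limit ($720 on a $600 budget)."""
    import boto3  # the Budgets API endpoint lives in us-east-1

    budgets = boto3.client("budgets", region_name="us-east-1")
    budgets.create_budget(
        AccountId=account_id,
        Budget=daily_budget("ec2-daily", "600", "Amazon Elastic Compute Cloud - Compute"),
        NotificationsWithSubscribers=[{
            "Notification": {
                "NotificationType": "ACTUAL",
                "ComparisonOperator": "GREATER_THAN",
                "Threshold": 120,  # percent of the budget limit
                "ThresholdType": "PERCENTAGE",
            },
            "Subscribers": [{"SubscriptionType": "EMAIL", "Address": email}],
        }],
    )
```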
Tactic 5: Use Cost Allocation Reports to Connect Spikes to Deployments
Here is a tactic that most teams overlook entirely, and it is one of the most powerful for preventing recurring spikes.
AWS Cost and Usage Reports (CUR) contain line-item detail for every charge in your account. When you load CUR data into Athena or a data warehouse, you can query it with SQL to answer specific questions like:
- "Which EC2 instances launched between March 10 and March 12 cost more than $100?"
- "What was the total data transfer cost from the payment-api service last week?"
- "Which S3 buckets had more than 10 million requests this month?"
But the real power comes when you correlate CUR data with your deployment history.
The Deployment-Cost Correlation
Export your CI/CD deployment timestamps (from GitHub Actions, Jenkins, or whatever you use). Join them with CUR data by time range and tagged service name. This gives you a view that shows exactly which deployments caused cost increases.
When you can say "the deployment of payment-api v2.3.1 on March 11 at 2:14pm increased daily EC2 costs by $340 due to a new autoscaling policy," you have transformed cost management from reactive to preventive. The team that deployed that change can review the scaling configuration and fix it before the next release.
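The correlation itself is just a time-window join. A minimal pandas sketch (column names and the 3-day windows are my assumptions, not a fixed schema) that compares each service's average daily cost before and after a deployment:

```python
import pandas as pd


def correlate(deploys: pd.DataFrame, costs: pd.DataFrame) -> pd.DataFrame:
    """Average daily cost in the 3 days after each deploy vs the 3 days before.

    deploys: columns [service, deployed_at]; costs: columns [service, date, cost].
    """
    rows = []
    for _, d in deploys.iterrows():
        svc = costs[costs["service"] == d["service"]]
        before = svc[(svc["date"] < d["deployed_at"]) &
                     (svc["date"] >= d["deployed_at"] - pd.Timedelta(days=3))]["cost"].mean()
        after = svc[(svc["date"] >= d["deployed_at"]) &
                    (svc["date"] < d["deployed_at"] + pd.Timedelta(days=3))]["cost"].mean()
        rows.append({
            "service": d["service"],
            "deployed_at": d["deployed_at"],
            "daily_cost_before": before,
            "daily_cost_after": after,
            "daily_delta": after - before,
        })
    return pd.DataFrame(rows)
```

Feed it deployment timestamps from your CI/CD system and per-service daily costs from CUR, and the output is exactly the "this deploy added $340/day" view described above.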
The move: Set up CUR export to S3 and create an Athena table. Even if you do not build the full correlation pipeline immediately, having the data available means you can query it ad-hoc during your next spike investigation.
Tactic 6: Automate the Fixes That Humans Keep Forgetting
Every spike investigation ends with someone saying "we should set up automation for that." And then they do not, because there is always something more urgent. Until the same spike happens again next month.
Here are the four automations that prevent the most common AWS cost spikes:
Auto-Stop Dev Environments
Use AWS Instance Scheduler to stop all dev and staging EC2 and RDS instances outside business hours. This prevents the "dev instance running all weekend" spike that hits almost every team.
Estimated savings: 65% of non-production compute costs.
Auto-Delete Orphaned Resources
Write a Lambda function triggered by a daily EventBridge schedule that finds and deletes:
- Unattached EBS volumes older than 7 days
- Unused Elastic IPs
- Empty load balancers with zero healthy targets
- Snapshots older than 90 days without a do-not-delete tag
Estimated savings: 5% to 10% of total spend from resource cleanup alone.
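The unattached-volume case is the simplest to automate. A minimal Lambda sketch (the handler and helper names are mine; the 7-day grace period matches the list above):

```python
from datetime import datetime, timedelta, timezone


def is_orphaned(volume: dict, now: datetime, min_age_days: int = 7) -> bool:
    """Unattached (status 'available') and older than the grace period."""
    return (volume["State"] == "available"
            and now - volume["CreateTime"] > timedelta(days=min_age_days))


def handler(event, context):
    """Daily EventBridge-triggered cleanup of unattached EBS volumes."""
    import boto3  # local import keeps is_orphaned testable without AWS creds

    ec2 = boto3.client("ec2")
    now = datetime.now(timezone.utc)
    vols = ec2.describe_volumes(
        Filters=[{"Name": "status", "Values": ["available"]}]
    )["Volumes"]
    for v in vols:
        if is_orphaned(v, now):
            ec2.delete_volume(VolumeId=v["VolumeId"])
```

Consider logging or snapshotting each volume before deletion during the first few weeks, until you trust the filter.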
Auto-Enforce S3 Intelligent-Tiering
Set a default bucket policy that applies S3 Intelligent-Tiering to all new objects. This moves data between access tiers automatically based on usage patterns, eliminating storage class mismatches without any manual intervention.
Auto-Alert on NAT Gateway Traffic Spikes
Create a CloudWatch alarm on the BytesOutToDestination metric for each NAT Gateway. Set the threshold at 150% of your 7-day average. When triggered, the alarm posts to Slack with the NAT Gateway ID and the estimated daily cost, giving your team immediate visibility into data transfer spikes.
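A sketch of that alarm in boto3 (the alarm name pattern and helper names are mine; the Slack delivery is assumed to be an SNS topic with a Slack-bound subscription):

```python
def spike_threshold(seven_day_total_bytes: float) -> float:
    """150% of the average hourly bytes over the trailing 7 days."""
    return seven_day_total_bytes / (7 * 24) * 1.5


def create_nat_alarm(nat_gateway_id: str, threshold_bytes: float, sns_topic_arn: str) -> None:
    """Hourly alarm on a NAT gateway's outbound bytes, notifying via SNS."""
    import boto3  # local import keeps spike_threshold testable without AWS creds

    cw = boto3.client("cloudwatch")
    cw.put_metric_alarm(
        AlarmName=f"nat-spike-{nat_gateway_id}",
        Namespace="AWS/NATGateway",
        MetricName="BytesOutToDestination",
        Dimensions=[{"Name": "NatGatewayId", "Value": nat_gateway_id}],
        Statistic="Sum",
        Period=3600,  # hourly sums
        EvaluationPeriods=1,
        Threshold=threshold_bytes,
        ComparisonOperator="GreaterThanThreshold",
        AlarmActions=[sns_topic_arn],  # SNS topic wired to Slack
    )
```

Recompute the threshold weekly (a scheduled Lambda works) so the alarm tracks your actual baseline instead of a stale one.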
For more on building automated cloud operations, explore our Automated Cloud Operations service.
Tactic 7: Run a Monthly "Cost Spike Retrospective"
Prevention is better than detection. And the best way to prevent future spikes is to learn from past ones.
At the end of each month, run a 30-minute retrospective that reviews every cost anomaly from the previous month. For each anomaly, document:
- What spiked: Service, region, account, and approximate cost impact
- When it started: Date and time the anomaly began
- How long it ran: Hours or days before detection and remediation
- Root cause: What specifically caused the cost increase
- Detection method: How it was discovered (alert, manual review, finance report)
- Time to resolution: How long from detection to remediation
- Prevention action: What automation, policy, or process change will prevent recurrence
After three months of retrospectives, you will have a clear picture of your most common spike patterns. Most teams discover that 80% of their spikes come from just 2 to 3 root causes. Fix those root causes, and your monthly bill becomes dramatically more predictable.
The move: Schedule the first retrospective for the end of this month. Invite engineering, ops, and one finance partner. Make it a standing meeting. The cost insights compound over time.
For expert help building your entire FinOps practice, explore our Cloud Cost Optimization and FinOps service.
The Quick Reference: AWS Cost Spike Investigation Cheat Sheet
Bookmark this for your next spike:
| Symptom | First Place to Look | Common Cause | Quick Fix |
|---|---|---|---|
| EC2 cost spike | ASG activity history | Scaling misconfiguration | Adjust min/max capacity, fix scaling metric |
| Data transfer spike | VPC Flow Logs, NAT Gateway metrics | Cross-AZ traffic, missing VPC endpoints | Add VPC endpoints, optimize service placement |
| S3 cost spike | S3 Storage Lens, CloudWatch requests metric | Request volume, versioning bloat | Enable Intelligent-Tiering, set version lifecycle |
| RDS cost spike | RDS instance list, CloudWatch CPU | Oversized dev/test instances | Downsize or terminate idle instances |
| Lambda cost spike | Lambda metrics dashboard | Recursive invocations, high memory config | Fix recursion, right-size memory allocation |
| NAT Gateway spike | NAT Gateway CloudWatch metrics | Application routing through NAT unnecessarily | Create VPC endpoints for S3, DynamoDB, etc. |
Your AWS Bill Does Not Have to Be a Surprise
Here is the thing about AWS cost spikes. They feel random but they are not. Every spike has a cause. Every cause has a pattern. And every pattern has a prevention.
The teams that get control of their AWS costs are not the ones with the biggest budgets or the most sophisticated tools. They are the ones who built a system: anomaly detection that catches problems in hours, a playbook that traces root causes in minutes, tagging that makes attribution instant, and automation that prevents the most common mistakes from recurring.
You can build that system in a month. Start with anomaly detection this week. Add the investigation playbook next week. Enforce tagging the week after. Set up one automation the week after that. Four weeks from now, your AWS bill will be the most predictable number in your entire business.
Want to find all the waste hiding in your AWS account right now? Take our free Cloud Waste and Risk Scorecard for a personalized assessment in under 5 minutes.
Related reading:
- Why Is My AWS Bill Suddenly So High? The Complete Technical Playbook
- The Hidden AWS Bill: NAT Gateways and AI Workloads
- The Real Cost of Data Transfer: NAT Gateways, Egress Fees, and Hidden Bill Killers
- Real-Time Cloud Cost Optimization: 7 Strategies to Prevent Spend Spikes
- AWS Sprawl Is Silently Eating 30-40% of Your Cloud Budget