Cloud Optimization
Feb 27, 2026
By LeanOps Team

AWS Sprawl Is Silently Eating 30-40% of Your Cloud Budget: Here's How to Stop It

Let me describe a pattern I have seen at nearly every SaaS and AI startup that has been on AWS for more than 18 months.

You started lean. A few EC2 instances, one RDS database, an S3 bucket or two. Your AWS bill was predictable and proportional to what you were building. Then you grew. New features needed new services. The team expanded. Developers spun up environments for testing, experimentation, demos. Someone launched a GPU instance for an ML experiment. Someone else provisioned a NAT Gateway for a VPC that was supposed to be temporary.

Fast forward to today. Your AWS account has hundreds of resources you cannot fully account for. Your bill is growing 15 to 25 percent quarter over quarter, but your traffic and revenue are not growing at the same rate. You have a vague sense that there is waste in there, but you do not know where it is, how much it is, or how to find it without spending a week inside Cost Explorer.

This is AWS sprawl. And if your experience matches what I just described, you are probably overspending by 30 to 40 percent right now.

This post is going to teach you how to find every dollar of sprawl in your AWS account, understand why it accumulated in the first place, and eliminate it with systems that prevent it from coming back.


What AWS Sprawl Actually Looks Like on a Bill

Before we talk about solutions, let me show you what sprawl looks like when you actually dissect an AWS invoice line by line. Most teams look at the top-level total and maybe the service-level breakdown. That is not enough to see sprawl. You need to go one level deeper.

Here are the seven sprawl patterns that account for the majority of AWS overspending. I am going to get specific about each one because generic advice like "right-size your instances" does not help if you do not know what to look for.

Pattern 1: The Forgotten Compute Fleet

Every growing engineering team accumulates EC2 instances that nobody owns. They were created for a sprint that ended, a load test that finished, a proof of concept that was abandoned, a staging environment that was supposed to be temporary. They are still running. They are still billing.

Here is how to find them. Pull your EC2 instance list and sort by average CPU utilization over the past 14 days. Any instance below 5 percent average CPU is either idle or doing so little work that it could be consolidated onto a smaller instance or eliminated entirely. In a typical startup AWS account with 50 to 200 EC2 instances, 15 to 30 percent fall into this category.

The cost is not trivial. An m5.xlarge running idle in us-east-1 costs $140 per month. Ten of them is $1,400 per month, $16,800 per year, for compute that produces zero value.

But here is the part most people miss: the EC2 instance is only the beginning of the cost. Each forgotten instance likely has an attached EBS volume ($10 to $50/month depending on size and type), sits in a subnet behind a NAT Gateway (more on that shortly), generates CloudWatch metrics and logs, and might have an associated Elastic IP. The true all-in cost of a forgotten instance is typically 1.5 to 2x the EC2 charge alone.
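To put numbers on this, here is a minimal sketch of the all-in math. The instance prices and the 1.5 to 2x overhead multiplier are the rough figures from above, not live AWS pricing:

```python
# Rough estimator for the all-in cost of a forgotten EC2 fleet.
# Prices are approximate us-east-1 on-demand rates, and the overhead
# multiplier covers EBS, NAT, CloudWatch, and Elastic IP charges.
EC2_MONTHLY = {"m5.xlarge": 140.0, "m5.large": 70.0}

def idle_fleet_cost(instances, overhead_multiplier=1.75):
    """instances: list of (instance_type, avg_cpu_percent) tuples.
    Counts instances under 5% average CPU over the lookback window
    and applies the overhead multiplier to their EC2 cost."""
    idle = [t for t, cpu in instances if cpu < 5.0]
    ec2_only = sum(EC2_MONTHLY[t] for t in idle)
    return len(idle), ec2_only * overhead_multiplier

count, monthly = idle_fleet_cost([
    ("m5.xlarge", 2.1), ("m5.xlarge", 1.4), ("m5.large", 47.0),
])
# Two idle m5.xlarge at $140 each is $280/month before overhead.
```

In a real audit the (type, CPU) pairs come from CloudWatch's CPUUtilization metric; the point of the multiplier is that the EC2 line item understates the true waste.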

Pattern 2: EBS Volume Graveyard

When an EC2 instance is terminated, its root EBS volume is deleted by default (unless someone changed that setting). But additional attached volumes? Those persist. They sit in your account, unattached, silently charging you every month.

EBS snapshots are even worse. Every time someone creates a snapshot for backup or AMI creation, that snapshot persists until explicitly deleted. Over months and years, snapshot storage accumulates into a significant line item. I have audited AWS accounts where EBS snapshot costs exceeded $2,000 per month and nobody on the team even knew snapshots were a billable resource.

To find this waste: filter your EBS volumes by attachment status. Every volume showing "available" (unattached) is a candidate for deletion. For snapshots, check the age and whether the source volume or AMI still exists. Snapshots older than 90 days for volumes that no longer exist are almost certainly deletable.
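That filtering logic is simple enough to script. A sketch, assuming you have already pulled volume and snapshot metadata (the dict shapes mirror what EC2's DescribeVolumes and DescribeSnapshots return, trimmed to the relevant fields):

```python
from datetime import datetime, timedelta, timezone

def deletable_ebs(volumes, snapshots, existing_volume_ids, max_age_days=90):
    """Flags unattached volumes and stale snapshots.
    volumes: dicts with 'VolumeId' and 'State';
    snapshots: dicts with 'SnapshotId', 'VolumeId', 'StartTime'.
    A snapshot is stale if it is older than max_age_days and its
    source volume no longer exists."""
    cutoff = datetime.now(timezone.utc) - timedelta(days=max_age_days)
    orphan_volumes = [v["VolumeId"] for v in volumes
                      if v["State"] == "available"]
    stale_snapshots = [
        s["SnapshotId"] for s in snapshots
        if s["StartTime"] < cutoff
        and s["VolumeId"] not in existing_volume_ids
    ]
    return orphan_volumes, stale_snapshots
```

Review the output before deleting anything; snapshots backing live AMIs need an extra existence check against DescribeImages.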

Pattern 3: The NAT Gateway Money Pit

This is one of the most underappreciated cost drivers in AWS, and it is directly connected to sprawl.

A NAT Gateway costs $0.045 per hour ($32.40 per month) just to exist, plus $0.045 per GB of data processed through it. If you have private subnets that need internet access (which most production VPCs do), you are paying for at least one NAT Gateway.

Here is where sprawl makes this expensive. As teams create new VPCs for different environments, different services, or different experiments, each VPC gets its own NAT Gateway. Three VPCs in one region? Three NAT Gateways at $97 per month before a single byte of data flows through them. Add data processing charges for services pulling Docker images, downloading packages, making API calls to external services, and a single NAT Gateway can easily cost $200 to $500 per month.

The fix involves consolidating VPCs where possible, using VPC endpoints to bypass NAT Gateways for AWS service traffic (S3, DynamoDB, ECR, and other services have gateway or interface endpoints that eliminate NAT Gateway data processing charges), and reviewing whether every VPC actually needs a NAT Gateway at all.
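The pricing mechanics are easy to model. A quick sketch using the published per-hour and per-GB rates quoted above:

```python
# NAT Gateway cost model: a fixed hourly charge per gateway plus a
# per-GB data processing charge. Rates are the figures cited above.
NAT_HOURLY = 0.045       # $ per gateway-hour
NAT_PER_GB = 0.045       # $ per GB processed
HOURS_PER_MONTH = 720

def nat_monthly_cost(gateway_count, gb_processed):
    baseline = gateway_count * NAT_HOURLY * HOURS_PER_MONTH
    return baseline + gb_processed * NAT_PER_GB

# Three idle gateways: ~$97/month before a single byte flows.
# One gateway processing 5 TB/month: $32.40 base + $230.40 processing.
```

Note that VPC endpoints attack the second term: traffic to S3, DynamoDB, or ECR routed through a gateway endpoint never hits the per-GB charge at all.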

For a deeper dive into how NAT Gateway and data transfer costs compound, our post on AWS NAT Gateway and AI cost optimization covers the mechanics in detail.

Pattern 4: S3 Buckets With No Lifecycle Strategy

S3 Standard storage costs $0.023 per GB per month. S3 Standard-Infrequent Access costs $0.0125. S3 Glacier Instant Retrieval costs $0.004. That is an 83 percent price reduction, Standard to Glacier Instant Retrieval, for data you access less than once a month.

In sprawled AWS accounts, the vast majority of S3 data sits in Standard tier regardless of access patterns. Application logs from six months ago. Old deployment artifacts. Training data from completed ML experiments. Backup files that nobody will ever restore.

The waste is proportional to your data volume. At 10 TB of S3 Standard with 70 percent of that data rarely accessed, moving the cold data to appropriate tiers saves roughly $95 per month. At 100 TB, that is $950 per month. At petabyte scale, it is thousands.

The tool you need is S3 Storage Lens. It shows you access patterns per bucket and per prefix, which tells you exactly what data is hot and what should move to a cheaper tier. Then set lifecycle rules to automate the transitions so they do not require human intervention.
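The savings arithmetic behind the 10 TB example is simple to reproduce (prices are the Standard, Standard-IA, and Glacier Instant Retrieval rates quoted above):

```python
# S3 per-GB monthly prices for the tiers discussed above.
STANDARD = 0.023
STANDARD_IA = 0.0125
GLACIER_IR = 0.004

def tiering_savings(total_gb, cold_fraction, target_price=GLACIER_IR):
    """Monthly savings from moving the cold fraction of a bucket
    out of Standard into a cheaper tier."""
    cold_gb = total_gb * cold_fraction
    return cold_gb * (STANDARD - target_price)

# 10 TB with 70% cold, all moved to Glacier Instant Retrieval:
# 7,000 GB * $0.019 = $133/month. A blended IA/Glacier split lands
# closer to the ~$95 figure above.
```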

For teams with large storage footprints, our cheapest cloud storage guide compares S3 against alternatives like R2 and Backblaze B2 for different workload patterns.

Pattern 5: RDS Instances Running at 3x the Size They Need

Database instances are one of the hardest resources to right-size because everyone is terrified of database performance issues. So they overprovision, and nobody ever revisits the decision.

Pull your RDS CloudWatch metrics for CPU utilization, freeable memory, and read/write IOPS over the past 30 days. If your database is running below 20 percent CPU utilization and has more than 50 percent freeable memory consistently, it is oversized. A db.r5.2xlarge ($700/month) running at 15 percent CPU could likely be a db.r5.xlarge ($350/month) with no performance impact.
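The check is mechanical enough to script once the metrics are in hand. A sketch using the thresholds above; in practice the inputs come from CloudWatch's CPUUtilization and FreeableMemory metrics over the 30-day window:

```python
def is_oversized(avg_cpu_pct, freeable_mem_pct,
                 cpu_threshold=20.0, mem_threshold=50.0):
    """Flags an RDS instance as a downsize candidate: consistently
    low CPU and a large fraction of memory never used."""
    return avg_cpu_pct < cpu_threshold and freeable_mem_pct > mem_threshold

def downsize_savings(current_monthly):
    # Dropping one instance class (e.g. db.r5.2xlarge -> db.r5.xlarge)
    # roughly halves the on-demand price.
    return current_monthly / 2
```

A db.r5.2xlarge at 15 percent CPU and 62 percent freeable memory trips the check, and one class down saves roughly $350 per month on the $700 figure above.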

The scary part of RDS right-sizing is the downtime required for instance class changes. Multi-AZ deployments can mitigate this with a failover-based resize, but it still requires planning. This is why teams never do it. They provision big, the database works fine, and nobody touches it for two years.

Automation helps here. AWS Compute Optimizer provides RDS right-sizing recommendations based on actual utilization. The recommendation is free. Acting on it requires scheduling a maintenance window, which is the human bottleneck that automation pipelines can solve.

Pattern 6: Lambda Functions With Bloated Memory Configurations

Lambda pricing is based on the number of requests, the duration of each invocation, and the amount of memory allocated. Most teams set Lambda memory to 512 MB or 1024 MB because it is easy and safe, without testing whether their function actually needs that much.

Here is the counterintuitive part: Lambda allocates CPU proportionally to memory. So allocating more memory also gives you more CPU, which makes functions run faster. Since Lambda bills in GB-seconds, the cost-optimal configuration is the memory setting that minimizes execution duration multiplied by allocated memory. That is almost never the default.

AWS Lambda Power Tuning is an open-source Step Functions state machine that automatically runs your Lambda at different memory configurations and finds the optimal price-to-performance setting. For teams running Lambda at scale (millions of invocations per month), power tuning each function can reduce Lambda costs by 20 to 40 percent.
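The underlying cost model is worth internalizing even if you let the tool do the measuring. A sketch, with hypothetical per-memory duration measurements standing in for real benchmark runs:

```python
# Lambda bills per GB-second. This is the x86 on-demand rate in most
# regions; the duration figures below are hypothetical measurements.
PRICE_PER_GB_SECOND = 0.0000166667

def invocation_cost(memory_mb, duration_ms):
    return (memory_mb / 1024) * (duration_ms / 1000) * PRICE_PER_GB_SECOND

def cheapest_config(measurements):
    """measurements: {memory_mb: avg_duration_ms}. Because CPU scales
    with memory, a bigger setting can finish fast enough to cost less."""
    return min(measurements, key=lambda m: invocation_cost(m, measurements[m]))

# A CPU-bound function where more memory sharply cuts duration:
best = cheapest_config({512: 800, 1024: 380, 2048: 260})  # -> 1024
```

Here 1024 MB wins: it costs less per invocation than 512 MB despite double the memory, because the run finishes in under half the time.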

Pattern 7: CloudWatch Log Groups With No Retention Policy

By default, CloudWatch Logs retain data forever. Every log statement from every Lambda function, every ECS task, every application running on EC2 is stored indefinitely unless you set a retention policy.

CloudWatch Logs ingestion costs $0.50 per GB, and storage costs $0.03 per GB per month. A moderately verbose application generating 10 GB of logs per day accumulates 300 GB per month. After one year, you have 3.6 TB of logs in CloudWatch costing $108 per month in storage alone, on top of the $1,800 you spent on ingestion.
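The arithmetic is worth scripting so you can plug in your own log volume (rates are the CloudWatch figures just cited):

```python
# CloudWatch Logs pricing: a one-time per-GB ingestion charge plus
# a recurring per-GB monthly storage charge.
INGEST_PER_GB = 0.50
STORAGE_PER_GB_MONTH = 0.03

def logs_cost_after(months, gb_per_day, days_per_month=30):
    """Returns (GB stored, total ingestion spend, current monthly
    storage cost) after the given number of months with no retention
    policy in place."""
    stored_gb = gb_per_day * days_per_month * months
    ingestion = stored_gb * INGEST_PER_GB
    monthly_storage = stored_gb * STORAGE_PER_GB_MONTH
    return stored_gb, ingestion, monthly_storage

stored, ingested, storage = logs_cost_after(12, 10)
# 3,600 GB stored, $1,800 spent on ingestion, $108/month in storage.
```

Retention policies only cut the storage term; the ingestion charge is paid the moment the log line lands, which is an argument for trimming verbosity too.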

The fix is dead simple: set retention policies on every log group. Most teams need 30 to 90 days of log retention for operational purposes. Anything older can be exported to S3 (at $0.023/GB/month, about 23 percent cheaper than CloudWatch storage, and far cheaper still in an archival tier) or deleted entirely.

Go to your CloudWatch Log Groups console right now. Sort by stored bytes. I guarantee you will find log groups with hundreds of gigabytes of data and a retention setting of "Never expire."


Why AWS Sprawl Keeps Coming Back (And How to Stop the Cycle)

Cleaning up sprawl is the easy part. Preventing it from returning is the actual challenge.

Sprawl comes back because the incentives and workflows that created it in the first place are unchanged. Developers need infrastructure fast. They have the IAM permissions to create it. There is no cost feedback at creation time. And the cleanup responsibility falls on nobody specific.

Here are the three systemic changes that break the cycle permanently.

Change 1: Cost Visibility at the Point of Creation

When a developer opens a pull request that includes a Terraform change provisioning a new RDS instance, they should see the estimated monthly cost of that resource in the PR itself, before it merges.

Infracost does exactly this. It integrates into GitHub, GitLab, and Bitbucket PR workflows and comments with the cost impact of every infrastructure change. When engineers see "$347/month for this new database" directly in the PR, they make different decisions than when the cost is invisible until the monthly invoice.

This is the single most effective sprawl prevention mechanism available. It does not block anything. It does not slow anyone down. It just makes cost visible at the moment the spending decision is made.

Change 2: Mandatory Resource Tagging

If you cannot attribute a resource to a team, a service, or a project, you cannot hold anyone accountable for its cost. And unaccountable resources are the ones that become sprawl.

Use AWS Service Control Policies (SCPs) to enforce tag requirements at the organization level. Require at minimum: team, service, environment, and cost-center tags on every taggable resource. Deny resource creation that does not include these tags. No exceptions.

This feels aggressive, and developers will push back initially. But mandatory tagging is the foundation of every other cost governance practice. Without it, you cannot allocate costs to teams, you cannot detect anomalies per service, and you cannot identify owners for sprawled resources.
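As a concrete starting point, here is what one statement of such an SCP can look like. This sketch denies `ec2:RunInstances` when the request lacks a `team` tag; a production policy needs similar statements per service and per required tag, since `aws:RequestTag` support varies by API:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyUntaggedEC2",
      "Effect": "Deny",
      "Action": ["ec2:RunInstances"],
      "Resource": "arn:aws:ec2:*:*:instance/*",
      "Condition": {
        "Null": {
          "aws:RequestTag/team": "true"
        }
      }
    }
  ]
}
```

The `Null` condition reads as "deny if the `team` tag is absent from the request," which is the enforcement primitive the rest of the tagging strategy hangs on.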

Change 3: Automated Expiration for Non-Production Resources

Every non-production resource should have a TTL (time to live). Dev environments expire after 7 days. Feature branch environments expire when the branch is merged or deleted. Load testing infrastructure expires after the test window.

Implement this through Terraform resource lifecycle rules, Kubernetes namespace TTLs, or a simple Lambda function that scans for resources tagged "environment=dev" older than a threshold and terminates them.

The goal is to make non-production infrastructure ephemeral by default. If someone needs a resource for longer than the default TTL, they can explicitly extend it. But the default state is "this will be cleaned up automatically," not "this will run forever unless someone remembers to delete it."
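The sweep logic behind that Lambda is only a few lines. A sketch of the scan described above, with a hypothetical resource shape; a real sweeper would pull this data from the Resource Groups Tagging API and call the relevant terminate/delete APIs on matches:

```python
from datetime import datetime, timedelta, timezone

DEFAULT_TTL_DAYS = 7

def expired_resources(resources, now=None):
    """resources: dicts with 'id', 'tags', and 'created' (an aware
    datetime). Resources tagged environment=dev expire after the
    default TTL unless an explicit 'ttl-days' tag extends them."""
    now = now or datetime.now(timezone.utc)
    doomed = []
    for r in resources:
        tags = r["tags"]
        if tags.get("environment") != "dev":
            continue  # only non-production resources are swept
        ttl = timedelta(days=int(tags.get("ttl-days", DEFAULT_TTL_DAYS)))
        if now - r["created"] > ttl:
            doomed.append(r["id"])
    return doomed
```

The `ttl-days` tag is the explicit extension mechanism: the default is ephemeral, and keeping something longer requires an affirmative act that leaves an audit trail.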


The 30-Day AWS Sprawl Elimination Playbook

Here is the exact sequence we follow at LeanOps during cloud cost optimization engagements when the primary issue is AWS sprawl. Each phase builds on the previous one, and each delivers measurable savings.

Days 1 to 7: Discovery and Quick Wins

  • Enable Cost Explorer at the resource level if not already active
  • Audit all EC2 instances for sub-5-percent CPU utilization over 14 days
  • Identify and delete unattached EBS volumes and orphaned snapshots
  • List all NAT Gateways and calculate per-VPC cost
  • Set CloudWatch Log Group retention policies to 30 or 90 days

Expected savings: 10 to 15 percent of total AWS spend

These are the easiest wins because they involve removing resources that are not doing anything. Zero risk to production.

Days 8 to 14: Right-Sizing

  • Run Compute Optimizer analysis on all EC2 and RDS instances
  • Identify instances that can downsize by one or more instance classes
  • Schedule right-sizing changes during maintenance windows
  • Run Lambda Power Tuning on all functions with more than 100,000 monthly invocations

Expected savings: additional 10 to 20 percent

Days 15 to 21: Storage and Networking Optimization

  • Run S3 Storage Lens analysis on all buckets over 100 GB
  • Implement lifecycle rules to transition cold data to IA or Glacier
  • Deploy VPC endpoints for S3, DynamoDB, and ECR to reduce NAT Gateway data processing
  • Consolidate VPCs where separate VPCs are not architecturally justified

Expected savings: additional 5 to 10 percent

For teams managing storage across multiple providers, our cloud storage pricing comparison helps evaluate whether S3 is the right choice for every workload.

Days 22 to 30: Prevention and Governance

  • Deploy Infracost in CI/CD pipelines for cost visibility on infrastructure changes
  • Implement SCP-based mandatory tagging across all accounts
  • Set up TTL policies for non-production resources
  • Configure AWS Cost Anomaly Detection at the service level
  • Establish weekly 15-minute cost review cadence with engineering leads

Expected result: governance framework that prevents sprawl from returning


The Numbers That Tell You Sprawl Is a Problem

You do not need a full audit to know if sprawl is eating your budget. Check these three indicators:

Cloud cost as a percentage of revenue. For SaaS companies, cloud infrastructure should be 15 to 25 percent of revenue in the growth stage. If you are above 30 percent, sprawl is a likely contributor.

Month-over-month cost growth vs. traffic growth. If your AWS bill is growing 10 percent month over month but your traffic is growing 3 percent, the gap is waste. Efficient infrastructure costs should grow proportionally to usage.

Number of untagged resources. Log into AWS Resource Groups and Tag Editor. Filter for resources with no tags. If more than 20 percent of your billable resources are untagged, you have a governance gap that enables sprawl.


What Most Companies Get Wrong About AWS Sprawl

They treat it as a cleanup project instead of a systems problem. You can clean up every idle resource today and be back to the same sprawl level in six months if you do not change the processes that create waste. The governance changes (cost visibility in PRs, mandatory tagging, resource TTLs) matter more than the initial cleanup.

They try to solve it with a tool instead of a practice. No tool eliminates sprawl on its own. Tools provide visibility and automation. But someone needs to act on what the tools surface, and the organization needs to care enough about cost efficiency to make it a priority. The FinOps Foundation's operating model provides a solid framework for building this organizational practice.

They optimize only the biggest line items and ignore the long tail. The top five cost drivers are easy to find and optimize. But 30 to 40 percent of sprawl cost comes from dozens of small resources: $20 here, $50 there, an unused IP, a forgotten snapshot, a log group with infinite retention. Individually they are not worth a human's time to investigate. Collectively they add up to thousands per month. This is exactly the type of waste that automated lifecycle management catches and human-led reviews miss.


Start This Week

Open AWS Cost Explorer. Go to the "Daily costs" view. Look at the last 90 days. If the trend line is going up and to the right faster than your business metrics, sprawl is the most likely cause.

If you want help eliminating it, LeanOps runs 90-day cloud cost optimization engagements focused on exactly this problem. We do the discovery, the cleanup, the right-sizing, and the governance implementation. If we do not deliver at least 30 percent savings, you do not pay.

For teams that want ongoing infrastructure management after the sprawl is eliminated, our Cloud Operations services provide continuous monitoring, scaling, and cost governance without adding headcount.
