Why Most Cloud Cost "Optimizations" Fail
You have probably seen the advice before. Turn off idle resources. Use reserved instances. Right-size your compute. Simple enough, right?
Yet most teams that follow that advice still overspend by 30-50%. The reason is that generic recommendations miss the specific, often hidden cost drivers that vary by workload, provider, and architecture. A Kubernetes cluster has completely different waste patterns than a serverless application. An AI training pipeline wastes money in places a standard web app never would.
This checklist is different. It covers 47 specific actions across seven cost categories, each with the expected savings range and the exact tool or setting to change. Print it. Share it with your team. Work through it section by section. Most teams recover 30-60% of their cloud spend within 30 days of completing this audit.
Before You Start: The 3 Things You Need
1. Cost visibility across all accounts. Enable AWS Cost Explorer, Azure Cost Management, or GCP Cloud Billing Reports. If you have multiple accounts, consolidate their billing into a single view first.
2. Resource tagging in place. Without tags, you cannot attribute costs to teams, projects, or environments. Enable AWS Cost Allocation Tags, GCP Labels, or Azure Tags. At minimum, tag every resource with: environment (prod/staging/dev), team, and application.
3. At least 30 days of billing data. You need a full billing cycle to spot patterns. Do not optimize based on a single day or week of data.
Got those three? Good. Let us get into it.
Section 1: Compute Optimization (Typical Savings: 25-45%)
Compute is the largest line item on most cloud bills, and also where the most waste hides.
The Checklist
1. Identify instances with average CPU below 10%. These are massively overprovisioned. Use AWS Compute Optimizer, GCP VM Rightsizing Recommendations, or Azure Advisor. Savings: 30-60% per instance.
2. Right-size instances with average CPU between 10% and 40%. Drop one size down. A c6g.xlarge at 25% average CPU should be a c6g.large. Savings: 40-50% per instance.
3. Switch from x86 (Intel or AMD) to ARM-based instances. AWS Graviton (m7g, c7g, r7g), Azure Cobalt, and GCP Tau T2A deliver 20-30% better price-performance for the same workload. Most modern applications (Node.js, Python, Java, Go, .NET) run without modification. Savings: 20-30%.
4. Use spot/preemptible instances for fault-tolerant workloads. Good candidates are CI/CD pipelines, batch processing, data pipelines, and dev environments. The relevant offerings are AWS Spot Instances, GCP Spot VMs (the successor to Preemptible VMs), and Azure Spot VMs. Savings: 60-90% vs. on-demand.
5. Schedule non-production environments to stop after hours. Dev, staging, QA, and load-test environments running 24/7 waste 65-70% of their cost. Use AWS Instance Scheduler or equivalent. Savings: 65-70% on non-prod compute.
6. Eliminate zombie instances. Check for instances with near-zero network traffic for the past 14+ days. These are likely forgotten deployments, old test environments, or decommissioned services still running. Savings: 100% per zombie.
7. Use auto-scaling with appropriate min/max bounds. Set minimum instances based on your lowest traffic point (usually 2-4 AM local time), not your average. Too many teams set the minimum at their average load, which means they overpay during low-traffic hours. Savings: 15-25%.
8. Review and consolidate underused load balancers. Each ALB costs $16+/month plus data processing fees. Teams commonly have one ALB per microservice when a single ALB with path-based routing handles multiple services. Savings: $16-50/month per consolidated ALB.
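The CPU and network thresholds in items 1, 2, and 6 can be turned into a simple triage pass over exported metrics. The sketch below is illustrative only: the `InstanceStats` fields, the instance names, and the 1 MB "near-zero traffic" cutoff are assumptions, and in practice the numbers would come from your provider's monitoring export.

```python
from dataclasses import dataclass


@dataclass
class InstanceStats:
    name: str               # hypothetical instance identifier
    avg_cpu_pct: float      # average CPU utilization over the review window
    network_mb_14d: float   # total network traffic over the past 14 days, in MB


def triage(stats: InstanceStats, zombie_network_mb: float = 1.0) -> str:
    """Map one instance to the checklist action it most likely needs."""
    # Item 6: near-zero network traffic for 14+ days suggests a zombie.
    # The 1 MB default cutoff is an assumption, not a provider rule.
    if stats.network_mb_14d <= zombie_network_mb:
        return "investigate as zombie (item 6)"
    # Item 1: average CPU below 10% means heavily overprovisioned.
    if stats.avg_cpu_pct < 10:
        return "heavily overprovisioned (item 1)"
    # Item 2: average CPU between 10% and 40%, drop one size down.
    if stats.avg_cpu_pct <= 40:
        return "drop one size (item 2)"
    return "leave as is"


fleet = [
    InstanceStats("web-1", 25.0, 52_000.0),
    InstanceStats("batch-old", 0.4, 0.2),
    InstanceStats("api-2", 6.5, 18_000.0),
    InstanceStats("db-proxy", 63.0, 91_000.0),
]
for s in fleet:
    print(f"{s.name}: {triage(s)}")
```

Treat the output as a list of candidates to review, not as a list of changes to apply blindly: a quiet instance may still be a standby or a scheduled job.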
Section 2: Database Optimization (Typical Savings: 20-40%)
Databases are the second-largest cost driver and the most commonly overprovisioned resource category.
The Checklist
9. Right-size RDS/Cloud SQL instances. If average CPU is below 20% and memory usage below 40%, drop one or two sizes. A db.r6g.xlarge ($548/month) running at 15% CPU could become a db.t4g.large ($196/month), provided the working set still fits in the smaller instance's memory. Savings: $350+/month per instance.
10. Evaluate Aurora Serverless v2 for variable workloads. If your database traffic spikes during business hours and drops near zero overnight, Aurora Serverless v2 scales to 0.5 ACU during quiet periods. For databases that idle 12+ hours per day, this can cut costs by 50-70% compared to provisioned instances. Savings: 50-70% for bursty workloads.
11. Switch from commercial database engines to open-source. RDS for Oracle costs 3-5x more than RDS for PostgreSQL for equivalent instance sizes due to embedded license fees. Unless you have hard dependencies on Oracle-specific features (PL/SQL packages, RAC), PostgreSQL handles most workloads. Savings: 60-80% per database.
12. Delete old database snapshots. Automated snapshots are retained based on your backup retention setting, but manual snapshots persist forever until deleted. Check for manual snapshots older than 90 days. At $0.095/GB/month on RDS, a 500GB database snapshot costs $47.50/month. Ten old snapshots? $475/month doing nothing. Savings: $50-500+/month.
13. Use read replicas instead of scaling up. Instead of upgrading your primary database instance for read-heavy workloads, add a read replica and route read queries to it. A replica is billed like any other instance of its size, but a replica that only serves reads can often be a smaller size than the primary. Savings: 20-40% vs. scaling the primary.
14. Enable storage autoscaling with a maximum limit. Both RDS and Cloud SQL can auto-expand storage, but without a maximum, a runaway query or log accumulation can expand storage to hundreds of GB that you never reclaim (cloud providers do not shrink allocated storage). Set a sensible maximum. Risk prevention: avoids surprise $100-500+/month increases.
15. Review DynamoDB/Cosmos DB capacity mode. If you are on provisioned capacity but your traffic is unpredictable, switch to on-demand. If your traffic is steady and predictable, provisioned with auto-scaling is 30-50% cheaper than on-demand. Match the billing model to your pattern. Savings: 30-50%.
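The snapshot arithmetic in item 12 is easy to run against a real inventory. This is a minimal sketch, assuming you have already exported each manual snapshot's creation date and size; the rate is the $0.095/GB/month quoted above, and the dates and sizes are made up for illustration.

```python
from datetime import date

RDS_SNAPSHOT_USD_PER_GB_MONTH = 0.095  # rate quoted in item 12


def stale_snapshot_cost(snapshots, today, max_age_days=90):
    """Return (count, monthly cost in USD) of snapshots older than max_age_days.

    snapshots: iterable of (created: date, size_gb: float) pairs.
    """
    stale_sizes = [
        size_gb
        for created, size_gb in snapshots
        if (today - created).days > max_age_days
    ]
    cost = sum(stale_sizes) * RDS_SNAPSHOT_USD_PER_GB_MONTH
    return len(stale_sizes), round(cost, 2)


# Hypothetical inventory: ten old 500 GB snapshots plus one recent one.
today = date(2025, 1, 1)
inventory = [(date(2024, 1, 1), 500.0)] * 10 + [(date(2024, 12, 20), 500.0)]

count, monthly_cost = stale_snapshot_cost(inventory, today)
print(f"{count} stale snapshots costing about ${monthly_cost:.2f}/month")
```

With the sample inventory this reproduces the $475/month figure from item 12 for ten 500 GB snapshots.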
Section 3: Storage Optimization (Typical Savings: 30-60%)
Storage costs creep up slowly and then suddenly dominate your bill.
The Checklist
16. Implement lifecycle policies on all object storage buckets. Move data to cheaper tiers automatically. Standard to Infrequent Access at 30 days, to Glacier at 90 days, to Deep Archive at 365 days. Or use S3 Intelligent-Tiering to automate this entirely. Savings: 40-96% on aged data.
17. Delete unattached EBS volumes. Go to EC2 > Volumes > filter by "Available" status. Every volume listed is costing money and attached to nothing. This is the single most common zombie resource. Savings: $8-100+/month per volume.
18. Downgrade EBS volume types. If you provisioned io1 or io2 volumes for a workload that turned out to be sequential/low-IOPS, switch to gp3 at baseline (3,000 IOPS included at no extra charge) or even sc1 for cold data. Most application logs and archived data do not need premium IOPS. Savings: 30-80% per volume.
19. Compress data before storing. This sounds obvious, but a surprising number of teams store raw JSON, CSV, and log files without compression. gzip typically achieves 70-90% compression on text data. Parquet or ORC format for analytics data reduces storage by 75%+ compared to raw CSV. Savings: 70-90% on storage costs for compressible data.
20. Clean up old container images. ECR, GCR, and ACR store every container image you have ever pushed. Set lifecycle policies to retain only the last 10-20 tagged images per repository. Teams with active CI/CD pipelines commonly accumulate hundreds of images at $0.10/GB/month. Savings: $10-100+/month.
21. Review and reduce CloudWatch/Cloud Logging (formerly Stackdriver) log retention. Default log retention on CloudWatch is "never expire." Most logs are only useful for 14-30 days. Set retention policies and archive older logs to S3 if you need them for compliance. At $0.50/GB ingestion + $0.03/GB/month storage, verbose logging can cost hundreds per month. Savings: 50-80% on logging costs.
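The tiering schedule from item 16 (30, 90, and 365 days) maps directly onto an S3 lifecycle configuration. The sketch below only builds and prints the configuration; the rule ID and the empty prefix are placeholder assumptions, and applying it requires boto3, credentials, and your own bucket name, as noted in the trailing comment.

```python
import json


def tiering_lifecycle_config(prefix: str = "") -> dict:
    """Build an S3 lifecycle configuration matching the schedule in item 16."""
    return {
        "Rules": [
            {
                "ID": "tier-down-aged-objects",  # hypothetical rule name
                "Filter": {"Prefix": prefix},    # empty prefix = whole bucket
                "Status": "Enabled",
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},
                    {"Days": 90, "StorageClass": "GLACIER"},
                    {"Days": 365, "StorageClass": "DEEP_ARCHIVE"},
                ],
            }
        ]
    }


config = tiering_lifecycle_config()
print(json.dumps(config, indent=2))

# To apply it (not executed here; assumes boto3 is installed and configured):
#   import boto3
#   boto3.client("s3").put_bucket_lifecycle_configuration(
#       Bucket="<your-bucket>", LifecycleConfiguration=config
#   )
```

Colder tiers carry retrieval fees and minimum storage durations, so check access patterns before applying a rule like this to every bucket.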
Section 4: Networking and Data Transfer (Typical Savings: 15-35%)
Networking costs are the least visible and most confusing line items. Most teams do not realize they are paying for them until the bill arrives.
The Checklist
22. Audit NAT Gateway usage. NAT Gateways cost $0.045/hour ($32/month) plus $0.045/GB processed. If you are routing S3 or DynamoDB traffic through NAT, set up VPC Gateway Endpoints (free) to bypass it. Savings: $50-500+/month.
23. Minimize cross-AZ data transfer. Every byte moving between availability zones costs $0.01/GB each way ($0.02/GB round trip). Place tightly-coupled services in the same AZ for non-critical workloads. For production workloads that need multi-AZ, consider whether every service truly needs to communicate cross-AZ. Savings: $100-1,000+/month at scale.
24. Use a CDN for static assets. CloudFront, Cloudflare, or Cloud CDN cache content at edge locations and reduce origin egress. Cloudflare's free tier includes unlimited bandwidth with no egress fees. Savings: 40-80% on static content delivery costs.
25. Review and reduce public IP costs. As of February 2024, AWS charges $0.005/hour ($3.65/month) per public IPv4 address, whether attached or not. Audit all Elastic IPs, public-facing load balancers, and NAT gateways. Remove any that are not actively needed. Savings: $3.65/month per IP.
26. Evaluate alternative providers for egress-heavy workloads. If data transfer is a major cost driver, consider Cloudflare R2 (zero egress fees), Backblaze B2 (free egress to Cloudflare), or Wasabi (no egress fees). Moving 10TB/month of egress from S3 ($0.09/GB = $900/month) to R2 ($0/GB) saves $900/month. Savings: up to 100% on egress costs.
27. Consolidate VPCs and reduce peering complexity. Each VPC peering connection, transit gateway attachment, and VPN connection has associated costs. Teams that create separate VPCs per environment or per service often end up with expensive hub-and-spoke networking when a simpler architecture would suffice. Savings: $50-300+/month.
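To see why item 22 matters, it helps to put numbers on a NAT Gateway. The sketch below uses the rates quoted above ($0.045/hour and $0.045/GB) and assumes a 730-hour month; the traffic volumes are hypothetical.

```python
HOURS_PER_MONTH = 730      # assumption: average hours in a month
NAT_HOURLY_USD = 0.045     # hourly rate quoted in item 22
NAT_PER_GB_USD = 0.045     # data processing rate quoted in item 22


def nat_monthly_cost(gb_processed: float) -> float:
    """Monthly cost of one NAT Gateway for a given processed volume."""
    fixed = NAT_HOURLY_USD * HOURS_PER_MONTH
    variable = NAT_PER_GB_USD * gb_processed
    return round(fixed + variable, 2)


def gateway_endpoint_savings(total_gb: float, s3_dynamodb_gb: float) -> float:
    """Savings from moving S3/DynamoDB traffic onto free VPC Gateway Endpoints."""
    before = nat_monthly_cost(total_gb)
    after = nat_monthly_cost(total_gb - s3_dynamodb_gb)
    return round(before - after, 2)


# Hypothetical workload: 5 TB/month through NAT, 4 TB of it bound for S3.
print(nat_monthly_cost(5000))                 # cost today
print(gateway_endpoint_savings(5000, 4000))   # saved by adding the endpoint
```

The fixed hourly charge stays either way; the saving comes entirely from the per-GB processing fee on traffic that no longer crosses the gateway.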
Section 5: Kubernetes and Container Optimization (Typical Savings: 25-50%)
Kubernetes is powerful but notoriously easy to overspend on. These items apply if you run EKS, GKE, or AKS.
The Checklist
28. Set resource requests and limits on every pod. Without requests, the Kubernetes scheduler cannot pack pods efficiently. Without limits, a single pod can consume an entire node's resources. Pods without requests and limits are the number one cause of Kubernetes cost waste. Savings: 20-40% on node costs.
29. Right-size resource requests based on actual usage. Use Vertical Pod Autoscaler (VPA) in recommendation mode or tools like Kubecost to see actual CPU and memory usage per pod. Most pods request 2-5x more resources than they use. Savings: 30-60% on overprovisioned pods.
30. Enable Cluster Autoscaler or Karpenter. Cluster Autoscaler adds and removes nodes based on pending pods. Karpenter (AWS) is faster and more cost-efficient, provisioning right-sized nodes in seconds. Without autoscaling, you are paying for peak capacity 24/7. Savings: 25-40%.
31. Use spot instances for Kubernetes worker nodes. Configure node groups with a mix of on-demand (for system-critical pods) and spot instances (for stateless application pods). Use pod disruption budgets to handle spot interruptions gracefully. Savings: 60-80% on worker node costs.
32. Evaluate whether you actually need Kubernetes. The EKS control plane costs $73/month. Add node groups, load balancers, EBS volumes, and engineering time, and the overhead is $500-2,000/month before running a single workload. If you are running fewer than 8-10 services, ECS Fargate, Cloud Run, or Azure Container Apps may cost 40-60% less with far less operational burden. Savings: 40-60% if you can simplify.
33. Reduce namespace sprawl. Each namespace with its own monitoring stack, ingress controller, and service mesh sidecar adds overhead. Consolidate where possible. Savings: $50-200/month per eliminated namespace overhead.
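Item 29's "pods request 2-5x more than they use" can be checked per workload once you have usage numbers from VPA recommendations or a tool like Kubecost. A minimal sketch follows, working in CPU millicores; the 20% headroom and the sample figures are assumptions, not a recommendation from any of those tools.

```python
def overrequest_ratio(requested_m: int, peak_used_m: int) -> float:
    """How many times larger the CPU request is than observed peak usage."""
    return requested_m / peak_used_m


def suggested_request(peak_used_m: int, headroom_pct: int = 20) -> int:
    """Observed peak plus headroom, rounded up to a whole millicore.

    Integer arithmetic avoids floating-point surprises when rounding up.
    The 20% default headroom is an illustrative assumption.
    """
    return (peak_used_m * (100 + headroom_pct) + 99) // 100


# Hypothetical pod: requests 1000m CPU, peaks at 250m.
requested, peak = 1000, 250
print(f"over-requested by {overrequest_ratio(requested, peak):.1f}x")
print(f"suggested request: {suggested_request(peak)}m")
```

The same comparison works for memory, though memory requests usually deserve more generous headroom because running out means the pod is killed rather than throttled.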
For a deeper dive into Kubernetes costs, read our guide on the hidden Kubernetes tax.
Section 6: Commitment and Pricing Optimization (Typical Savings: 20-40%)
These are the "free money" optimizations that require commitment but no architectural changes.
The Checklist
34. Purchase Savings Plans or Reserved Instances for stable workloads. If a resource has run consistently for 3+ months and you plan to keep it for another year, commit to it. 1-year no-upfront reservations save 30-40%. 3-year all-upfront saves 55-65%. Savings: 30-65%.
35. Use Compute Savings Plans for flexibility. Unlike Reserved Instances, which lock you to a specific instance type and region, Compute Savings Plans apply across any instance family, size, OS, tenancy, and region. The savings are lower (20-30%), but the flexibility is much greater. Savings: 20-30%.
36. Audit existing reservations for waste. Check whether you hold Reserved Instances for instance types you no longer use. Unused reservations still charge you. Use the AWS Cost Explorer RI Utilization report to check. Unused Standard RIs can be sold on the Reserved Instance Marketplace. Recovery: recoup unused commitment costs.
37. Negotiate Enterprise Discount Programs (EDPs). If your annual cloud spend exceeds $100,000, you likely qualify for an AWS EDP, GCP Committed Use Discounts, or Azure Enterprise Agreement. These provide 5-15% off your entire bill in exchange for a spending commitment. Savings: 5-15% on total spend.
38. Review third-party marketplace options. Some services are cheaper through the AWS Marketplace than direct from the vendor, and marketplace purchases count toward your EDP commitment. Check for databases, monitoring tools, and security products. Savings: varies, plus EDP credit.
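Items 34 and 36 both rest on the same arithmetic: a commitment bills for every hour whether or not the resource runs, so it only pays off above a certain utilization. The sketch below is a simplified model that ignores upfront payment timing; the discount values are taken from the ranges quoted above.

```python
def break_even_utilization(discount: float) -> float:
    """Fraction of hours a resource must run for a commitment to break even.

    A commitment costs (1 - discount) of the on-demand rate for every hour;
    on-demand costs the full rate only for hours actually used.
    """
    return 1.0 - discount


def commitment_is_cheaper(discount: float, utilization: float) -> bool:
    """True when committing costs less than paying on-demand for actual usage."""
    committed_cost = 1.0 - discount   # relative cost, billed for all hours
    on_demand_cost = utilization      # relative cost, billed for used hours
    return committed_cost < on_demand_cost


for discount in (0.30, 0.40, 0.65):
    needed = break_even_utilization(discount)
    print(f"{discount:.0%} discount breaks even at {needed:.0%} utilization")

print(commitment_is_cheaper(0.35, 1.0))   # always-on production service
print(commitment_is_cheaper(0.35, 0.5))   # resource running half the time
```

This is why the checklist pairs commitments with stable workloads: a resource that is scheduled off most of the week (item 5) is usually a poor candidate for a reservation.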
For a detailed comparison of commitment options, read our reserved instances vs. pay-as-you-go guide.
Section 7: Governance and FinOps Practices (Prevents 15-30% Future Waste)
These items do not save money immediately. They prevent the waste from coming back.
The Checklist
39. Set up budget alerts at 80%, 100%, and 120% thresholds. Use AWS Budgets, GCP Budget Alerts, or Azure Cost Alerts. This is free and takes 10 minutes. Prevention: catches anomalies before they become $1,000+ surprises.
40. Enable anomaly detection. AWS Cost Anomaly Detection (free) automatically flags unusual spending patterns. GCP and Azure have equivalent features. Prevention: catches misconfigurations, runaway queries, and forgotten resources.
41. Enforce tagging via policy. Use AWS Service Control Policies, Azure Policy, or GCP Organization Policies to prevent resource creation without required tags. Untagged resources are invisible to cost attribution and almost always become waste. Prevention: ensures every dollar is attributable.
42. Run monthly cost review meetings. Bring engineering leads and finance together to review the top 10 cost drivers, month-over-month trends, and upcoming changes. Even a 30-minute monthly meeting reduces spending by 15-25% over time simply by creating awareness. Savings: 15-25% through behavioral change.
43. Track unit economics. Cost per customer, cost per API call, cost per transaction. If your cost-per-customer is rising while your customer count is flat, you have an efficiency problem. Unit economics make waste visible at the business level, not just the infrastructure level. Prevention: catches efficiency degradation early.
44. Include cost checks in CI/CD pipelines. Use Infracost to estimate the cost impact of infrastructure changes before they are deployed. A Terraform change that adds $2,000/month to your bill should be reviewed, not auto-approved. Prevention: stops expensive changes before they ship.
45. Implement a cloud request/approval process. Any new service, database, or environment creation above a cost threshold (say $100/month) should require a brief justification and approval. This is not about bureaucracy. It is about preventing the "I spun up an extra cluster to test something" that runs for six months unnoticed. Prevention: reduces resource sprawl.
46. Schedule quarterly optimization reviews. Cloud pricing changes, your workload patterns change, and new savings opportunities emerge. What was optimized six months ago may not be optimal today. Set a recurring calendar event. Prevention: maintains savings over time.
47. Assign clear cost ownership to engineering teams. When no one owns the bill, everyone ignores it. Implement showback or chargeback so each team sees and owns their portion of cloud spend. Teams that see their costs reduce spending 20-30% without being asked to. Savings: 20-30% through accountability.
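The 80%, 100%, and 120% thresholds from item 39 are normally configured in the provider's budgeting tool, but the logic is simple enough to sketch if you want the same check in a script or report. The budget and spend figures below are hypothetical.

```python
ALERT_THRESHOLDS_PCT = (80, 100, 120)  # thresholds from item 39


def crossed_thresholds(spend_usd: int, budget_usd: int) -> list:
    """Return the alert thresholds (in percent) that current spend has reached.

    Works in whole dollars and integer percentages to avoid float rounding.
    """
    return [
        pct for pct in ALERT_THRESHOLDS_PCT
        if spend_usd * 100 >= budget_usd * pct
    ]


# Hypothetical month: $10,000 budget, $10,500 spent so far.
alerts = crossed_thresholds(10_500, 10_000)
print(f"thresholds reached: {alerts}")
```

In a real setup each threshold would route to a different audience, for example the owning team at 80% and engineering leadership at 120%.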
How to Prioritize: The Effort vs. Impact Matrix
Not all 47 items are equal. Here is how to prioritize:
| Priority | Actions | Effort | Expected Savings |
|---|---|---|---|
| Do today (30 min each) | #5, #6, #17, #25, #39, #40 | Minimal | 10-20% immediate |
| Do this week (1-2 hrs each) | #1, #2, #3, #9, #12, #16, #21, #22, #34 | Low | 15-30% additional |
| Do this month (half day each) | #4, #7, #10, #11, #18, #24, #26, #28, #29, #30, #31, #35 | Medium | 10-25% additional |
| Implement over quarter | #8, #13, #14, #15, #19, #20, #23, #27, #32, #33, #36, #37, #38, #41, #42, #43, #44, #45, #46, #47 | Higher | 5-15% ongoing prevention |
Start with the "do today" row. Those six actions take about 3 hours total and typically recover 10-20% of your cloud spend immediately.
The Numbers Behind This Checklist
Here is what a typical mid-size company ($20,000/month cloud bill) saves by working through each section:
| Section | Before | After | Monthly Savings |
|---|---|---|---|
| Compute | $8,000 | $5,200 | $2,800 |
| Database | $4,000 | $2,600 | $1,400 |
| Storage | $2,500 | $1,250 | $1,250 |
| Networking | $2,000 | $1,400 | $600 |
| Kubernetes overhead | $1,500 | $800 | $700 |
| Commitments (Savings Plans) | $2,000 | $1,400 | $600 |
| Total | $20,000 | $12,650 | $7,350 (37%) |
That is $88,200 per year back in your budget. Enough to hire another engineer, fund a new feature, or extend your runway by months.
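The totals in the table can be reproduced with a few lines of arithmetic, which is also a handy template for plugging in your own bill. The figures below are copied from the table above.

```python
# (before, after) monthly spend in USD, copied from the table above.
sections = {
    "Compute": (8000, 5200),
    "Database": (4000, 2600),
    "Storage": (2500, 1250),
    "Networking": (2000, 1400),
    "Kubernetes overhead": (1500, 800),
    "Commitments (Savings Plans)": (2000, 1400),
}

total_before = sum(before for before, _ in sections.values())
total_after = sum(after for _, after in sections.values())
monthly_savings = total_before - total_after
savings_pct = 100 * monthly_savings / total_before
annual_savings = 12 * monthly_savings

print(f"before:  ${total_before:,}")
print(f"after:   ${total_after:,}")
print(f"savings: ${monthly_savings:,}/month ({savings_pct:.0f}%)")
print(f"annual:  ${annual_savings:,}")
```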
Frequently Asked Questions
How often should I run through this checklist?
Do a full audit quarterly. Run the high-priority items (compute right-sizing, zombie cleanup, reservation review) monthly. Set up automated alerts and anomaly detection so the most critical issues surface automatically between reviews.
Can I automate most of these checks?
Yes. Tools like AWS Trusted Advisor, Kubecost, Infracost, and OpenCost automate many of these checks. Native cloud provider recommendations (Compute Optimizer, GCP Recommender, Azure Advisor) catch the most common issues automatically. But automated tools miss context. They cannot tell you whether a dev environment should be scheduled or deleted entirely. Human review still matters.
What if my team does not have time for monthly cost reviews?
Start with 15 minutes per week. Have one engineer review the top 5 cost anomalies in Cost Explorer during standup. That minimal investment catches most surprises. As your FinOps practice matures, formalize it into monthly reviews.
Should we hire a FinOps engineer or use a consultant?
If your cloud bill is under $50,000/month, a dedicated FinOps hire is hard to justify. A quarterly engagement with a cloud cost optimization partner delivers the same results at a fraction of the cost. Above $50,000/month, a dedicated FinOps engineer typically pays for themselves within the first quarter.
Does this checklist apply to multi-cloud environments?
Yes. Every section applies to AWS, Azure, and GCP. The specific tools differ (Cost Explorer vs. Cost Management vs. Cloud Billing), but the optimization principles are identical. Multi-cloud environments actually have more waste because resources duplicate across providers and visibility is fragmented. Read our multi-cloud cost optimization guide for provider-specific details.
What is the single most impactful item on this checklist?
Item #5: scheduling non-production environments to stop outside business hours. It requires minimal effort, carries zero risk, and typically saves 65-70% on non-production compute costs. For a company spending $5,000/month on dev/staging infrastructure, that is $3,250-3,500/month saved with about 2 hours of setup.
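As a rough check on where the 65-70% figure comes from, compare the hours an environment actually needs to run with the 168 hours in a week. The schedules below (10 or 12 hours a day, weekdays only) are assumptions for illustration; your team's working hours will differ.

```python
HOURS_PER_WEEK = 168


def scheduling_savings_fraction(hours_on_per_day: float, days_per_week: int = 5) -> float:
    """Fraction of always-on compute cost avoided by running only on a schedule."""
    hours_on = hours_on_per_day * days_per_week
    return 1 - hours_on / HOURS_PER_WEEK


def monthly_savings(monthly_spend_usd: float, hours_on_per_day: float) -> float:
    """Estimated monthly savings in USD for a weekday-only schedule."""
    return round(monthly_spend_usd * scheduling_savings_fraction(hours_on_per_day), 2)


for hours in (10, 12):
    fraction = scheduling_savings_fraction(hours)
    saved = monthly_savings(5000, hours)
    print(f"{hours}h/day on weekdays: {fraction:.0%} saved, about ${saved:,.0f} of $5,000")
```

The estimate only covers charges that stop when an instance stops; attached storage and similar fixed costs keep accruing.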
How do we prevent costs from creeping back up after optimization?
Items #39 through #47 (the governance section) exist specifically for this. Budget alerts catch immediate spikes. Anomaly detection catches gradual creep. Monthly reviews and unit economics tracking ensure savings persist. Without governance, optimization gains erode within 3-6 months as teams add new resources without cost awareness.
Get Started Now
You do not need to complete all 47 items to see results. Start with the six "do today" actions. Schedule your dev environments. Delete your zombie volumes and IPs. Turn on budget alerts and anomaly detection. Those actions alone will likely save 10-20% on your next bill.
Then work through the rest at your own pace, or bring in our team to run the full audit for you. We use this exact checklist (plus a few proprietary checks) when we optimize cloud infrastructure for our clients.
For ongoing cloud management after optimization, explore our cloud operations services. And if you are planning a migration that you want to get right the first time, check out our cloud migration approach.
Because every dollar you save on cloud waste is a dollar you can invest in building something that matters.