Cloud Optimization
Jan 24, 2026
By LeanOps Team

Stop Burning Cloud Dollars: 7 Proven Steps to Detect Waste and Modernize Infrastructure in 2026

The Cloud Bill Problem Nobody Wants to Admit

Here is a number your finance team will never show you in a board meeting: on average, 28% of cloud spend goes to resources that provide zero business value. Zero. Not "low value." Zero value.

That means if your company spends $200,000 per month on cloud, approximately $56,000 evaporates every single month into idle servers, orphaned storage, forgotten test environments, and misconfigured services nobody uses.

The Flexera 2025 State of the Cloud Report confirms this. And it has been true every single year since 2019. Despite every cloud provider adding new cost management tools, despite the entire FinOps industry growing to billions in revenue, despite thousands of teams "optimizing" their cloud spend every quarter, waste stays stubbornly at 25 to 35%.

Why? Because most teams treat cloud cost optimization as a cleanup project. They run a cost audit, fix the obvious things, declare victory, and move on. Six months later the waste is back, often worse than before.

This post gives you a different approach. Not a cleanup project. A permanent system. Seven steps that find your waste, eliminate it, prevent it from returning, and channel the savings directly into the infrastructure modernization that makes your systems faster, cheaper, and more resilient.

Let's start with what is actually driving your costs in 2026.


What Is Actually Driving Cloud Waste in 2026 (It Is Not What You Think)

Before we get into the solutions, you need to understand the problem correctly. Most cloud cost guides will tell you the obvious things: idle instances, oversized VMs, unused storage. Those are real, and we will cover them. But in 2026, three newer patterns are responsible for the fastest-growing portion of cloud waste.

Pattern 1: AI Workload Tail Spend

Teams now run AI and ML experiments constantly. Training jobs, embedding generation, fine-tuning runs, inference endpoint tests. Each one spins up compute, runs for a few hours, and then... stops running. But the infrastructure does not always stop with it.

GPU instances left running after a training job finishes. Inference endpoints provisioned for a demo that never got much traffic. Model artifacts duplicated across three S3 buckets because nobody agreed on a single location. Vector database instances sized for the production traffic you plan to have, not the development traffic you actually have.

AI tail spend is growing faster than any other category of cloud waste. And it is almost invisible in standard cost dashboards because it looks like "legitimate ML infrastructure."

Our guide on the hidden cost of AI cloud infrastructure covers this in depth.

Pattern 2: Microservices Data Transfer Explosion

As teams break monoliths into microservices, they gain flexibility. They also gain a bill they did not anticipate. Data transfer is billed per gigabyte, not per call. At typical AWS list prices, every gigabyte that crosses an availability zone costs about $0.01 in each direction. Every gigabyte that crosses regions costs about $0.02. Every gigabyte that leaves your cloud for the internet costs about $0.09.

In a microservices architecture with 50 services communicating constantly, data transfer costs can easily reach $5,000 to $15,000/month. And they are almost always buried in "EC2-Other" or "Data Transfer" line items with no attribution to specific services.
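To see how those per-GB rates turn into a monthly number, here is a minimal Python sketch. The rates are the ones quoted above; the traffic volumes and the function name are made up for illustration and should be replaced with figures from your own billing data.

```python
# Per-GB rates quoted in this post (USD). Check your provider's current pricing.
CROSS_AZ_PER_GB_EACH_WAY = 0.01
CROSS_REGION_PER_GB = 0.02
INTERNET_EGRESS_PER_GB = 0.09


def monthly_transfer_cost(cross_az_gb: float, cross_region_gb: float, egress_gb: float) -> float:
    """Estimate monthly data transfer cost from traffic volumes in GB."""
    # Cross-AZ traffic is charged in both directions, so each GB costs twice the one-way rate.
    cross_az = cross_az_gb * CROSS_AZ_PER_GB_EACH_WAY * 2
    cross_region = cross_region_gb * CROSS_REGION_PER_GB
    egress = egress_gb * INTERNET_EGRESS_PER_GB
    return round(cross_az + cross_region + egress, 2)


# Hypothetical month: 200 TB cross-AZ, 50 TB cross-region, 40 TB to the internet.
print(monthly_transfer_cost(200_000, 50_000, 40_000))
```

Running the same function per service pair, rather than for the whole account, is one way to get the attribution that the "EC2-Other" line item hides.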

Pattern 3: The Modernization Debt Tax

This is the one most teams do not see clearly. When you migrate a workload to the cloud without modernizing it (the classic "lift and shift"), you carry all the inefficiency of the old architecture into a pay-per-use environment.

A monolithic application that ran on a single on-premise server now runs on 6 oversized EC2 instances because the migration team provisioned for peak capacity and added redundancy "just in case." The same application, modernized and containerized, might run on 2 instances at a fraction of the cost.

That gap between "lift-and-shift cost" and "properly modernized cost" is what we call the modernization debt tax. It compounds monthly, and it never stops until you modernize.


Step 1: Do the Real Audit (Not the Dashboard Review)

Most cloud cost audits consist of someone opening Cost Explorer or Azure Cost Management, looking at the top 10 services, noting which ones went up, and scheduling a meeting to discuss it. That is not an audit. That is a viewing.

A real audit answers specific questions:

For every running resource:

  • Who owns this?
  • What workload does it serve?
  • What did it cost this month?
  • What is its current utilization (CPU, memory, storage I/O)?
  • When was it last used?

If you cannot answer all five questions for every significant resource, you do not have visibility. You have a bill.
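The five questions map naturally onto one record per resource. The sketch below is a hypothetical structure, not a prescribed schema: the field names are assumptions, and average CPU stands in for the broader utilization question.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class ResourceRecord:
    """One row of the audit; None means the question is still unanswered."""
    resource_id: str
    owner: Optional[str] = None
    workload: Optional[str] = None
    monthly_cost_usd: Optional[float] = None
    avg_cpu_percent: Optional[float] = None  # stand-in for the utilization question
    last_used: Optional[date] = None


def unanswered_questions(record: ResourceRecord) -> list[str]:
    """Return the names of the audit fields that are still unknown."""
    fields = ["owner", "workload", "monthly_cost_usd", "avg_cpu_percent", "last_used"]
    return [name for name in fields if getattr(record, name) is None]


# Made-up example records.
records = [
    ResourceRecord("i-0abc", "payments-team", "checkout-api", 412.0, 11.5, date(2026, 1, 20)),
    ResourceRecord("vol-0def", monthly_cost_usd=38.0),
]
for r in records:
    gaps = unanswered_questions(r)
    print(r.resource_id, "OK" if not gaps else f"missing: {', '.join(gaps)}")
```

Any record with gaps is a resource you have a bill for but no visibility into.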

How to Run the Real Audit

On AWS: Export your Cost and Usage Report to S3 and query it with Athena. This gives you line-item detail down to the individual resource ID, not just service-level totals. Join it with AWS Config data to see resource metadata (tags, creation date, last activity) alongside costs.

On Azure: Use Azure Resource Graph queries to list all resources across all subscriptions with their associated costs. Azure Cost Management's resource-level export provides the data; Resource Graph provides the metadata.

On GCP: Export billing data to BigQuery. GCP's billing export schema includes project-level, service-level, and SKU-level detail. Query it to find the specific resources driving your top 20 line items.

The time investment: This real audit takes 4 to 8 hours for a mid-sized environment. Most teams resist doing it because it feels like a lot of work. But the alternative is running a broken cleanup cycle forever.
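Whichever export you use, the core of the audit is the same aggregation: sum cost per resource ID and sort. Here is a minimal Python sketch over a made-up CSV sample. The column names follow the AWS Cost and Usage Report naming as we understand it (line_item_resource_id, line_item_unblended_cost); treat them as assumptions and check them against your own export.

```python
import csv
import io
from collections import defaultdict

# Made-up sample rows standing in for a billing export.
SAMPLE_EXPORT = """line_item_resource_id,line_item_product_code,line_item_unblended_cost
i-0abc,AmazonEC2,120.50
vol-0def,AmazonEC2,38.00
i-0abc,AmazonEC2,95.25
db-prod-1,AmazonRDS,365.00
,AmazonEC2,12.00
"""


def cost_by_resource(export_csv: str) -> list[tuple[str, float]]:
    """Sum cost per resource ID and return rows sorted from most to least expensive."""
    totals: dict[str, float] = defaultdict(float)
    for row in csv.DictReader(io.StringIO(export_csv)):
        # Some line items carry no resource ID; group them so they stay visible.
        resource_id = row["line_item_resource_id"] or "(no resource id)"
        totals[resource_id] += float(row["line_item_unblended_cost"])
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


for resource_id, cost in cost_by_resource(SAMPLE_EXPORT):
    print(f"{resource_id:20s} ${cost:,.2f}")
```

In practice you would run the equivalent GROUP BY in Athena or BigQuery rather than in Python; the sketch only shows the shape of the result you are after.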


Step 2: Score Your Waste Objectively

After the audit, you have data. Now you need a framework for deciding what to fix first.

Here is the waste scoring model we use. For each resource category, calculate a waste score based on two dimensions: financial impact and remediation difficulty.

| Category | High Impact, Easy Fix (Do Now) | High Impact, Hard Fix (Plan) | Low Impact, Easy Fix (Batch) |
| --- | --- | --- | --- |
| Idle compute | Unattached VMs, orphaned instances | Oversized prod databases | Oversized dev databases |
| Ghost storage | Orphaned EBS/disks, unused snapshots | Redundant data copies across regions | Old log archives |
| Data transfer | Missing VPC endpoints, cross-AZ patterns | Cross-cloud data movement | CDN optimization |
| Commitment gaps | Obvious on-demand for stable workloads | Complex multi-cloud commitment strategy | Minor regional gaps |
| AI/ML waste | Idle GPU instances, stale endpoints | Training pipeline redundancy | Experiment artifact storage |

Focus your first week on "High Impact, Easy Fix." These are your quick wins. They are typically responsible for 60% to 70% of total waste but take only 10% to 20% of the remediation effort.
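One way to apply the two dimensions is to bucket each finding by a dollar threshold and an effort threshold. The sketch below is an illustration only: the thresholds, the effort estimate in days, and the "defer" bucket for low-impact hard fixes are our assumptions, not part of the table above.

```python
from dataclasses import dataclass


@dataclass
class WasteItem:
    name: str
    monthly_waste_usd: float
    effort_days: float  # rough estimate of remediation effort


def quadrant(item: WasteItem, impact_threshold: float = 500.0, effort_threshold: float = 2.0) -> str:
    """Place an item in one of the buckets from the scoring table."""
    high_impact = item.monthly_waste_usd >= impact_threshold
    easy = item.effort_days <= effort_threshold
    if high_impact and easy:
        return "do now"
    if high_impact:
        return "plan"
    if easy:
        return "batch"
    return "defer"  # low impact, hard fix: not in the table, parked here


# Made-up findings from an audit.
findings = [
    WasteItem("orphaned EBS volumes", 1200.0, 0.5),
    WasteItem("oversized prod database", 2500.0, 10.0),
    WasteItem("old log archives", 90.0, 1.0),
]
for item in sorted(findings, key=lambda i: i.monthly_waste_usd, reverse=True):
    print(f"{quadrant(item):7s} {item.name} (${item.monthly_waste_usd:,.0f}/month)")
```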


Step 3: Kill Ghost Infrastructure Permanently

Ghost infrastructure is the most consistent source of waste across every cloud environment. Idle instances, orphaned storage, unused IPs, forgotten load balancers. We have covered this in detail in our 12-strategy guide to eliminating ghost servers, so here is the condensed version.

The three-step ghost elimination process:

  1. Find them: Run the quick-find commands for unattached disks:
       AWS: aws ec2 describe-volumes --filters Name=status,Values=available
       Azure: az disk list --query "[?managedBy==null]"
       GCP: gcloud compute disks list --filter="-users:*"
     These commands cover unattached disks only. Cleaning up that one category typically saves $500 to $5,000/month.

  2. Tag or terminate: Any resource without a documented owner gets a 48-hour notification window. If nobody claims it, terminate. This is not aggressive. This is hygiene.

  3. Automate prevention: Use expiry-date tags on all non-production resources. Any resource tagged with an expiry date that has passed gets automatically reviewed. This prevents the next wave of ghost infrastructure before it accumulates.
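The tag-or-terminate rule and the expiry check can be written as one small decision function. This is a sketch under assumptions: the tag keys "owner" and "expiry-date" are hypothetical names, the expiry value is assumed to be an ISO date, and the function only returns a decision. It does not call any cloud API.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

GRACE_PERIOD = timedelta(hours=48)


def ghost_action(tags: dict[str, str], notified_at: Optional[datetime], now: datetime) -> str:
    """Decide what to do with a resource under the tag-or-terminate rule."""
    expiry = tags.get("expiry-date")  # hypothetical tag key, ISO date such as 2026-02-01
    if expiry and datetime.fromisoformat(expiry).replace(tzinfo=timezone.utc) < now:
        return "review"  # expiry date has passed
    if tags.get("owner"):
        return "keep"
    if notified_at is None:
        return "notify"  # start the 48-hour window
    if now - notified_at >= GRACE_PERIOD:
        return "terminate"
    return "wait"


now = datetime(2026, 1, 24, 12, 0, tzinfo=timezone.utc)
print(ghost_action({}, None, now))
print(ghost_action({}, now - timedelta(hours=48), now))
print(ghost_action({"owner": "team-a", "expiry-date": "2026-01-01"}, None, now))
```

Keeping the decision separate from the action makes it easy to run the whole thing in report-only mode for a few weeks before anything is actually terminated.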

Typical impact: 8% to 15% reduction in total spend from ghost cleanup alone.


Step 4: Implement Scheduling for Non-Production (The 28% Find)

This step consistently surprises teams with how much it saves. Your dev, staging, QA, and sandbox environments run 24/7. Your team uses them for roughly 50 hours per week (and that is generous). The other 118 hours per week, these environments idle at full cost.

The math:

  • Non-production typically represents 35% to 45% of total cloud spend
  • 70% of the week is idle time
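  • Scheduling non-production = 35% to 45% x 70% = 24.5% to 31.5% of the total bill, or about 28% at the 40% midpoint. Treat this as an upper bound: it assumes everything in non-production can be stopped when idle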

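Use AWS Instance Scheduler, Azure Automation runbooks, or GCP instance schedules. Start and stop EC2/VM instances and RDS/Cloud SQL databases based on business hours. NAT Gateways and load balancers have no stopped state, so for those the schedule has to delete them at night and recreate them in the morning, which is easiest when they are defined in infrastructure as code.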

Do not just schedule the VMs. Schedule everything that bills hourly. A dev RDS instance on db.r6g.xlarge costs $365/month running 24/7. On a business hours schedule it costs $108/month. That $257/month difference per database adds up fast.
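The scheduling arithmetic is simple enough to check in a few lines of Python. This sketch uses the figures from this section (a 50-hour working week and a $365/month database) and assumes cost scales linearly with running hours, which ignores storage that keeps billing while an instance is stopped.

```python
HOURS_PER_WEEK = 168


def scheduled_cost(always_on_monthly_usd: float, active_hours_per_week: float) -> float:
    """Monthly cost if the resource only bills while it is running."""
    return always_on_monthly_usd * active_hours_per_week / HOURS_PER_WEEK


def bill_share_saved(non_prod_share: float, active_hours_per_week: float) -> float:
    """Fraction of the total bill removed by stopping non-production when idle."""
    idle_fraction = 1 - active_hours_per_week / HOURS_PER_WEEK
    return non_prod_share * idle_fraction


print(f"dev database: ${scheduled_cost(365, 50):.2f}/month on a 50-hour week")
for share in (0.35, 0.40, 0.45):
    print(f"non-prod at {share:.0%} of spend: {bill_share_saved(share, 50):.1%} of total bill")
```

The database comes out at about $108.63/month, which this post rounds down to $108. The bill-share figures land slightly above the rounded math because 118 idle hours out of 168 is 70.2%, not exactly 70%.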

For the complete automation setup, read our guide on automated cloud cost optimization.


Step 5: Right-Size Before You Commit

This is the sequencing mistake that costs teams hundreds of thousands of dollars per year. They look at their stable workloads, decide to buy Reserved Instances or Savings Plans, and lock in a 1 or 3-year commitment.

Then they right-size a month later and realize they just reserved capacity for oversized instances.

The correct order is always: right-size first, then commit.

Here is why. If you right-size a fleet of m5.4xlarge instances (running at 15% CPU) down to m5.xlarge, you are now running a quarter of the vCPUs: 4 per instance instead of 16. Your optimal commitment is completely different. By committing before right-sizing, you either over-commit (paying for reserved capacity you will not use after right-sizing) or under-commit (not capturing the full discount on your new, smaller footprint).

The Right-Sizing Process

  1. Pull 30 days of CPU, memory, and network utilization data for every instance in your environment
  2. Identify any instance averaging below 40% CPU utilization
  3. Estimate the savings from dropping one instance size (within a family, each size step down typically halves the hourly price)
  4. Test the smaller size in staging first
  5. Migrate production with a 1-hour rollback window

After right-sizing, calculate your stable baseline and commit to 70% of it with Savings Plans (AWS), Reserved VM Instances (Azure), or Committed Use Discounts (GCP).
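To make the sequencing concrete, here is a minimal Python sketch that sizes a commitment before and after right-sizing. The fleet, hourly prices, and names are invented for illustration. The 40% CPU threshold, the roughly 50% saving per size step, and the 70% coverage figure come from the process above.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    name: str
    hourly_usd: float
    avg_cpu_percent: float  # 30-day average


def rightsized_hourly(inst: Instance, cpu_threshold: float = 40.0) -> float:
    """Drop one size (roughly half the price) when average CPU is below the threshold."""
    if inst.avg_cpu_percent < cpu_threshold:
        return inst.hourly_usd / 2
    return inst.hourly_usd


def commitment_target(fleet: list[Instance], coverage: float = 0.70) -> float:
    """Hourly commitment sized at `coverage` of the right-sized baseline."""
    baseline = sum(rightsized_hourly(i) for i in fleet)
    return baseline * coverage


# Hypothetical fleet with made-up hourly prices.
fleet = [
    Instance("api-1", 0.80, 15.0),
    Instance("api-2", 0.80, 12.0),
    Instance("worker-1", 0.40, 65.0),
]
before = sum(i.hourly_usd for i in fleet) * 0.70
after = commitment_target(fleet)
print(f"commit before right-sizing: ${before:.2f}/hour")
print(f"commit after right-sizing:  ${after:.2f}/hour")
```

The gap between the two numbers is commitment you would keep paying for over the 1 or 3-year term without using it.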

For more on the commitment decision, read our guide on reserved instances vs pay-as-you-go in 2026.


Step 6: Modernize the Workloads That Are Costing You the Most

This is where cloud cost optimization transforms from a cost-cutting exercise into a strategic initiative. The biggest cost reductions available to most organizations are not in cleaning up waste. They are in modernizing the architecture of expensive workloads.

Here is a typical comparison for a mid-sized SaaS application:

| Workload | Lift-and-Shift Cost | Containerized Cost | Serverless Cost | Savings |
| --- | --- | --- | --- | --- |
| Web API (spiky traffic) | $3,200/mo (4x m5.2xlarge) | $1,400/mo (Kubernetes, right-sized) | $200/mo (Lambda) | 56-94% |
| Background jobs (batch) | $1,800/mo (always-on workers) | $600/mo (K8s spot jobs) | $80/mo (Lambda) | 67-96% |
| Database (underutilized) | $730/mo (db.r6g.xlarge Multi-AZ) | N/A | $150/mo (Aurora Serverless) | 79% |
| ML inference endpoint | $4,200/mo (GPU, always-on) | $1,100/mo (K8s, scale-to-zero) | $400/mo (serverless inference) | 74-90% |

These are not theoretical. These are the savings ranges teams consistently achieve when they modernize the right workloads in the right order.

How to Prioritize Modernization

Start with your top 5 most expensive workloads. For each one, estimate the modernized cost. Sort by absolute savings opportunity. Modernize the biggest opportunities first.
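Sorting by absolute savings rather than percentage changes the order more than you might expect. The Python sketch below uses the figures from the comparison above, taking the cheapest modernized option for each workload; the function name is ours.

```python
# (workload, current monthly cost, estimated modernized monthly cost) in USD,
# using the lowest modernized figure for each row of the comparison.
WORKLOADS = [
    ("Web API", 3200, 200),
    ("Background jobs", 1800, 80),
    ("Database", 730, 150),
    ("ML inference endpoint", 4200, 400),
]


def rank_by_savings(workloads):
    """Sort workloads by absolute monthly savings, largest first."""
    ranked = []
    for name, current, modernized in workloads:
        saved = current - modernized
        ranked.append((name, saved, round(100 * saved / current)))
    return sorted(ranked, key=lambda row: row[1], reverse=True)


for name, saved, pct in rank_by_savings(WORKLOADS):
    print(f"{name:22s} saves ${saved:,}/month ({pct}%)")
```

Background jobs show the highest percentage saving but rank third in dollars, which is why the percentage column alone is a poor guide to what to modernize first.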

The modernization projects with the fastest ROI are usually:

  1. Moving always-on batch jobs to spot/preemptible Kubernetes jobs
  2. Converting infrequently called APIs to serverless
  3. Moving dev/test databases from Multi-AZ to single-AZ (or serverless)
  4. Containerizing monolithic web applications

Our Kubernetes cost optimization guide and serverless cost optimization guide cover the implementation details for each path.

For full modernization support, explore our Cloud Migration and Modernization service.


Step 7: Build a System That Prevents the Next Wave of Waste

Every cost audit is temporary. Without preventive systems, waste returns. Here is the prevention stack that keeps waste permanently low:

Automated Ghost Detection (Weekly)

A Lambda function or Cloud Function runs every Sunday night. It finds orphaned resources across all accounts, snapshots volumes before deletion, removes them, and posts a summary to Slack. Cost of running this automation: approximately $0.05/week. Savings: $500 to $5,000/month.
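The reporting half of that function can be sketched without touching any cloud API. The Python below takes volume records shaped like the entries an EC2 volume listing returns (VolumeId, Size, State) and builds the summary text. The $0.08 per GB-month rate is a placeholder assumption, and snapshotting, deletion, and the Slack post are deliberately left out.

```python
def summarize_orphans(volumes: list[dict], usd_per_gb_month: float = 0.08) -> str:
    """Build a chat-ready summary of unattached volumes."""
    # A volume in the "available" state is not attached to any instance.
    orphans = [v for v in volumes if v.get("State") == "available"]
    total_gb = sum(v.get("Size", 0) for v in orphans)
    lines = [
        f"Weekly ghost report: {len(orphans)} unattached volume(s), {total_gb} GB, "
        f"about ${total_gb * usd_per_gb_month:,.2f}/month"
    ]
    lines += [f"- {v['VolumeId']} ({v['Size']} GB)" for v in orphans]
    return "\n".join(lines)


# Made-up sample data.
sample = [
    {"VolumeId": "vol-1", "Size": 100, "State": "available"},
    {"VolumeId": "vol-2", "Size": 50, "State": "in-use"},
    {"VolumeId": "vol-3", "Size": 400, "State": "available"},
]
print(summarize_orphans(sample))
```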

Real-Time Anomaly Detection (Continuous)

AWS Cost Anomaly Detection, Azure Cost Alerts, and GCP Budget Alerts watch your spend 24/7 and alert you within hours of any abnormal spend. Setup time: 15 minutes per cloud provider. Cost: free on all three.

For the full setup guide, read our post on real-time cloud cost optimization.

Infrastructure Cost Gates (Every PR)

Add Infracost to your Terraform or IaC pipeline. Every pull request that modifies infrastructure gets an automatic cost estimate. Engineers see the cost impact before they merge. This catches expensive mistakes before they deploy.
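If you want the estimate to block a merge rather than just inform it, the gate itself is a small comparison. The Python sketch below is generic: it assumes you have already extracted the before and after monthly estimates from your cost tool, and the $200 threshold and function name are hypothetical. It does not parse Infracost output.

```python
def cost_gate(baseline_monthly_usd: float, proposed_monthly_usd: float,
              max_increase_usd: float = 200.0) -> tuple[bool, str]:
    """Return (passed, message) for a pull request's estimated monthly cost change."""
    delta = proposed_monthly_usd - baseline_monthly_usd
    if baseline_monthly_usd > 0:
        pct_text = f"{delta / baseline_monthly_usd:+.1%}"
    else:
        pct_text = "n/a"  # no baseline to compare against
    passed = delta <= max_increase_usd
    verdict = "within budget" if passed else "needs cost review"
    return passed, f"Estimated change: ${delta:+,.2f}/month ({pct_text}), {verdict}"


print(cost_gate(1000.0, 1150.0)[1])
print(cost_gate(1000.0, 1300.0)[1])
```

A CI step would exit non-zero when the first element is False, which turns the estimate from a comment into a required review.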

Weekly FinOps Review (30 Minutes)

One meeting, 30 minutes, every week. Review the top 5 cost changes from the previous week. Discuss anomalies. Track active optimization initiatives. This meeting pays for itself in the first 15 minutes every single week.

Mandatory Tagging Enforcement

Use AWS Service Control Policies, Azure Policy, or GCP Organization Policies to deny resource creation without required tags (team, environment, service, cost-center). Untagged resources are unaccountable resources.
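Policy engines enforce tags at creation time, but resources that predate the policy still need a sweep. Here is a minimal Python sketch of that check using the four required tags named above; how you fetch each resource's tags is left out and depends on your provider.

```python
REQUIRED_TAGS = ("team", "environment", "service", "cost-center")


def missing_tags(tags: dict[str, str]) -> list[str]:
    """Return required tag keys that are absent or blank."""
    return [key for key in REQUIRED_TAGS if not tags.get(key, "").strip()]


# Made-up examples.
print(missing_tags({"team": "payments", "environment": "prod",
                    "service": "checkout", "cost-center": "cc-104"}))
print(missing_tags({"team": "payments", "environment": " "}))
```

Treating a blank value as missing matters in practice, because an empty tag satisfies many "tag must exist" policies without making the resource any more accountable.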


The Cloud Cost Optimization Checklist for 2026

Use this to audit your current state and prioritize your next 30 days:

Week 1: Find the Waste

  • Run a real audit: pull CUR/billing exports and query at resource level
  • Identify all unattached disks, unused IPs, and empty load balancers
  • List all non-production environments and their current running costs
  • Find every resource with no owner tag

Week 2: Quick Wins

  • Delete or snapshot orphaned storage volumes
  • Release unused Elastic IPs and static IPs
  • Delete load balancers with zero healthy targets
  • Set up non-production scheduling (off nights and weekends)

Week 3: Optimization

  • Enable anomaly detection on all cloud accounts (free)
  • Pull 30 days of utilization data and identify right-sizing candidates
  • Review and act on AWS Compute Optimizer / Azure Advisor / GCP Recommender
  • Add Infracost to at least one Terraform repository

Week 4: Prevention and Strategy

  • Enforce tagging with SCPs or equivalent policies
  • Right-size your top 10 most oversized instances
  • Identify your top 5 workloads by cost and estimate modernization savings
  • Schedule the first weekly FinOps review meeting

The Honest Talk About Cloud Cost Optimization in 2026

Let me say something that the FinOps tools and consulting firms will not say loudly: cloud cost optimization alone has a ceiling.

If you clean up all your waste, right-size everything, buy perfect commitments, and automate every possible optimization, you might save 35% to 45% on your current bill. That is significant. That is real money.

But the teams that save 60%, 70%, or even 80% on cloud costs? They modernized their architecture. They stopped lifting and shifting and started actually building for the cloud. They containerized their monoliths. They moved batch jobs to serverless. They stopped provisioning for peak and started designing for elasticity.

Cost optimization and infrastructure modernization are not separate initiatives. They are the same initiative, approached from different angles. Optimization frees up the budget. Modernization locks in permanent savings that no amount of cleanup can achieve.

The seven steps in this post cover both. Start with the cleanup. Build the prevention system. Then invest the savings into the modernization projects that make the savings permanent.

That is how you stop burning cloud dollars for good.

Want to find out exactly how much you are overpaying and which workloads are worth modernizing first? Take our free Cloud Waste and Risk Scorecard for a personalized assessment in under 5 minutes.

For expert help building and executing your optimization and modernization strategy, explore our Cloud Cost Optimization and FinOps service.

