Cloud Cost Optimization
May 8, 2026
By Ravi Kanani

Terraform Drift Is Silently Adding $8K-40K/Year to Your Cloud Bill (Here's How to Find It)

Key Takeaway

Terraform drift causes $8K-40K/year in hidden cloud waste for the average 50-100 resource AWS account. The top offenders: manually created instances never added to state ($2K-8K/year), orphaned EBS volumes left behind by destroys ($500-3K/year), and unattached Elastic IPs quietly billing every month ($200-800/year). A weekly 10-minute drift audit catches 90% of cost leaks before they compound.

The $40K Your Terraform State File Doesn't Know About

Here is a scenario we encounter in nearly every cloud cost assessment at LeanOps: a team manages their infrastructure with Terraform, has proper CI/CD for deployments, reviews pull requests on HCL changes, and still overspends by 15-30% on resources that Terraform has no idea exist.

The reason is drift. Not the kind that shows up in terraform plan (that is state drift, and most teams catch it). The expensive kind: resources provisioned outside Terraform entirely, never added to state, never tracked, and never cleaned up.

A developer spins up an RDS instance via the console to debug a production issue. An SRE creates a load balancer manually during an incident. A data engineer launches a large EC2 instance for a one-time migration and forgets to terminate it. An auto-scaling group creates instances that Terraform does not manage. Each of these is invisible to your IaC pipeline. They do not show up in terraform plan. They do not appear in your module outputs. They simply run, bill, and accumulate.

Across 30 AWS accounts we have audited in the past year, the average drift-caused waste is $8,000 to $40,000 per year. For larger environments (500+ managed resources), we have found drift waste exceeding $150,000 annually. The worst part: these resources often run for months or years before anyone notices, because they exist in a governance blind spot between "infrastructure the team manages" and "infrastructure that exists."

This post covers the 7 most expensive drift patterns, how to detect each one in under 10 minutes, and the exact playbook to prevent drift from becoming a recurring cost leak.


The 7 Most Expensive Terraform Drift Patterns

Not all drift is expensive. A drifted tag or a modified description costs nothing. These seven patterns are the ones that consistently show up as significant line items on cloud bills.

Pattern 1: Manually Provisioned Compute (Cost: $2,000-8,000/year)

What happens: An engineer creates an EC2 instance, ECS service, or Lambda function via the AWS Console or CLI for debugging, testing, or a quick fix. The intent is temporary. The resource becomes permanent because no one tracks it, no terraform destroy removes it, and billing alerts are set at the account level (not the resource level).

Why it persists: The resource has no Terraform state entry, so terraform plan never shows it. It is not tagged with an owning team or expiration date. The engineer who created it moves to a different project. Monthly cost reviews look at aggregate spending, not individual resource inventories.

Real example: We found a client running 3x m5.xlarge instances ($0.192/hour each) that were created 14 months earlier for a load test. Total waste: roughly $5,900 over those 14 months (3 x $0.192/hour x 730 hours/month x 14). Nobody noticed because the instances represented less than 5% of the account's total EC2 spend.

Detection:

# Find running EC2 instances not tagged as Terraform-managed
# (uses the same ManagedBy=terraform convention as the audit script below)
aws ec2 describe-instances \
  --filters "Name=instance-state-name,Values=running" \
  --query 'Reservations[].Instances[?!Tags[?Key==`ManagedBy` && Value==`terraform`]].[InstanceId,InstanceType,LaunchTime]' \
  --output table

Pattern 2: Orphaned EBS Volumes (Cost: $500-3,000/year)

What happens: Terraform destroys an EC2 instance but the EBS volume has delete_on_termination = false (which is the default for additional volumes in many modules). The instance disappears from state. The volume remains, unattached, billing monthly.

Why it persists: Unattached EBS volumes generate no CloudWatch metrics, no alarms, and no alerts. They sit in "available" state indefinitely. At $0.10/GB-month for gp2 ($0.08 for gp3), a 500GB gp2 volume costs $50/month ($600/year) doing absolutely nothing.

Typical accumulation: Teams running ephemeral workloads (CI runners, batch processing, dev environments) commonly accumulate 20-50 orphaned volumes over a year. At an average of 100GB each: 30 volumes x 100GB x $0.10 = $300/month = $3,600/year.
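The accumulation math above is easy to sanity-check. A minimal sketch, using the $0.10/GB-month rate from the example (gp3 would be cheaper):

```python
GB_MONTH_RATE = 0.10  # gp2 rate used in the example; gp3 is roughly $0.08

def orphaned_volume_cost(sizes_gb, rate=GB_MONTH_RATE):
    """Monthly and annual cost of a pile of unattached EBS volumes."""
    monthly = sum(sizes_gb) * rate
    return monthly, monthly * 12

# 30 forgotten 100GB volumes, as in the scenario above
monthly, annual = orphaned_volume_cost([100] * 30)
print(f"${monthly:.0f}/month, ${annual:.0f}/year")  # $300/month, $3600/year
```

Feed it the Size column from the detection command below and you get a dollar figure instead of a volume list.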

Detection:

# List unattached EBS volumes (sum the Size column x per-GB rate for total cost)
aws ec2 describe-volumes \
  --filters "Name=status,Values=available" \
  --query 'Volumes[].[VolumeId,Size,CreateTime]' \
  --output table

Pattern 3: Forgotten Load Balancers (Cost: $2,000-5,000/year)

What happens: An ALB or NLB is created for a service that later gets decommissioned. The Terraform module for the service is destroyed, but the load balancer was in a separate module or was created manually. It continues running with no targets, processing no traffic, but billing the fixed hourly rate.

Why it costs so much: An idle ALB costs $16.20/month in fixed charges ($0.0225/hour) plus $0.008/LCU-hour even at minimum. An NLB costs $16.20/month minimum. Over a year, a single forgotten ALB costs $194. But most environments have 3-8 forgotten load balancers: that is $600-1,550/year in fixed charges alone, plus the associated Elastic IPs, target groups, and WAF rules that often attach to them.
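The fixed-charge arithmetic checks out; a sketch using the post's 720-hour month and the $0.0225/hour ALB rate (LCU charges and attached EIPs/WAF excluded):

```python
ALB_HOURLY = 0.0225    # fixed charge per ALB-hour; LCU charges excluded
HOURS_PER_MONTH = 720

def idle_alb_cost(count, months=12):
    """Fixed-charge cost of idle ALBs over a period."""
    return count * ALB_HOURLY * HOURS_PER_MONTH * months

print(f"1 ALB: ${idle_alb_cost(1):.2f}/year")  # $194.40/year
print(f"3-8 ALBs: ${idle_alb_cost(3):.0f}-${idle_alb_cost(8):.0f}/year")  # $583-$1555/year
```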

Detection:

# Find target groups with no healthy targets
# (describe-target-health requires a target group ARN, so loop over target groups)
for tg in $(aws elbv2 describe-target-groups --query 'TargetGroups[].TargetGroupArn' --output text); do
  healthy=$(aws elbv2 describe-target-health --target-group-arn "$tg" \
    --query 'length(TargetHealthDescriptions[?TargetHealth.State==`healthy`])' --output text)
  [ "$healthy" -eq 0 ] && echo "No healthy targets: $tg"
done

Pattern 4: Stale NAT Gateways (Cost: $1,000-4,000/year)

What happens: A VPC is created with NAT Gateways for private subnet internet access. The workloads in those private subnets are later moved or decommissioned. The NAT Gateway remains because it is in a shared networking module that nobody wants to touch.

Cost structure: NAT Gateways cost $0.045/hour ($32.40/month) per gateway in fixed charges, plus $0.045/GB processed. Even with zero traffic, two NAT Gateways (one per AZ for HA) cost $64.80/month = $778/year. With even modest residual traffic (DNS lookups, health checks from resources in the subnet), the bill climbs to $100-150/month = $1,200-1,800/year.

Detection:

# Find NAT Gateways with < 1GB processed in last 7 days
# (date -v-7d below is BSD/macOS syntax; on Linux use $(date -d '7 days ago' +%Y-%m-%dT%H:%M:%S))
aws cloudwatch get-metric-statistics \
  --namespace AWS/NATGateway \
  --metric-name BytesOutToDestination \
  --period 604800 --statistics Sum \
  --start-time $(date -v-7d +%Y-%m-%dT%H:%M:%S) \
  --end-time $(date +%Y-%m-%dT%H:%M:%S) \
  --dimensions Name=NatGatewayId,Value=nat-XXXXX

Pattern 5: Elastic IPs Without Attachments (Cost: $200-800/year)

What happens: Elastic IPs are allocated for instances or load balancers that are later terminated. The EIP remains allocated but unattached. AWS charges $0.005/hour ($3.60/month) for unattached EIPs.

Why it accumulates: Individual EIP cost ($3.60/month) is small enough to never trigger a billing alarm. But teams commonly accumulate 5-20 unused EIPs over time: that is $18-72/month = $216-864/year. Additionally, AWS limits EIPs to 5 per region by default, so orphaned EIPs can block new allocations, forcing limit increase requests.
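The same per-hour arithmetic, applied to unattached EIPs (a sketch; counts are illustrative):

```python
EIP_HOURLY = 0.005     # charge per unattached EIP-hour
HOURS_PER_MONTH = 720

def unattached_eip_cost(count, months=12):
    """Annual cost of EIPs allocated but attached to nothing."""
    return count * EIP_HOURLY * HOURS_PER_MONTH * months

# 5-20 unused EIPs, as above
print(f"${unattached_eip_cost(5):.0f}-${unattached_eip_cost(20):.0f}/year")  # $216-$864/year
```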

Detection:

# Find all unassociated Elastic IPs
aws ec2 describe-addresses \
  --query 'Addresses[?!AssociationId].[PublicIp,AllocationId]' \
  --output table

Pattern 6: Oversized RDS Instances from Manual Scaling (Cost: $3,000-15,000/year)

What happens: During a traffic spike or performance issue, someone manually scales an RDS instance from db.r6g.large to db.r6g.xlarge via the console. The crisis passes. The Terraform state still shows the old instance class. Nobody scales it back down because terraform plan shows a "change" that would cause downtime, and nobody wants to schedule the maintenance window.

Cost impact: The difference between db.r6g.large ($0.26/hr) and db.r6g.xlarge ($0.52/hr) is $0.26/hour = $189.80/month = $2,278/year. For Multi-AZ (which doubles the cost), the drift waste is $4,556/year on a single instance. We have found clients with 3-5 manually scaled RDS instances that were never scaled back: $7,000-23,000/year in avoidable spend.
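Those figures use a 730-hour month; the arithmetic behind them, with the on-demand rates quoted above:

```python
HOURS_PER_MONTH = 730  # the convention behind the $189.80/month figure

def rds_drift_waste(old_hourly, new_hourly, multi_az=False, months=12):
    """Annual overspend from an instance left at a larger class."""
    delta = (new_hourly - old_hourly) * HOURS_PER_MONTH * months
    return delta * 2 if multi_az else delta

print(f"Single-AZ: ${rds_drift_waste(0.26, 0.52):,.0f}/year")               # ~$2,278
print(f"Multi-AZ:  ${rds_drift_waste(0.26, 0.52, multi_az=True):,.0f}/year")  # ~$4,555
```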

Detection:

# Compare Terraform state to actual RDS instance classes
terraform state show aws_db_instance.main | grep instance_class
aws rds describe-db-instances \
  --query 'DBInstances[].[DBInstanceIdentifier,DBInstanceClass]' \
  --output table

Pattern 7: Abandoned CloudWatch Log Groups (Cost: $500-5,000/year)

What happens: Terraform creates a service with CloudWatch Logs. The service is decommissioned, but the log group persists (log groups are not automatically deleted). Old logs accumulate in storage. With no retention policy set, logs grow indefinitely at $0.03/GB/month for storage.

Why it compounds: A single application generating 1GB/day of logs, running for a year with no retention policy, accumulates 365GB of stored logs = $10.95/month in perpetual storage. Multiply by 10-20 abandoned services, and storage costs reach $100-200/month = $1,200-2,400/year. The insidious part: the log ingestion stopped (no new charges), but the storage bill grows every month from historical data that nobody will ever query.
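The "bill that never stops" effect is worth seeing as a calculation. A sketch using the assumptions above (1GB/day ingested, $0.03/GB-month storage, no retention policy):

```python
LOG_STORAGE_RATE = 0.03  # $/GB-month for CloudWatch Logs storage
GB_PER_DAY = 1.0         # assumed ingestion rate while the service ran

def stored_log_cost(days_running):
    """Perpetual monthly storage bill left behind by a decommissioned service."""
    stored_gb = GB_PER_DAY * days_running
    return stored_gb * LOG_STORAGE_RATE

# One service, one year of logs, then decommissioned: this charge recurs forever
print(f"${stored_log_cost(365):.2f}/month, indefinitely")  # $10.95/month, indefinitely
print(f"10-20 such services: ${stored_log_cost(365)*10:.0f}-${stored_log_cost(365)*20:.0f}/month")
```

Setting retentionInDays on every log group caps days_running and makes this number stop growing.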

Detection:

# Find log groups with no new events in 30+ days but significant stored data
aws logs describe-log-groups \
  --query 'logGroups[?storedBytes > `1000000000`].[logGroupName,storedBytes,retentionInDays]' \
  --output table

The 10-Minute Weekly Drift Audit

You do not need expensive tooling to catch 90% of cost drift. This workflow takes 10 minutes every Monday morning and catches the patterns above before they compound.

Step 1: Run Terraform Plan in CI (2 minutes)

Add a scheduled terraform plan to your CI pipeline that runs every Monday at 9 AM. This catches state drift (resources that exist in Terraform but have changed).

# .github/workflows/drift-detection.yml
name: Weekly Drift Detection
on:
  schedule:
    - cron: "0 9 * * 1"
jobs:
  drift-check:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
      - run: terraform init
      - id: plan
        run: terraform plan -detailed-exitcode
        continue-on-error: true
      - name: Notify on drift
        if: steps.plan.outcome == 'failure'
        env:
          SLACK_WEBHOOK: ${{ secrets.SLACK_WEBHOOK }}
        run: |
          curl -X POST "$SLACK_WEBHOOK" \
            -H 'Content-Type: application/json' \
            -d '{"text":"Terraform drift detected. Run terraform plan to review."}'

Step 2: Scan for Unmanaged Resources (3 minutes)

Use a script that compares resources in your AWS account against resources in your Terraform state.

#!/bin/bash
# quick-drift-scan.sh

echo "=== Unattached EBS Volumes ==="
aws ec2 describe-volumes --filters "Name=status,Values=available" \
  --query 'Volumes[].[VolumeId,Size,CreateTime]' --output table

echo "=== Unassociated Elastic IPs ==="
aws ec2 describe-addresses --query 'Addresses[?!AssociationId].[PublicIp,AllocationId]' --output table

echo "=== Running instances without terraform tag ==="
aws ec2 describe-instances \
  --filters "Name=instance-state-name,Values=running" \
  --query 'Reservations[].Instances[?!Tags[?Key==`ManagedBy` && Value==`terraform`]].[InstanceId,InstanceType,LaunchTime]' \
  --output table

echo "=== Target groups with no healthy targets ==="
for tg in $(aws elbv2 describe-target-groups --query 'TargetGroups[].TargetGroupArn' --output text); do
  healthy=$(aws elbv2 describe-target-health --target-group-arn "$tg" \
    --query 'length(TargetHealthDescriptions[?TargetHealth.State==`healthy`])' \
    --output text 2>/dev/null)
  if [ "${healthy:-0}" -eq 0 ]; then
    echo "No healthy targets: $tg"
  fi
done

Step 3: Review and Action (5 minutes)

For each finding:

  1. If the resource is needed: Import it into Terraform state (terraform import) and add it to your HCL.
  2. If the resource is not needed: Terminate/delete it immediately. Do not "plan to clean it up later." Later never comes.
  3. If you are not sure: Tag it with drift-review: 2026-05-15 (one week from now). If nobody claims it by then, delete it.
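Step 3's "tag it and wait a week" rule is easy to script. A sketch (the flag_for_review helper, its region default, and the tag name are illustrative, not part of the audit script above):

```python
from datetime import date, timedelta

def review_tag(days_from_now=7, today=None):
    """Build the drift-review tag: a claim-or-delete deadline N days out."""
    today = today or date.today()
    deadline = today + timedelta(days=days_from_now)
    return {'Key': 'drift-review', 'Value': deadline.isoformat()}

def flag_for_review(instance_id, region='us-east-1'):
    """Apply the drift-review tag to a resource nobody has claimed yet."""
    import boto3  # imported here so review_tag stays usable without AWS access
    ec2 = boto3.client('ec2', region_name=region)
    ec2.create_tags(Resources=[instance_id], Tags=[review_tag()])

# From a May 8 audit, the deadline lands one week out, matching the example above
print(review_tag(7, today=date(2026, 5, 8)))  # {'Key': 'drift-review', 'Value': '2026-05-15'}
```

A companion cleanup job (or the weekly audit) then deletes anything whose drift-review date has passed unclaimed.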

The Drift Prevention Framework

Detection is reactive. Prevention is cheaper. These five practices stop drift from accumulating in the first place.

Practice 1: Enforce Console Read-Only in Production

Use AWS Organizations SCPs to make production accounts read-only for every principal except the Terraform execution role. Engineers can still view resources in the console but cannot create, modify, or delete them without going through Terraform.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyConsoleWritesInProd",
      "Effect": "Deny",
      "Action": [
        "ec2:RunInstances",
        "ec2:CreateVolume",
        "rds:CreateDBInstance",
        "elasticloadbalancing:CreateLoadBalancer"
      ],
      "Resource": "*",
      "Condition": {
        "StringNotLike": {
          "aws:PrincipalArn": "arn:aws:iam::*:role/TerraformRole"
        }
      }
    }
  ]
}

This eliminates Pattern 1 (manual compute), Pattern 3 (manual LBs), and Pattern 6 (manual RDS scaling) entirely. The policy exempts only the Terraform execution role; every other principal is blocked, whether the request comes through the console, the CLI, or an SDK.

Practice 2: Tag Everything at Creation with Source

Add a mandatory ManagedBy tag to all resources. Use AWS Config rules to flag any resource created without this tag.

| Tag Key | Values | Purpose |
|---|---|---|
| ManagedBy | terraform, manual, auto-scaling | Identifies governance path |
| TerraformModule | module path | Links resource to code |
| ExpiresAt | ISO date | Auto-cleanup trigger |
| Owner | team email | Accountability |

Resources tagged ManagedBy: manual get flagged in weekly audits. Resources tagged with ExpiresAt get auto-terminated by a Lambda function when the date passes.
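The check a Config rule performs here is small enough to prototype directly. A hypothetical helper (REQUIRED_TAGS is an assumption; adjust it to your own policy):

```python
REQUIRED_TAGS = {'ManagedBy', 'Owner'}  # assumed minimum policy; extend as needed

def missing_tags(tags):
    """Return required tag keys absent from a resource's tag list."""
    present = {t['Key'] for t in tags}
    return sorted(REQUIRED_TAGS - present)

# A manually created resource missing its Owner tag gets flagged
print(missing_tags([{'Key': 'ManagedBy', 'Value': 'manual'}]))  # ['Owner']
```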

Practice 3: Auto-Delete Temporary Resources

Deploy a simple Lambda function that runs daily and terminates resources past their expiration:

import boto3
from datetime import datetime, timezone

def handler(event, context):
    ec2 = boto3.client('ec2')
    today = datetime.now(timezone.utc).date()

    # Paginate: accounts with many instances exceed one describe_instances page
    paginator = ec2.get_paginator('describe_instances')
    pages = paginator.paginate(Filters=[
        {'Name': 'tag-key', 'Values': ['ExpiresAt']},
        {'Name': 'instance-state-name', 'Values': ['running']}
    ])

    for page in pages:
        for reservation in page['Reservations']:
            for instance in reservation['Instances']:
                tags = {t['Key']: t['Value'] for t in instance.get('Tags', [])}
                try:
                    expires = datetime.strptime(tags['ExpiresAt'], '%Y-%m-%d').date()
                except ValueError:
                    continue  # skip malformed dates rather than guessing
                # Catch past-due instances too, not just exact-date matches,
                # so a missed daily run does not leave them alive forever
                if expires <= today:
                    ec2.terminate_instances(InstanceIds=[instance['InstanceId']])
                    print(f"Terminated expired instance: {instance['InstanceId']}")

This prevents temporary resources from becoming permanent cost leaks. Engineers must set an ExpiresAt tag when creating anything outside Terraform. If they forget, the weekly audit catches it.

Practice 4: Terraform State Reconciliation in CI

Run a state reconciliation step in your deploy pipeline that compares the expected resource count against the actual resource count:

# Count resources in state
STATE_COUNT=$(terraform state list | wc -l)

# Count resources in AWS (for the tagged resources)
AWS_COUNT=$(aws resourcegroupstaggingapi get-resources \
  --tag-filters Key=ManagedBy,Values=terraform \
  --query 'ResourceTagMappingList | length(@)')

DRIFT=$((AWS_COUNT - STATE_COUNT))
if [ $DRIFT -gt 5 ]; then
  echo "WARNING: $DRIFT resources in AWS not in Terraform state"
fi

Practice 5: Monthly Cost Attribution Review

Once per month, compare your Terraform-managed costs against your total account costs. The gap is your drift waste.

# Total account cost (last month; the End date is exclusive, so this covers all of April)
TOTAL=$(aws ce get-cost-and-usage \
  --time-period Start=2026-04-01,End=2026-05-01 \
  --granularity MONTHLY --metrics BlendedCost \
  --query 'ResultsByTime[0].Total.BlendedCost.Amount' --output text)

# Cost of tagged (terraform-managed) resources
MANAGED=$(aws ce get-cost-and-usage \
  --time-period Start=2026-04-01,End=2026-05-01 \
  --granularity MONTHLY --metrics BlendedCost \
  --filter '{"Tags":{"Key":"ManagedBy","Values":["terraform"]}}' \
  --query 'ResultsByTime[0].Total.BlendedCost.Amount' --output text)

# Amounts are decimal strings, so use bc rather than shell integer arithmetic
echo "Total: $TOTAL | Managed: $MANAGED | Gap (drift): $(echo "$TOTAL - $MANAGED" | bc)"

If the gap exceeds 10% of total spend, you have a drift problem worth investigating.
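The 10% rule, as a tiny helper (the dollar figures in the example are illustrative):

```python
def drift_gap_percent(total_cost, managed_cost):
    """Share of spend that no Terraform-managed resource accounts for."""
    return 100.0 * (total_cost - managed_cost) / total_cost

gap = drift_gap_percent(42_000, 35_700)
print(f"{gap:.0f}% of spend is unmanaged")  # 15% of spend is unmanaged
if gap > 10:
    print("Drift problem worth investigating")
```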


Drift Cost Calculator: What Is Your Environment Leaking?

Use this table to estimate your drift waste based on environment size:

| Environment Size | Typical Drift Resources | Estimated Annual Waste | Common Offenders |
|---|---|---|---|
| Small (10-50 resources) | 2-5 orphaned resources | $1,000-5,000/year | EBS volumes, EIPs, small instances |
| Medium (50-200 resources) | 5-15 orphaned resources | $5,000-20,000/year | + Load balancers, NAT gateways, log groups |
| Large (200-500 resources) | 15-40 orphaned resources | $20,000-80,000/year | + RDS instances, ECS services, S3 buckets |
| Enterprise (500+ resources) | 40-100+ orphaned resources | $50,000-200,000/year | + Cross-account drift, multi-region duplication |

The multiplier effect: Drift compounds over time. A single untracked resource costs X/month today. Without detection, similar resources accumulate at a rate of 1-3 per month. After 12 months, you are paying 12-36x that original resource cost.
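That multiplier is just an arithmetic series. A sketch with assumed numbers (2 new drift resources per month, $50/month each):

```python
def drift_run_rate(new_per_month, cost_each, month):
    """Monthly drift bill in a given month if nothing is ever cleaned up."""
    return new_per_month * cost_each * month

def cumulative_drift(new_per_month, cost_each, months):
    """Total spent on drift over the whole period (sum of an arithmetic series)."""
    return sum(drift_run_rate(new_per_month, cost_each, m) for m in range(1, months + 1))

print(f"Month-12 run-rate: ${drift_run_rate(2, 50, 12):,.0f}/month")  # $1,200/month
print(f"Year-one total:    ${cumulative_drift(2, 50, 12):,.0f}")      # $7,800
```

Note that the run-rate in month 12 is twelve times the cost of the first orphaned resource pair, which is the 12-36x effect described above.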


Tools for Drift Detection (Free and Paid)

| Tool | Cost | What It Detects | Best For |
|---|---|---|---|
| terraform plan (scheduled) | Free | State drift only | Teams already using Terraform |
| AWS Config Rules | ~$2/rule/month | Compliance drift, untagged resources | AWS-native governance |
| driftctl (open source) | Free | Unmanaged resources outside Terraform | Finding resources Terraform doesn't know about |
| Spacelift | $40/user/month | Drift + policy + cost | Teams needing full IaC governance |
| env0 | Custom pricing | Drift + cost estimation + policy | Enterprise IaC platforms |
| Firefly | Custom pricing | Full cloud-to-code mapping | Multi-cloud IaC governance |
| Snyk IaC | Free tier available | Drift + security misconfigs | Security-focused teams |
Our recommendation: Start with scheduled terraform plan + the bash script above (free, 10 minutes/week). If drift exceeds $10K/year or you have 200+ resources, invest in Driftctl or Spacelift for automated detection.


The ROI of Drift Detection

| Investment | Time/Cost | Expected Annual Savings | ROI |
|---|---|---|---|
| Weekly manual script (10 min/week) | 8.7 hours/year (~$1,300 eng time) | $8,000-40,000/year | 6-30x |
| driftctl (automated, open source) | 4 hours setup + 1 hr/month | $15,000-60,000/year | 10-40x |
| Spacelift ($40/user x 5 users) | $2,400/year | $30,000-100,000/year | 12-42x |

Even the simplest approach (a 10-minute weekly script) delivers 6-30x return on time invested. There is no scenario where drift detection does not pay for itself within the first month.


The Bottom Line

Terraform drift is the cloud cost equivalent of a slow water leak. It does not cause a flood. It does not trigger alarms. It just runs, bills, and compounds month after month until someone finally looks at the pipes.

The fix is not complex. A 10-minute weekly audit catches 90% of drift waste. An SCP policy preventing console writes prevents 60% of drift from occurring in the first place. Together, they save $8K-40K/year for a typical environment with zero risk and minimal effort.

If your cloud bill has been growing faster than your workload, drift is likely a contributing factor. Our cloud cost optimization team includes drift detection as part of every assessment, and we typically find $10K-50K in drift waste within the first week.

Start with the 10-minute audit script above. Run it next Monday. The results will surprise you.



Stop Overpaying for Cloud Infrastructure

Our clients save 30-60% on their cloud bill within 90 days. Get a free Cloud Waste Assessment and see exactly where your money is going.