Cloud Infrastructure

Infrastructure Cost Reduction: $20K Monthly Savings

Challenge: Infrastructure costs had been trending upward for several months, and a sharp spike triggered a company-wide cost review. Executive leadership needed the cost drivers identified quickly, with actionable recommendations and measurable results that did not affect service quality or development velocity.


Executive Summary

Identified and eliminated infrastructure waste through data-driven analysis, achieving $20K monthly savings ($240K annualized) without impacting performance or availability. Reduced AWS infrastructure costs by 25% and DataDog observability costs by 50% while maintaining 99.95%+ uptime for revenue-generating systems.

Key Results:

  • $20,000 monthly savings sustained ($240K annualized)
  • 25% AWS reduction maintaining 99.95%+ uptime
  • 50% DataDog reduction without degrading observability
  • 75% log volume reduction improving signal-to-noise

The Challenge

Context

Situation:

  • Infrastructure costs had been trending upward over several months
  • Sharp spike in most recent month triggered company-wide cost review
  • Executive leadership launched cost optimization initiative
  • All teams asked to identify and eliminate waste

Constraints:

  • Cannot impact system performance
  • Cannot reduce availability of revenue-generating systems
  • Must maintain observability and incident response capabilities
  • Changes must be low-risk and quickly reversible

Business Pressure:

  • Post-acquisition integration ongoing
  • Cost scrutiny from parent company
  • Need to demonstrate financial discipline
  • Every dollar saved = direct impact on profitability

The Approach

Data-Driven Analysis

AWS Infrastructure:

  • Used AWS Trusted Advisor Cost Optimization recommendations
  • Analyzed EC2, RDS, and storage utilization patterns
  • Identified oversized resources with low utilization
  • Prioritized low-risk, high-impact changes

DataDog Observability:

  • Analyzed log ingestion patterns
  • Investigated cost spike root causes
  • Reviewed log retention policies
  • Assessed value of different log types

Risk Assessment:

  • Categorized changes by risk level (low/medium/high)
  • Focused on low-risk optimizations first
  • Established rollback procedures
  • Defined monitoring metrics for validation
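
The risk-first ordering described above can be sketched in a few lines. This is a minimal illustration, not the tooling used in the project: the `Change` class, the backlog entries, and every dollar figure are hypothetical.

```python
from dataclasses import dataclass

# Lower number = lower risk; low-risk changes are scheduled first.
RISK_ORDER = {"low": 0, "medium": 1, "high": 2}


@dataclass(frozen=True)
class Change:
    name: str
    risk: str               # "low", "medium", or "high"
    monthly_savings: float  # estimated USD per month
    rollback: str           # how the change would be reversed


def prioritize(changes):
    """Order changes by ascending risk, then by descending estimated savings."""
    return sorted(changes, key=lambda c: (RISK_ORDER[c.risk], -c.monthly_savings))


# Hypothetical backlog; names and dollar figures are illustrative only.
backlog = [
    Change("Consolidate small databases", "high", 900.0, "restore from snapshot"),
    Change("Release unused Elastic IPs", "low", 50.0, "allocate a new address"),
    Change("Downsize idle EC2 instances", "medium", 2500.0, "resize to previous type"),
    Change("Delete orphaned EBS volumes", "low", 400.0, "restore from final snapshot"),
]

for change in prioritize(backlog):
    print(f"{change.risk:<6} ${change.monthly_savings:>8.2f}  {change.name}")
```

Keeping the rollback step on the same record as the change makes it harder to schedule work that has no documented way back.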

AWS Infrastructure Optimization

Analysis Process

Utilization Review:

  • EC2 instances averaging 20-30% CPU utilization
  • RDS instances with excessive provisioned IOPS
  • Orphaned EBS volumes from terminated instances
  • Old snapshots retained indefinitely
  • Elastic IPs not in use

Cost Attribution:

  • Tagged resources by team and application
  • Identified highest-cost components
  • Calculated cost per transaction for key services
  • Benchmarked against industry standards
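
As a rough sketch of tag-based attribution and unit cost, assuming resources have already been exported into plain dictionaries: the record layout, IDs, costs, and transaction count below are all hypothetical.

```python
from collections import defaultdict


def cost_by_tag(resources, tag_key):
    """Sum monthly cost per value of `tag_key`; untagged resources are grouped together."""
    totals = defaultdict(float)
    for res in resources:
        totals[res["tags"].get(tag_key, "untagged")] += res["monthly_cost"]
    return dict(totals)


def cost_per_transaction(monthly_cost, monthly_transactions):
    """Unit cost of a service; returns None when there was no traffic."""
    if monthly_transactions <= 0:
        return None
    return monthly_cost / monthly_transactions


# Hypothetical export of tagged resources with their monthly cost in USD.
resources = [
    {"id": "i-001", "tags": {"team": "payments", "application": "checkout"}, "monthly_cost": 1200.0},
    {"id": "db-001", "tags": {"team": "payments"}, "monthly_cost": 300.0},
    {"id": "vol-001", "tags": {}, "monthly_cost": 40.0},
]

by_team = cost_by_tag(resources, "team")
print(by_team)
print(cost_per_transaction(by_team["payments"], 3_000_000))
```

The "untagged" bucket is worth watching on its own: its size shows how much spend cannot yet be attributed to any team.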

Actions Taken

EC2 Right-Sizing:

  • Analyzed CloudWatch metrics for 30-day utilization
  • Identified instances with consistent low CPU/memory
  • Proposed downsizing plan to Cloud Engineering team
  • Executed changes during maintenance windows
  • Monitored for performance impact
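
One way to express "consistent low CPU" is to require both a low average and a low peak over the review window. The sketch below assumes the CloudWatch datapoints have already been fetched into lists; the thresholds and instance IDs are illustrative assumptions, and memory is left out for brevity.

```python
from statistics import mean


def rightsizing_candidates(cpu_samples, avg_threshold=30.0, peak_threshold=60.0):
    """Return instance IDs whose CPU stayed low across the whole window.

    cpu_samples maps instance ID -> list of CPU utilization percentages
    (for example, hourly averages over 30 days).
    """
    candidates = []
    for instance_id, samples in cpu_samples.items():
        if not samples:
            continue  # no data: do not guess
        if mean(samples) <= avg_threshold and max(samples) <= peak_threshold:
            candidates.append(instance_id)
    return sorted(candidates)


# Hypothetical utilization data (percent CPU).
cpu = {
    "i-aaa": [22.0, 25.0, 28.0, 31.0],  # low average, low peak
    "i-bbb": [10.0, 12.0, 85.0, 9.0],   # low average, but one high peak
    "i-ccc": [70.0, 75.0, 80.0],        # busy
    "i-ddd": [],                        # no metrics collected
}

print(rightsizing_candidates(cpu))
```

Checking the peak as well as the average keeps bursty instances such as `i-bbb` off the downsizing list even though their average looks idle.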

RDS Optimization:

  • Reviewed IOPS utilization vs. provisioned
  • Reduced provisioned IOPS on underutilized databases
  • Changed instance types for better price/performance
  • Consolidated small databases where appropriate
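
The provisioned-versus-utilized IOPS comparison can be sketched as a small calculation. The 30% headroom, the 1,000 IOPS floor, and the rounding step are assumptions chosen for illustration, not values from the project.

```python
def recommended_iops(peak_iops, headroom_pct=30, floor=1000, step=100):
    """Suggest provisioned IOPS: observed peak plus headroom, rounded up to `step`.

    Integer arithmetic is used so that rounding is exact.
    """
    target_x100 = peak_iops * (100 + headroom_pct)    # target scaled by 100
    rounded = -(-target_x100 // (100 * step)) * step  # ceiling division
    return max(floor, rounded)


def is_overprovisioned(provisioned, peak_iops, **kwargs):
    """True when the provisioned value exceeds the recommendation."""
    return provisioned > recommended_iops(peak_iops, **kwargs)


# Hypothetical database: 6,000 IOPS provisioned, 2,000 IOPS observed peak.
print(recommended_iops(2000))
print(is_overprovisioned(provisioned=6000, peak_iops=2000))
```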

Storage Cleanup:

  • Deleted orphaned EBS volumes (unattached for 90+ days)
  • Implemented snapshot lifecycle policies
  • Removed old AMIs no longer needed
  • Released unused Elastic IPs
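
The 90-day rule for orphaned volumes can be sketched as a filter over an inventory export. The `detached_at` field is an assumption: it stands for a timestamp recorded by whatever inventory process is in use, and the volume IDs and dates are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def orphaned_volumes(volumes, now, min_age_days=90):
    """Return IDs of unattached volumes that have been detached for at least min_age_days."""
    cutoff = now - timedelta(days=min_age_days)
    return sorted(
        v["volume_id"]
        for v in volumes
        if v["state"] == "available"         # "available" means not attached
        and v["detached_at"] is not None
        and v["detached_at"] <= cutoff
    )


# Hypothetical inventory; `detached_at` is an assumed field, see the note above.
now = datetime(2024, 6, 1, tzinfo=timezone.utc)
volumes = [
    {"volume_id": "vol-old", "state": "available",
     "detached_at": datetime(2024, 1, 15, tzinfo=timezone.utc)},
    {"volume_id": "vol-recent", "state": "available",
     "detached_at": datetime(2024, 5, 10, tzinfo=timezone.utc)},
    {"volume_id": "vol-attached", "state": "in-use", "detached_at": None},
]

print(orphaned_volumes(volumes, now))
```

Taking a final snapshot before deleting anything on this list keeps the cleanup reversible.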

Reserved Instance Review:

  • Analyzed instance types and usage patterns
  • Purchased Reserved Instances for predictable workloads
  • Converted some On-Demand to Spot where appropriate

Results: AWS Optimization

Cost Reduction:

  • 25% overall AWS cost reduction
  • $15K monthly savings sustained
  • $180K annualized savings

Performance:

  • 99.95%+ uptime maintained throughout optimization
  • No customer-facing performance impact
  • Latency metrics remained stable
  • Error rates unchanged

Operational Impact:

  • Reduced resource sprawl
  • Improved cost visibility through better tagging
  • Established ongoing optimization culture
  • Created baseline for future capacity planning

DataDog Cost Optimization

Root Cause Investigation

Cost Spike Discovery:

  • DataDog costs increased 80% in single month
  • Log ingestion volume tripled unexpectedly
  • Storage and processing costs spiked

Analysis:

  • Reviewed top log sources by volume
  • Identified Cloudflare WAF logs as primary culprit
  • Surge in blocked traffic (bot/scraper activity)
  • Low-value logs consuming significant resources
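
Ranking log sources by ingested volume is a simple aggregation. The sketch below works on (source, bytes) pairs from any usage export; the source names and sizes are hypothetical.

```python
from collections import Counter


def top_sources(records, n=3):
    """Return the n largest sources as (source, total_bytes, percent_of_total)."""
    totals = Counter()
    for source, size_bytes in records:
        totals[source] += size_bytes
    grand_total = sum(totals.values())
    if grand_total == 0:
        return []
    return [
        (source, size, round(100 * size / grand_total, 1))
        for source, size in totals.most_common(n)
    ]


# Hypothetical ingestion sample: (source name, bytes ingested).
sample = [
    ("cloudflare-waf", 700),
    ("app-api", 150),
    ("cloudflare-waf", 50),
    ("nginx", 80),
    ("cron", 20),
]

for source, size, share in top_sources(sample):
    print(f"{source:<16} {size:>6} bytes  {share:>5}%")
```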

Business Impact:

  • Observability budget exceeded significantly
  • Risk of losing visibility if costs unsustainable
  • Needed solution without degrading incident response

Optimization Strategy

Log Filtering:

  • Implemented filtering at the source in Cloudflare, before logs were shipped to DataDog
  • Added server-side sampling for high-volume logs
  • Updated DataDog agent configurations
  • Removed redundant logging across applications
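
Sampling high-volume logs is often done deterministically, so the same event always gets the same decision. The following is a generic sketch of hash-based sampling, not the configuration used here; the 10% rate and the source names are assumptions.

```python
import hashlib


def keep_log(event_id: str, source: str, sample_rates: dict) -> bool:
    """Decide whether to forward a log event.

    Sources missing from sample_rates are always kept. For sampled sources,
    the event ID is hashed into [0, 1) and kept when it falls below the rate,
    so repeated calls for the same event agree.
    """
    rate = sample_rates.get(source, 1.0)
    if rate >= 1.0:
        return True
    if rate <= 0.0:
        return False
    digest = hashlib.sha256(event_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate


# Hypothetical policy: keep roughly 10% of WAF events, everything else in full.
rates = {"cloudflare-waf": 0.1}

kept = sum(keep_log(f"evt-{i}", "cloudflare-waf", rates) for i in range(10_000))
print(f"kept {kept} of 10000 WAF events")
print(keep_log("evt-1", "app-api", rates))
```

Hashing rather than drawing a random number means retries and replays of the same event do not change what ends up in the index.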

Log Retention:

  • Reviewed retention policies by log type
  • Reduced retention for debug logs (7 days)
  • Maintained longer retention for security logs (30 days)
  • Archived critical logs to S3 (long-term storage)
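
A per-type retention policy can be written down as data. In this sketch the 7-day and 30-day values come from the list above; the default of 15 days and the choice of which log types are archived to S3 are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RetentionPolicy:
    indexed_days: int    # how long logs stay searchable
    archive_to_s3: bool  # whether a long-term copy is kept


POLICIES = {
    "debug": RetentionPolicy(indexed_days=7, archive_to_s3=False),
    # Assumption: security logs are treated as the "critical" logs to archive.
    "security": RetentionPolicy(indexed_days=30, archive_to_s3=True),
}
# Assumption: fallback for log types without an explicit policy.
DEFAULT_POLICY = RetentionPolicy(indexed_days=15, archive_to_s3=False)


def policy_for(log_type):
    """Look up the retention policy for a log type."""
    return POLICIES.get(log_type, DEFAULT_POLICY)


def is_indexed(log_type, age_days):
    """True while a log of this type and age is still within its indexed window."""
    return age_days <= policy_for(log_type).indexed_days


print(policy_for("debug"))
print(is_indexed("security", 21))
```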

Log Value Assessment:

  • Categorized logs by operational value
  • Eliminated logs never queried in 90 days
  • Consolidated duplicate logging
  • Improved signal-to-noise ratio
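
Finding logs that nobody has queried in 90 days reduces to a date comparison once the last-queried date for each source is known. How that date is collected is outside this sketch; the source names and dates are hypothetical.

```python
from datetime import date, timedelta


def unqueried_sources(last_queried, today, window_days=90):
    """Return sources never queried, or last queried before the window started."""
    cutoff = today - timedelta(days=window_days)
    return sorted(
        source
        for source, last in last_queried.items()
        if last is None or last < cutoff
    )


# Hypothetical audit data: source -> date of last query (None = never queried).
last_queried = {
    "app-api": date(2024, 5, 28),
    "legacy-batch": date(2023, 12, 1),
    "debug-trace": None,
}

print(unqueried_sources(last_queried, today=date(2024, 6, 1)))
```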

Results: DataDog Optimization

Cost Reduction:

  • 50% DataDog cost reduction
  • $5K monthly savings
  • $60K annualized savings

Log Volume:

  • 75% reduction in ingestion volume
  • Improved query performance (less data to search)
  • Faster dashboard load times
  • Better signal-to-noise ratio

Observability Maintained:

  • No degradation in incident response capabilities
  • All critical monitoring retained
  • Improved dashboard clarity (less noise)
  • Faster troubleshooting (better signal quality)

Combined Impact

Financial Results

Monthly Savings:

  • AWS: $15,000/month
  • DataDog: $5,000/month
  • Total: $20,000/month sustained

Annualized:

  • AWS: $180,000/year
  • DataDog: $60,000/year
  • Total: $240,000/year
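
The figures above can be cross-checked with a few lines of arithmetic. The implied baselines are derived values, not numbers reported in this case study, and they hold only if each percentage and dollar figure refers to the same pre-optimization monthly spend.

```python
def annualized(monthly_savings):
    """Twelve months of sustained monthly savings."""
    return monthly_savings * 12


def implied_baseline(monthly_savings, reduction_pct):
    """Monthly spend before the reduction, if savings = baseline * reduction_pct / 100."""
    return monthly_savings * 100 / reduction_pct


# (monthly savings in USD, reported percentage reduction)
savings = {"AWS": (15_000, 25), "DataDog": (5_000, 50)}

total_monthly = sum(s for s, _ in savings.values())
total_baseline = sum(implied_baseline(s, pct) for s, pct in savings.values())

print(annualized(total_monthly))                        # 240000
print(total_baseline)                                   # 70000.0
print(round(100 * total_monthly / total_baseline, 1))   # 28.6 (blended % reduction)
```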

ROI:

  • Time invested: ~40 hours (analysis + implementation)
  • Payback: Immediate (first month)
  • Ongoing: Minimal maintenance overhead (monthly cost reviews)
  • Return: $240K annualized from a one-time effort of ~40 hours

Operational Benefits

Cost Visibility:

  • Better tagging and attribution
  • Clearer understanding of cost drivers
  • Established baseline for future optimization
  • Regular cost review process implemented

Resource Efficiency:

  • Eliminated waste and sprawl
  • Right-sized infrastructure for actual load
  • Improved utilization metrics
  • Better capacity planning

Observability Quality:

  • Improved signal-to-noise ratio
  • Faster troubleshooting
  • Better dashboard performance
  • More focused alerting

Methodology & Best Practices

Data-Driven Decision Making

Measurement First:

  • Establish baseline metrics before changes
  • Use actual utilization data, not assumptions
  • Monitor impact of changes continuously
  • Validate savings with billing reports

Risk Management:

  • Start with low-risk, high-impact changes
  • Test changes in non-production first
  • Have rollback procedures ready
  • Monitor closely after implementation

Stakeholder Communication:

  • Share analysis findings transparently
  • Explain trade-offs clearly
  • Set realistic expectations
  • Provide regular progress updates

Optimization Principles

Right-Sizing Philosophy:

  • Size for actual load, not hypothetical peaks
  • Use auto-scaling for variable workloads
  • Reserve capacity for truly stable workloads
  • Review utilization quarterly

Log Management Strategy:

  • Log with purpose (not “just in case”)
  • Sample high-volume, low-value logs
  • Retain based on operational need
  • Archive for compliance, don’t stream

Continuous Optimization:

  • Make cost optimization ongoing, not one-time
  • Review costs monthly
  • Automate cleanup where possible
  • Build a culture of cost awareness

Lessons Learned

What Worked Well

AWS Trusted Advisor:

  • Provided actionable recommendations
  • Low-hanging fruit identified quickly
  • Validated by utilization data
  • Easy to prioritize by impact

Team Delegation:

  • Cloud Engineering team executed changes
  • Systems team handled database optimizations
  • Distributed work = faster completion
  • Ownership by experts = better outcomes

Quick Wins:

  • Early successes built momentum
  • Demonstrated ROI quickly
  • Gained executive support for deeper work

What I’d Do Differently

Earlier Action:

  • Should have reviewed costs proactively
  • Waited for the spike instead of preventing it
  • Quarterly cost reviews should be standard

More Automation:

  • Cleanup of orphaned resources was done manually
  • Should have automated snapshot policies earlier
  • Tagging enforcement should have been automated
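
Automated tagging enforcement could start as a simple compliance report. The required keys below follow the team and application tags mentioned earlier; the resource records are hypothetical.

```python
REQUIRED_TAGS = ("team", "application")


def missing_tags(resources, required=REQUIRED_TAGS):
    """Map resource ID -> list of required tag keys that are absent or empty."""
    report = {}
    for res in resources:
        absent = [key for key in required if not res["tags"].get(key)]
        if absent:
            report[res["id"]] = absent
    return report


# Hypothetical inventory export.
inventory = [
    {"id": "i-001", "tags": {"team": "payments", "application": "checkout"}},
    {"id": "db-001", "tags": {"team": "payments"}},
    {"id": "vol-001", "tags": {"team": "", "application": "search"}},
    {"id": "eip-001", "tags": {}},
]

for resource_id, absent in missing_tags(inventory).items():
    print(f"{resource_id}: missing {', '.join(absent)}")
```

Running a report like this on a schedule and sending the results to the owning teams is a low-risk first step before blocking untagged resources outright.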

Better Forecasting:

  • Cost spike surprised us
  • Better trending and alerting needed
  • Predictive analysis would have caught the spike earlier
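
Even a basic month-over-month check would flag a jump like the one described in this case study. The sketch below is a minimal illustration; the 20% threshold and the dollar amounts are assumptions, chosen so that the last month reproduces an 80% increase.

```python
def spike_alerts(monthly_costs, threshold_pct=20.0):
    """Flag months whose cost rose by at least threshold_pct over the previous month.

    monthly_costs is a chronological list of (label, cost) pairs.
    Returns a list of (label, percent_change) for flagged months.
    """
    alerts = []
    for (_, prev), (label, cur) in zip(monthly_costs, monthly_costs[1:]):
        if prev <= 0:
            continue  # cannot compute a percentage change
        change = (cur - prev) / prev * 100
        if change >= threshold_pct:
            alerts.append((label, round(change, 1)))
    return alerts


# Hypothetical billing history in USD.
history = [("Jan", 5000), ("Feb", 5250), ("Mar", 9450)]

print(spike_alerts(history))
```

A check on daily spend, rather than monthly totals, would raise the same alert weeks earlier.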

What This Demonstrates

For Cost Optimization Roles:

  • Data-driven analysis methodology
  • Low-risk, high-impact prioritization
  • $240K annualized savings delivered
  • Sustainable optimization culture established

For SRE / Infrastructure Roles:

  • Cloud cost optimization expertise
  • Observability best practices
  • Resource right-sizing experience
  • Performance-cost trade-off management

For Technical Leadership Roles:

  • Financial discipline and accountability
  • Stakeholder communication
  • Risk management
  • Execution through delegation

For FinOps / Cloud Governance Roles:

  • AWS cost optimization experience
  • SaaS spend management (DataDog)
  • Tagging and attribution strategy
  • Continuous optimization processes

Technologies & Tools

Cloud Platform:

  • AWS (EC2, RDS, S3, CloudWatch)
  • AWS Trusted Advisor
  • AWS Cost Explorer
  • AWS CloudWatch metrics

Observability:

  • DataDog (logs, metrics, APM)
  • CloudWatch (AWS native)
  • Custom dashboards

Analysis Tools:

  • Excel for cost modeling
  • PowerShell for automation
  • AWS CLI for bulk operations
  • DataDog API for analytics

Ongoing Impact

Sustained Savings

6+ Months Later:

  • Savings maintained fully
  • No performance degradation
  • No availability issues
  • Culture of cost awareness established

Process Changes:

  • Monthly cost reviews now standard
  • Automated cleanup policies in place
  • Better tagging compliance
  • Proactive optimization mindset

Contact

Need help identifying and eliminating infrastructure waste? Let’s discuss data-driven cost optimization strategies that protect performance while reducing spend.

Get in Touch: stevenleve.com/contact
LinkedIn: linkedin.com/in/steve-leve

