The Problem
When I took over the ALW platform infrastructure, the AWS bill was growing every month and nobody owned the question of what was driving it. We were serving 100,000+ concurrent users on Open edX in AWS me-south-1 (Bahrain), and the team had no cost visibility at all.
Here is exactly what I found and fixed. The result was a 60% reduction in monthly AWS spend.
AWS Cost Explorer — your first stop before any optimization work
Step 1 — Get Visibility First
Before cutting anything, I set up proper cost allocation tags across every resource:
- Environment (prod / staging / dev)
- Platform (alw / edly / ilmx)
- Team (devops / backend / frontend)
Without tags, you are flying blind. Enable Cost Explorer and immediately set up a monthly budget alert at an 80% threshold.
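A tagging policy only helps if something checks it. The sketch below is one hypothetical way to report resources that are missing any of the three tag keys listed above. It assumes tags arrive as a list of `{"Key": ..., "Value": ...}` dicts, the shape the EC2 API uses; the resource IDs and the `inventory` data are made up for illustration.

```python
# Tag keys taken from the list above; AWS tag keys are case sensitive.
REQUIRED_TAG_KEYS = ("Environment", "Platform", "Team")


def missing_tag_keys(tags):
    """Return the required keys that are absent or have an empty value."""
    present = {tag["Key"] for tag in tags if tag.get("Value")}
    return [key for key in REQUIRED_TAG_KEYS if key not in present]


def untagged_resources(resources):
    """Map resource ID -> missing keys, skipping fully tagged resources."""
    report = {}
    for resource_id, tags in resources.items():
        missing = missing_tag_keys(tags)
        if missing:
            report[resource_id] = missing
    return report


# Hypothetical inventory, e.g. flattened from a describe call.
inventory = {
    "i-0aaa": [
        {"Key": "Environment", "Value": "prod"},
        {"Key": "Platform", "Value": "alw"},
        {"Key": "Team", "Value": "devops"},
    ],
    "i-0bbb": [{"Key": "Environment", "Value": "dev"}],
}

print(untagged_resources(inventory))
```

User-defined tags also have to be activated as cost allocation tags in the Billing console before they show up as a dimension in Cost Explorer.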
Step 2 — EC2 Rightsizing
I used AWS Compute Optimizer to identify overprovisioned instances and found that most application servers were consistently running at under 20% CPU utilization.
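To make "consistently under 20%" concrete, here is a rough stand-in for that check, not what Compute Optimizer does internally (it weighs more signals than CPU). The function name, the 95% cutoff, and the sample values are assumptions for illustration; the samples could be average CPU datapoints exported from CloudWatch.

```python
def is_overprovisioned(cpu_samples, threshold_pct=20.0, min_fraction=0.95):
    """Flag an instance when nearly all CPU samples sit below the threshold.

    cpu_samples: CPU utilization percentages over the review window.
    min_fraction: share of samples that must be below threshold_pct
                  (assumed value, tune to your own risk tolerance).
    """
    if not cpu_samples:
        return False  # no data is not evidence of overprovisioning
    below = sum(1 for sample in cpu_samples if sample < threshold_pct)
    return below / len(cpu_samples) >= min_fraction


print(is_overprovisioned([5, 10, 12, 18]))   # every sample under 20%
print(is_overprovisioned([5, 10, 80, 90]))   # regular spikes, leave it alone
```

Using a fraction of samples rather than the plain average keeps a server with short but real peaks from being downsized by mistake.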
Actions taken:
- Downgraded m5.xlarge → m5.large for app servers (halving the per-instance compute cost)
- Switched dev/staging to t3.medium burstable instances
- Scheduled non-prod instances to stop at 8PM and restart at 8AM daily
Saving: ~35% of EC2 bill
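The non-prod schedule is easy to reason about in code. This is a minimal sketch of the decision logic only, assuming the 8AM to 8PM window above is in the team's local time; the actual stop and start calls (for example from a scheduled Lambda) are left out, and the function names are hypothetical.

```python
from datetime import datetime

START_HOUR = 8   # 8AM, instances come up
STOP_HOUR = 20   # 8PM, instances are stopped


def should_be_running(now):
    """True while the given local time falls inside the working window."""
    return START_HOUR <= now.hour < STOP_HOUR


def fraction_of_hours_saved():
    """Share of instance-hours no longer billed under a daily schedule."""
    return 1 - (STOP_HOUR - START_HOUR) / 24


print(should_be_running(datetime(2024, 1, 1, 9, 0)))
print(should_be_running(datetime(2024, 1, 1, 20, 0)))
print(fraction_of_hours_saved())
```

A 12-hour daily window removes half of the instance-hours for non-prod. Attached EBS volumes are still billed while an instance is stopped, so the saving on the total bill is smaller than that.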
CloudWatch metrics revealing consistently low CPU utilization on overprovisioned instances
Step 3 — Reserved Instances for Production
Production workloads run 24/7 with predictable load, so I switched all production EC2 and RDS from On-Demand to 1-year Reserved Instances.
Saving: ~40% on reserved resources
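Why this only makes sense for 24/7 workloads comes down to simple arithmetic: a reservation is paid for every hour of the term, used or not. The sketch below works through the break-even point, assuming the ~40% discount above; the hourly rate is a placeholder, not a real price.

```python
HOURS_PER_YEAR = 8760


def annual_on_demand_cost(hourly_rate, utilization):
    """On-Demand is billed only for the hours the instance actually runs."""
    return hourly_rate * HOURS_PER_YEAR * utilization


def annual_reserved_cost(hourly_rate, discount):
    """A reservation is billed for the whole term at the discounted rate."""
    return hourly_rate * HOURS_PER_YEAR * (1 - discount)


def break_even_utilization(discount):
    """Utilization above which the reservation is the cheaper option."""
    return 1 - discount


rate = 0.10  # placeholder hourly On-Demand rate in USD
print(break_even_utilization(0.40))
print(annual_on_demand_cost(rate, 1.0), annual_reserved_cost(rate, 0.40))
```

With a 40% discount, an instance has to run more than about 60% of the year before reserving it pays off. That holds for production, but not for dev/staging boxes that are stopped overnight.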
Step 4 — CloudFront for Static Assets
Open edX serves a huge amount of static content: course videos, images, JS/CSS. We were serving all of it directly from S3 with no CDN and paying full data transfer costs.
I set up a CloudFront distribution in front of S3:
- Cache TTL set to 1 year for versioned assets
- Saudi Arabia and Pakistan edge locations reduced latency by 60%
- Data transfer costs dropped dramatically
Saving: ~25% of data transfer costs
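A 1-year TTL is only safe for assets whose URL changes when their content changes. One way to express that rule is to pick the `Cache-Control` header per object at upload time, as in this sketch. The file-naming pattern (a hex content hash before the extension) and the 5-minute fallback are assumptions, not details from the ALW setup.

```python
import re

ONE_YEAR_SECONDS = 365 * 24 * 60 * 60

# Assumed naming scheme: "app.3f2a1b9c.js", i.e. 8+ hex chars before the extension.
VERSIONED_ASSET = re.compile(r"\.[0-9a-f]{8,}\.[a-z0-9]+$")


def cache_control_for(key):
    """Choose a Cache-Control header for an S3 object key."""
    if VERSIONED_ASSET.search(key):
        # Content never changes under this name, so cache it for a year.
        return f"public, max-age={ONE_YEAR_SECONDS}, immutable"
    # Unversioned files may change in place; keep the TTL short (assumed value).
    return "public, max-age=300"


print(cache_control_for("static/js/app.3f2a1b9c.js"))
print(cache_control_for("static/js/app.js"))
```

CloudFront respects the origin's `Cache-Control` header within the minimum and maximum TTLs configured on the distribution, so setting it on the S3 objects keeps the caching policy next to the assets themselves.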
Step 5 — S3 Lifecycle Policies
I found 2 TB+ of old course backups, log files, and unused media sitting in S3 Standard storage.
Applied lifecycle rules:
- Move to S3 Standard-IA (Infrequent Access) after 30 days
- Move to Glacier after 90 days
- Delete logs older than 1 year
Saving: ~40% of S3 bill
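The three rules above map onto an S3 lifecycle configuration fairly directly. The sketch below builds the dictionary in the shape that boto3's `put_bucket_lifecycle_configuration` accepts as `LifecycleConfiguration`. The rule IDs and the `logs/` prefix are assumptions, as is the choice to apply tiering to the whole bucket; the post does not say how the rules were scoped.

```python
def build_lifecycle_configuration(log_prefix="logs/"):
    """Return lifecycle rules matching the 30 / 90 / 365 day policy above."""
    return {
        "Rules": [
            {
                "ID": "tier-down-old-objects",  # hypothetical rule name
                "Filter": {"Prefix": ""},       # empty prefix = whole bucket
                "Status": "Enabled",
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},
                    {"Days": 90, "StorageClass": "GLACIER"},
                ],
            },
            {
                "ID": "expire-old-logs",        # hypothetical rule name
                "Filter": {"Prefix": log_prefix},
                "Status": "Enabled",
                "Expiration": {"Days": 365},
            },
        ]
    }


config = build_lifecycle_configuration()
for rule in config["Rules"]:
    print(rule["ID"], rule.get("Transitions") or rule.get("Expiration"))

# Applying it would look roughly like this (bucket name is a placeholder):
#   boto3.client("s3").put_bucket_lifecycle_configuration(
#       Bucket="my-bucket", LifecycleConfiguration=config)
```

Keeping the expiry in its own prefix-scoped rule means only log objects are ever deleted, while backups and media just move to cheaper storage classes.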
Step 6 — Kill Hidden Resources
I ran a full audit with AWS Trusted Advisor and found:
- 12 unattached EBS volumes still being charged
- 8 unused Elastic IPs
- 3 idle Load Balancers with zero traffic
- An old NAT Gateway in a dev VPC that nobody was using
Deleted or released all of them immediately.
Saving: $200+/month in hidden waste
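Two of these checks are simple enough to script between Trusted Advisor runs. The sketch below filters data shaped like the EC2 `describe_volumes` and `describe_addresses` responses: an unattached volume reports the state `available`, and an unused Elastic IP has no `AssociationId`. The sample records are invented, and this is an illustration rather than the audit that was actually run.

```python
def unattached_volume_ids(volumes):
    """Volumes in the 'available' state are not attached to any instance."""
    return [v["VolumeId"] for v in volumes if v.get("State") == "available"]


def unassociated_allocation_ids(addresses):
    """Elastic IPs without an AssociationId are allocated but unused."""
    return [a["AllocationId"] for a in addresses if "AssociationId" not in a]


# Invented sample data in the shape of the EC2 API responses.
volumes = [
    {"VolumeId": "vol-0aaa", "State": "in-use"},
    {"VolumeId": "vol-0bbb", "State": "available"},
]
addresses = [
    {"AllocationId": "eipalloc-0aaa", "AssociationId": "eipassoc-0aaa"},
    {"AllocationId": "eipalloc-0bbb"},
]

print(unattached_volume_ids(volumes))
print(unassociated_allocation_ids(addresses))
```

It is worth snapshotting a volume and confirming ownership before deleting it; the filter only tells you it is detached, not that nobody needs the data.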
Final Result
| Category | Before | After | Saving |
|---|---|---|---|
| EC2 Compute | $X | $X | 45% |
| RDS Database | $X | $X | 38% |
| S3 Storage | $X | $X | 40% |
| Data Transfer | $X | $X | 55% |
| Hidden Resources | $X | $0 | 100% |
| Total | $X | $X | 60% |
Key Takeaways
- Tag everything before you optimize anything
- Rightsizing gives the fastest win with the lowest risk
- Reserved Instances are worth it for stable production workloads
- CloudFront pays for itself on any platform with heavy static assets
- Run a hidden resource audit every quarter — waste accumulates silently