Introduction
On March 1, 2026, AWS confirmed that drone strikes had damaged its Middle East data centers in the UAE (me-central-1) and Bahrain (me-south-1), causing one of the largest cloud outages in AWS history.
We were running a 100K-user Open edX platform in me-south-1. This is the story of how we migrated to us-east-1 in 48 hours.
[Image: AWS Middle East region — me-south-1 (Bahrain) was our primary production region]
What Failed
At 03:14 UTC, all EC2 instances in me-south-1 became unreachable. RDS went into a failover loop. S3 cross-region replication to us-east-1 was our only saving grace — course content and media were already mirrored.
The platform was completely down for 6 hours before we made the call: do not wait for AWS, migrate now.
Migration Steps
Hour 0–6: Assessment
- Confirmed S3 data was intact in us-east-1 (see the commands after this list)
- Took latest RDS snapshot (from 2 hours before outage)
- Inventoried all services: EC2, RDS, ElastiCache, ALB, Route 53
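The first two checks boil down to a couple of CLI calls. A rough sketch of what this looked like for us (the bucket name below is illustrative, not our real one; the snapshot lookup only worked because the me-south-1 control plane was still answering API calls):

# Confirm the replicated course content actually landed in us-east-1
aws s3 ls s3://platform-course-content --recursive --summarize | tail -n 2

# Find the newest automated snapshot of the production database
aws rds describe-db-snapshots --region me-south-1 \
    --db-instance-identifier platform-prod --snapshot-type automated \
    --query 'reverse(sort_by(DBSnapshots,&SnapshotCreateTime))[0].[DBSnapshotIdentifier,SnapshotCreateTime]'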
Hour 6–24: Rebuild in us-east-1
Spun up a new VPC with matching CIDR blocks, created new EC2 instances from our Ansible playbooks, and restored RDS from the snapshot (after first copying it cross-region, as shown below).
# Snapshots are regional: copy ours from me-south-1 to us-east-1 first
# (account ID is a placeholder; encrypted snapshots also need --kms-key-id)
aws rds copy-db-snapshot --region us-east-1 --source-region me-south-1 \
    --source-db-snapshot-identifier arn:aws:rds:me-south-1:123456789012:snapshot:rds:platform-prod-2026-03-01-01-00 \
    --target-db-snapshot-identifier platform-prod-2026-03-01-copy
aws rds wait db-snapshot-available --region us-east-1 \
    --db-snapshot-identifier platform-prod-2026-03-01-copy
# Restore the copied snapshot as the new production DB instance
aws rds restore-db-instance-from-db-snapshot \
    --db-instance-identifier platform-prod-restored \
    --db-snapshot-identifier platform-prod-2026-03-01-copy \
    --db-instance-class db.r6g.xlarge --region us-east-1
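Worth knowing if you ever have to do this: the restored instance comes up on a brand-new endpoint hostname, so whatever carries the database address into the app config (in our setup, a variable in those same Ansible playbooks) has to be repointed before the platform comes up.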
Hour 24–48: DNS Cutover
Registered the new instances behind a fresh ALB, updated the Route 53 health checks, dropped the record TTL to 60 seconds, waited for the old TTL to expire, then flipped DNS.
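The flip itself is a single API call. A minimal sketch, with a placeholder hosted zone ID, record name, and ALB hostname (ours differ):

# Repoint the learner-facing hostname at the new us-east-1 ALB
aws route53 change-resource-record-sets \
    --hosted-zone-id Z0000000000000 \
    --change-batch '{
      "Changes": [{
        "Action": "UPSERT",
        "ResourceRecordSet": {
          "Name": "learn.example.com",
          "Type": "CNAME",
          "TTL": 60,
          "ResourceRecords": [{"Value": "platform-prod-1234.us-east-1.elb.amazonaws.com"}]
        }
      }]
    }'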
[Image: Running emergency migration scripts at 2AM during the outage]
What We Learned
- S3 cross-region replication is non-negotiable. It saved us. Everything else could be rebuilt. Data cannot. (There is a setup sketch after this list.)
- Ansible playbooks for everything. We rebuilt 12 EC2 instances in 4 hours because every config was in code.
- Cold standby is not enough. We now run a warm standby in us-east-1: scaled down, but ready to go live in 30 minutes.
- RTO vs RPO. Our RPO (how much data we could lose) was 2 hours, the age of the last snapshot. Acceptable. Our RTO (how long recovery took) was 42 hours. Not acceptable. Now targeting 4 hours.
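On the first lesson: turning on replication is roughly two calls per bucket. A sketch with placeholder bucket names and IAM role (cross-region replication also requires versioning on both the source and destination buckets):

# Replication requires versioning on the source bucket
aws s3api put-bucket-versioning --bucket platform-course-content \
    --versioning-configuration Status=Enabled

# Mirror everything into the us-east-1 replica bucket
aws s3api put-bucket-replication --bucket platform-course-content \
    --replication-configuration '{
      "Role": "arn:aws:iam::123456789012:role/s3-crr-role",
      "Rules": [{
        "ID": "mirror-all-to-us-east-1",
        "Status": "Enabled",
        "Priority": 1,
        "Filter": {},
        "DeleteMarkerReplication": {"Status": "Disabled"},
        "Destination": {"Bucket": "arn:aws:s3:::platform-course-content-replica"}
      }]
    }'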
[Image: Multi-region failover in action — what we wished we had set up earlier]
Current DR Setup
After this incident we implemented:
- Warm standby in us-east-1 (always running, scaled down)
- RDS Multi-AZ with cross-region read replica
- Route 53 health checks with automatic failover (see the sketch after this list)
- S3 cross-region replication on every bucket
- Monthly DR drills, with a real cutover to the standby and back every quarter
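For the health-check-plus-failover piece, the shape of the setup is roughly the sketch below, assuming production back in the primary region with the standby in us-east-1. The zone ID, health check ID, and hostnames are placeholders, and a matching SECONDARY record pointing at the standby ALB completes the pair; /heartbeat is Open edX's built-in health endpoint.

# Health check probing the primary region's Open edX heartbeat endpoint
aws route53 create-health-check --caller-reference failover-hc-001 \
    --health-check-config '{"Type": "HTTPS", "FullyQualifiedDomainName": "primary.learn.example.com",
        "Port": 443, "ResourcePath": "/heartbeat", "RequestInterval": 30, "FailureThreshold": 3}'

# PRIMARY failover record tied to that check; once it goes unhealthy,
# Route 53 starts answering with the SECONDARY (standby) record instead
aws route53 change-resource-record-sets --hosted-zone-id Z0000000000000 \
    --change-batch '{
      "Changes": [{
        "Action": "UPSERT",
        "ResourceRecordSet": {
          "Name": "learn.example.com", "Type": "CNAME", "TTL": 60,
          "SetIdentifier": "primary", "Failover": "PRIMARY",
          "HealthCheckId": "11111111-2222-3333-4444-555555555555",
          "ResourceRecords": [{"Value": "primary-alb-1234.me-south-1.elb.amazonaws.com"}]
        }
      }]
    }'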