
How to Deploy AI Models on AWS EC2 with GPU (Step-by-Step Cost Guide)

Why Run AI on EC2 GPU Instances

AWS Bedrock handles most AI workloads without infrastructure management. But some scenarios require your own GPU instance:

  • Running open-source models not available on Bedrock
  • Fine-tuning models on proprietary data
  • Serving models with custom inference pipelines
  • Latency-sensitive applications needing dedicated capacity
  • Cost optimization for high-volume inference (24/7 workloads)

The challenge is that AWS GPU instances are confusing. There are 6+ instance families, prices range from $0.50 to $32 per hour, and picking the wrong one wastes thousands of dollars per month.

This guide cuts through the confusion with real benchmarks and cost analysis.

Choosing the right EC2 GPU instance for AI saves hundreds of dollars per month

GPU Instance Families Explained

For AI Inference (Running Models)

| Instance | GPU | VRAM | vCPUs | RAM | On-Demand $/hr | Best For |
|---|---|---|---|---|---|---|
| g5.xlarge | A10G (1x) | 24GB | 4 | 16GB | $1.006 | Small models (7B-13B) |
| g5.2xlarge | A10G (1x) | 24GB | 8 | 32GB | $1.212 | Medium models, more CPU |
| g5.4xlarge | A10G (1x) | 24GB | 16 | 64GB | $1.624 | CPU-heavy preprocessing |
| g5.12xlarge | A10G (4x) | 96GB | 48 | 192GB | $5.672 | Large models (70B) |
| g6.xlarge | L4 (1x) | 24GB | 4 | 16GB | $0.805 | Best value for inference |
| inf2.xlarge | Inferentia2 | 32GB | 4 | 16GB | $0.758 | AWS-optimized inference |

For AI Training (Fine-Tuning)

| Instance | GPU | VRAM | On-Demand $/hr | Best For |
|---|---|---|---|---|
| p4d.24xlarge | A100 (8x) | 640GB | $32.77 | Full fine-tuning |
| p5.48xlarge | H100 (8x) | 640GB | $98.32 | Large-scale training |
| trn1.2xlarge | Trainium (1x) | 32GB | $1.34 | AWS-optimized training |

Rule of thumb: Use g5/g6 for inference, p4d/p5 for training, inf2/trn1 for AWS-optimized workloads.

Step 1: Choose Your Instance

Match the model to the GPU VRAM:

| Model Size | VRAM Needed (FP16) | VRAM Needed (Q4) | Recommended Instance |
|---|---|---|---|
| 7B params | 14GB | 4GB | g6.xlarge ($0.80/hr) |
| 13B params | 26GB | 8GB | g5.xlarge ($1.00/hr) |
| 34B params | 68GB | 20GB | g5.xlarge ($1.00/hr), Q4 only |
| 70B params | 140GB | 40GB | g5.12xlarge ($5.67/hr) |

Q4 quantization is the key. A 70B model in full precision needs 140GB VRAM (multiple GPUs). In Q4 quantization, it fits in 40GB — a single g5.12xlarge. Quality loss is minimal for inference.
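The table above follows a simple rule of thumb you can sketch as a quick calculation. The 2 and ~0.57 bytes-per-parameter figures are approximations for FP16 and Q4_K_M weights only, ignoring KV-cache and runtime overhead:

```shell
# Rough VRAM needed for model weights: params (billions) x bytes per parameter
# FP16 ~ 2 bytes/param, Q4_K_M ~ 0.57 bytes/param (approximate, weights only)
estimate_vram_gb() {
  awk -v params="$1" -v bytes="$2" 'BEGIN { printf "%.0f\n", params * bytes }'
}

estimate_vram_gb 70 2     # FP16: 140 GB -> needs multiple GPUs
estimate_vram_gb 70 0.57  # Q4:   ~40 GB -> fits a single g5.12xlarge
```

Leave headroom on top of the estimate for the KV cache, which grows with context length and batch size.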

Step 2: Launch the Instance

Using AWS CLI

# Launch g5.xlarge with Deep Learning AMI
aws ec2 run-instances \
  --image-id ami-0123456789abcdef0 \
  --instance-type g5.xlarge \
  --key-name your-key \
  --security-group-ids sg-0123456789abcdef0 \
  --subnet-id subnet-0123456789abcdef0 \
  --block-device-mappings '[{"DeviceName":"/dev/sda1","Ebs":{"VolumeSize":200,"VolumeType":"gp3"}}]' \
  --tag-specifications 'ResourceType=instance,Tags=[{Key=Name,Value=ai-inference},{Key=Environment,Value=dev}]'

Use the AWS Deep Learning AMI — it comes with NVIDIA drivers, CUDA, cuDNN, and Docker pre-installed. Saves hours of setup.
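The AMI ID differs per region and is updated regularly, so look it up rather than hard-coding it. A lookup along these lines works (the name filter matches the PyTorch flavor of the Deep Learning AMI; adjust it if you want a different variant):

```shell
# Find the newest Deep Learning AMI (GPU PyTorch, Ubuntu 22.04) in the current region
aws ec2 describe-images \
  --owners amazon \
  --filters 'Name=name,Values=Deep Learning AMI GPU PyTorch * (Ubuntu 22.04) *' \
  --query 'sort_by(Images, &CreationDate)[-1].[ImageId,Name]' \
  --output text
```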

Using Terraform

resource "aws_instance" "ai_inference" {
  ami           = data.aws_ami.deep_learning.id
  instance_type = "g5.xlarge"
  key_name      = var.key_name
  subnet_id     = var.private_subnet_id

  vpc_security_group_ids = [aws_security_group.ai_inference.id]

  root_block_device {
    volume_size = 200
    volume_type = "gp3"
    throughput  = 250
  }

  tags = {
    Name        = "${var.project}-ai-inference"
    Environment = var.environment
    ManagedBy   = "terraform"
  }
}

data "aws_ami" "deep_learning" {
  most_recent = true
  owners      = ["amazon"]

  filter {
    name   = "name"
    values = ["Deep Learning AMI GPU PyTorch * (Ubuntu 22.04) *"]
  }
}

Step 3: Install Ollama and Load Models

SSH into the instance:

ssh -i your-key.pem ubuntu@<instance-ip>

# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Verify GPU is detected
nvidia-smi
ollama run --verbose llama3.1:8b "test"

You should see GPU utilization in nvidia-smi while the model runs.

Pull the models you need:

# For inference
ollama pull llama3.1:8b
ollama pull deepseek-r1:14b
ollama pull codellama:13b

# For heavy workloads (g5.12xlarge only)
ollama pull llama3.1:70b-instruct-q4_K_M

Step 4: Expose the API Securely

Never expose Ollama directly to the internet. Use an ALB or Nginx reverse proxy.

# /etc/nginx/sites-available/ollama
server {
    listen 443 ssl;
    server_name ai.internal.yourdomain.com;

    ssl_certificate /etc/ssl/certs/ai.pem;
    ssl_certificate_key /etc/ssl/private/ai.key;

    # Basic auth for API access
    auth_basic "AI API";
    auth_basic_user_file /etc/nginx/.htpasswd;

    location / {
        proxy_pass http://127.0.0.1:11434;
        proxy_set_header Host $host;
        proxy_read_timeout 300s;  # LLM responses can be slow
    }
}

For internal-only access, use a private subnet with no public IP and access via VPN or SSM Session Manager.
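With the proxy in place, clients call the standard Ollama REST API over HTTPS with basic auth. The hostname and credentials below are placeholders matching the Nginx config above:

```shell
# Generate a completion through the authenticated proxy
# (apiuser:secret and the hostname are placeholders for your own values)
curl -u apiuser:secret https://ai.internal.yourdomain.com/api/generate \
  -d '{"model": "llama3.1:8b", "prompt": "Summarize EC2 GPU pricing", "stream": false}'
```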

Monitor GPU utilization — an underutilized GPU means you are overpaying for your instance

Step 5: Cost Optimization

Use Spot Instances (Save 60-70%)

For non-critical inference (development, testing, batch processing):

aws ec2 request-spot-instances \
  --instance-count 1 \
  --type "persistent" \
  --launch-specification '{
    "ImageId": "ami-0123456789abcdef0",
    "InstanceType": "g5.xlarge",
    "KeyName": "your-key",
    "SecurityGroupIds": ["sg-xxx"],
    "SubnetId": "subnet-xxx"
  }'

Spot pricing for g5.xlarge fluctuates between $0.30-0.45/hr (vs $1.006 on-demand). That is $220-330/month instead of $734.
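The monthly figures in this guide assume roughly 730 hours per month; a one-liner reproduces them from any hourly rate:

```shell
# Hourly rate -> approximate monthly cost, assuming 730 hours/month
monthly_cost() {
  awk -v rate="$1" 'BEGIN { printf "%.0f\n", rate * 730 }'
}

monthly_cost 1.006  # on-demand g5.xlarge: ~$734
monthly_cost 0.30   # low-end spot:        ~$219
```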

Warning: Spot instances can be interrupted with 2 minutes notice. Do not use for production inference serving real-time traffic.

Use Reserved Instances or Savings Plans (Save 36-62%)

For production workloads running 24/7:

| Term | g5.xlarge Monthly | Savings vs On-Demand |
|---|---|---|
| On-demand | $734 | - |
| 1-year RI (no upfront) | $467 | 36% |
| 1-year RI (all upfront) | $430 | 41% |
| 3-year RI (all upfront) | $277 | 62% |

A 1-year reserved g5.xlarge ($430-467/month) costs less than even an on-demand g6.xlarge ($588/month).

Schedule Start/Stop for Dev Instances

# Stop every night at 8 PM
aws ec2 stop-instances --instance-ids i-0123456789abcdef0

# Start every morning at 8 AM
aws ec2 start-instances --instance-ids i-0123456789abcdef0

Automate with AWS Instance Scheduler or a simple Lambda + EventBridge rule. Running a dev GPU instance only during business hours (12 hours/day, weekdays) cuts costs by 64%.
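The 64% figure falls straight out of the hours: 12 hours across 5 weekdays is 60 of the week's 168 hours:

```shell
# Savings from running a dev instance 12h/day on weekdays vs 24/7
awk 'BEGIN { printf "%.0f%%\n", (1 - (12 * 5) / (24 * 7)) * 100 }'
```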

Right-Size Your Instance

Monitor GPU utilization:

nvidia-smi dmon -s u -d 60

If GPU utilization is consistently below 30%, you are overprovisioned. Move down:

  • g5.xlarge (24GB) to g6.xlarge (24GB) — same VRAM, 20% cheaper
  • g5.12xlarge (4x GPU) to g5.xlarge (1x GPU) — if model fits in 24GB with quantization
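A small helper makes the check concrete: it averages a stream of utilization percentages, such as the CSV column `nvidia-smi` can emit (the live pipeline in the comment assumes `nvidia-smi` is on the instance):

```shell
# Average a stream of per-sample GPU utilization percentages (one number per line)
avg_gpu_util() {
  awk '{ sum += $1; n++ } END { if (n) printf "%.0f\n", sum / n }'
}

# On the instance (assumes nvidia-smi is available):
# nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader,nounits | avg_gpu_util
```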

Use Inferentia2 for High-Volume Inference

AWS Inferentia2 chips (inf2 instances) are purpose-built for inference and roughly 25% cheaper on list price than the comparable g5.xlarge for supported models:

# inf2.xlarge: $0.758/hr vs g5.xlarge: $1.006/hr
# 25% cheaper with similar performance for supported models

The limitation: models must be compiled for Inferentia using the Neuron SDK. Supported models include Llama, GPT-NeoX, and BERT variants. Not all HuggingFace models work out of the box.

Monthly Cost Summary

For running Llama 3.1 8B in production (24/7):

| Strategy | Instance | Monthly Cost |
|---|---|---|
| On-demand | g5.xlarge | $734 |
| On-demand | g6.xlarge | $588 |
| Spot | g5.xlarge | $220-330 |
| Reserved 1yr | g5.xlarge | $430 |
| Reserved 1yr | g6.xlarge | $350 |
| Inferentia2 | inf2.xlarge | $553 |
| Business hours only | g5.xlarge | $264 |
| Bedrock (comparison) | Llama via Bedrock | $13-53 |

Key insight: For light-to-medium usage, Bedrock is cheaper than running your own GPU instance. EC2 GPU instances make sense only for high-volume 24/7 workloads, custom models, or fine-tuning scenarios where Bedrock is not an option.
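One way to locate that crossover is a breakeven calculation: divide the fixed monthly instance cost by a per-million-token price. The $0.30/1M figure below is purely illustrative, not a quoted Bedrock rate:

```shell
# Millions of tokens per month at which a fixed instance beats per-token pricing
breakeven_mtokens() {  # $1 = instance $/month, $2 = $ per 1M tokens (illustrative)
  awk -v fixed="$1" -v per="$2" 'BEGIN { printf "%.0f\n", fixed / per }'
}

breakeven_mtokens 350 0.30  # reserved g6.xlarge vs an illustrative $0.30/1M rate
```

Below the breakeven volume, pay-per-token pricing wins; above it, the fixed instance does.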

Monitoring and Alerts

Set up CloudWatch alarms for GPU instances:

# Alert if GPU utilization drops below 10% (wasting money)
aws cloudwatch put-metric-alarm \
  --alarm-name "ai-gpu-underutilized" \
  --namespace "CWAgent" \
  --metric-name "nvidia_smi_utilization_gpu" \
  --statistic Average \
  --period 3600 \
  --threshold 10 \
  --comparison-operator LessThanThreshold \
  --evaluation-periods 6 \
  --alarm-actions arn:aws:sns:us-east-1:123456789:alerts

Install the CloudWatch agent with GPU metrics:

sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl \
  -a fetch-config -m ec2 -s -c file:cw-agent-config.json

Key Takeaways

  • Match model size to GPU VRAM — Q4 quantization lets you run larger models on smaller GPUs
  • g6.xlarge is the best value for single-GPU inference in 2026
  • Spot instances save 60-70% for non-critical workloads
  • Schedule start/stop for dev instances — 64% savings for business-hours-only
  • Monitor GPU utilization — below 30% means you are overpaying
  • For light-to-medium usage, Bedrock is cheaper than maintaining your own GPU instance
  • EC2 GPU is worth it for 24/7 high-volume inference, custom models, or fine-tuning
  • Use the Deep Learning AMI — saves hours of driver and CUDA setup

FAQ

Do I need a GPU to run AI models on EC2?

Not always. CPU inference with quantized models works for light usage. A c6i.2xlarge (8 vCPU, 16GB) runs Llama 3.1 8B Q4 at ~5 tokens/sec for $0.34/hr. But for production inference with acceptable latency, GPU instances are necessary.

Which AWS region has the best GPU availability?

us-east-1 (Virginia) and us-west-2 (Oregon) have the most GPU capacity. Newer regions and smaller regions frequently have capacity shortages for g5 and p4d instances. If you cannot launch a GPU instance, try a different AZ or region.

Can I use EC2 GPU instances for fine-tuning?

Yes. p4d.24xlarge (8x A100) is the standard for fine-tuning 7B-13B models. For larger models, p5.48xlarge (8x H100). For budget fine-tuning, trn1 instances with AWS Trainium chips are 50% cheaper. SageMaker handles the orchestration if you prefer managed fine-tuning.

How do I handle instance interruption with Spot?

Use a Spot Fleet with mixed instance types and availability zones. Set up an interruption handler that saves model state and shifts traffic. For inference, run behind a load balancer with on-demand fallback instances.
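A minimal interruption watcher can be sketched as a loop against the instance metadata service (IMDSv2): the `spot/instance-action` path returns 404 until AWS schedules an interruption. The `systemctl stop ollama` drain step is an assumption about your serving setup; substitute whatever stops traffic and saves state:

```shell
# Watch for a Spot interruption notice and drain before the 2-minute deadline
TOKEN=$(curl -s -X PUT http://169.254.169.254/latest/api/token \
  -H "X-aws-ec2-metadata-token-ttl-seconds: 300")
while true; do
  if curl -s -f -H "X-aws-ec2-metadata-token: $TOKEN" \
      http://169.254.169.254/latest/meta-data/spot/instance-action >/dev/null; then
    systemctl stop ollama   # assumption: drain your inference service here
    break
  fi
  sleep 5
done
```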

Is Hetzner or bare-metal cheaper than EC2 for AI?

Significantly. A Hetzner AX102 (dedicated, AMD EPYC, 128GB RAM) costs approximately $85/month. Adding a rented GPU server with RTX 4090 costs $150-250/month. For 24/7 inference without AWS compliance requirements, dedicated servers are 3-5x cheaper than EC2 GPU instances.

Conclusion

Running AI models on EC2 GPU instances gives you full control over the inference stack — custom models, fine-tuning, and dedicated capacity. But that control comes with a cost that needs active management.

Pick the right instance family, use Spot or Reserved pricing, monitor GPU utilization, and stop dev instances after hours. The difference between a well-optimized and a default GPU deployment is hundreds of dollars per month.

For most DevOps teams, start with Bedrock or self-hosted Ollama on existing hardware. Graduate to EC2 GPU instances only when you hit the limits of those approaches.

Need help setting up GPU-accelerated AI infrastructure on AWS? View our AWS Infrastructure Setup service

Read next: How to Cut AWS Costs by 60%: A Complete Optimization Guide

Written by
SysOpX
Battle-tested DevOps & AWS engineering guides