AI + DevOps

AWS Bedrock vs Self-Hosted Ollama: When to Use Each for Production AI

Two Approaches to Production AI

AWS Bedrock gives you access to Claude, Llama, Mistral, and other models through a managed API. No infrastructure to manage, no models to download, no GPUs to provision. You pay per token.

Ollama runs models on your own hardware — a server, a VM, or a Proxmox homelab. You manage everything. You pay for hardware and electricity.

Both work in production. The right choice depends on your team size, usage volume, compliance requirements, and budget. After running both for over a year across client projects, here is when each one wins.

Bedrock and Ollama solve the same problem from opposite directions — managed vs self-hosted

Feature Comparison

| Feature | AWS Bedrock | Self-Hosted Ollama |
|---|---|---|
| Setup time | 10 minutes | 1-4 hours |
| Model selection | Claude, Llama, Mistral, Titan, Cohere | Any GGUF model on HuggingFace |
| Scaling | Automatic | Manual (add servers) |
| Latency | 50-200ms first token | 10-50ms first token (local) |
| Data privacy | AWS region, encrypted | Complete (never leaves your server) |
| Cost model | Per token | Fixed (hardware + electricity) |
| GPU management | None | You manage |
| Fine-tuning | Supported (limited models) | Full control |
| SLA | 99.9% | Depends on your infra |
| Compliance | SOC2, HIPAA, FedRAMP | Whatever you implement |

Cost Comparison: Real Numbers

Based on actual usage from a team of 3 DevOps engineers running 120 AI sessions per month. For detailed token-level pricing, see the full AI cost calculator.

AWS Bedrock Monthly Costs

Claude Sonnet via Bedrock:
  Input:  1.5M tokens x $3.00/1M  = $4.50
  Output: 2.5M tokens x $15.00/1M = $37.50
  Total: $42.00/month

Llama 3.1 70B via Bedrock:
  Input:  1.5M tokens x $2.65/1M  = $3.98
  Output: 2.5M tokens x $3.50/1M  = $8.75
  Total: $12.73/month

Provisioned Throughput (for consistent latency):
  1 model unit: ~$1,800/month

Self-Hosted Ollama Monthly Costs

Existing server (homelab/spare hardware):
  Electricity: $10-20/month
  Total: $10-20/month

Dedicated server (Hetzner AX52):
  Server rental: $75/month (AMD Ryzen 9, 64GB, no GPU)
  Runs: Llama 3.1 8B, DeepSeek R1 14B comfortably
  Total: $75/month

AWS EC2 GPU instance (g5.xlarge):
  On-demand: $1.006/hour x 730 hours = $734/month
  Spot: ~$300-400/month
  Reserved 1-year: ~$450/month

The Break-Even Math

| Usage Level | Bedrock (Llama) | Bedrock (Claude) | Ollama (Hetzner) | Ollama (Homelab) |
|---|---|---|---|---|
| Light (50 sessions) | $5 | $18 | $75 | $15 |
| Medium (120 sessions) | $13 | $42 | $75 | $15 |
| Heavy (500 sessions) | $53 | $175 | $75 | $18 |
| Team of 10 (2000 sessions) | $212 | $700 | $75 | $20 |

Break-even: at the Claude rates above (about $0.35 per session), Ollama on a $75/month dedicated server becomes cheaper than Bedrock Claude at roughly 215 sessions per month. On existing hardware, it is cheaper from day one for anything above trivial usage.
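
The arithmetic is worth making explicit. A quick sketch, using the medium-tier Claude numbers from the table above (swap in your own token mix):

# Back-of-the-envelope break-even: fixed server cost vs per-token API cost
CLAUDE_MONTHLY = 42.00   # Bedrock Claude at 120 sessions/month (table above)
SESSIONS = 120
SERVER_MONTHLY = 75.00   # Hetzner AX52 dedicated server

cost_per_session = CLAUDE_MONTHLY / SESSIONS          # ~$0.35
break_even = SERVER_MONTHLY / cost_per_session
print(f"Dedicated server pays off above {break_even:.0f} sessions/month")  # ~214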

When to Use AWS Bedrock

1. You Need the Best Models

Bedrock gives you access to Claude Opus, Claude Sonnet, and other frontier models that are significantly more capable than any open-source alternative for complex reasoning, architecture review, and nuanced analysis. If model quality matters more than cost, Bedrock wins.

2. You Cannot Manage Infrastructure

If your team is 1-2 people and you are already stretched managing production systems, adding an AI inference server to your maintenance burden is not worth it. Bedrock is zero-ops.

3. You Need Enterprise Compliance

Bedrock integrates with AWS IAM, CloudTrail, VPC endpoints, and KMS encryption. For organizations that need SOC2, HIPAA, or FedRAMP compliance on their AI workloads, this is pre-built. Building equivalent compliance around self-hosted Ollama takes significant effort.

4. Traffic Is Bursty

If your AI usage spikes during incidents or deployments and drops to near-zero otherwise, pay-per-token pricing is more efficient than running a server 24/7.

5. You Want Multi-Model Flexibility

Bedrock lets you switch between Claude, Llama, Mistral, and Cohere with a config change. Testing which model works best for your use case is trivial. With Ollama, each model needs to be downloaded and fit in memory.
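
One way to test this quickly is Bedrock's Converse API, which normalizes the request shape across providers, so switching models is genuinely a one-value change. A minimal sketch (model IDs are illustrative; use whichever models your account has access enabled for):

import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

# Swap this one value to switch providers; the request shape stays identical
MODEL_ID = "anthropic.claude-sonnet-4-20250514-v1:0"  # or "meta.llama3-1-70b-instruct-v1:0"

response = bedrock.converse(
    modelId=MODEL_ID,
    messages=[{"role": "user", "content": [{"text": "What is a VPC?"}]}],
    inferenceConfig={"maxTokens": 512},
)
print(response["output"]["message"]["content"][0]["text"])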

Bedrock integrates directly with AWS security — IAM, CloudTrail, VPC endpoints, KMS

When to Use Self-Hosted Ollama

1. Data Cannot Leave Your Infrastructure

For defense contractors, healthcare systems, financial services, and compliance-sensitive environments, even AWS-hosted data may not meet requirements. Ollama on your own servers means data physically never leaves your control.

2. High-Volume or Predictable Usage

If your team runs 500+ AI sessions per month consistently, fixed-cost infrastructure is dramatically cheaper than per-token pricing. The cost curve flattens while API costs scale linearly.

3. You Want Full Model Control

Custom Modelfiles, quantization choices, context window adjustments, and the ability to run any open-source model the day it releases. Bedrock’s model catalog is curated — Ollama’s is the entire open-source ecosystem.
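
As a sketch of what that control looks like, here is a minimal Ollama Modelfile that layers a system prompt and a larger context window on a base model (names and values are illustrative):

# Modelfile: a custom model on top of Llama 3.1 8B
FROM llama3.1:8b
PARAMETER num_ctx 8192
PARAMETER temperature 0.2
SYSTEM You are a concise DevOps assistant. Prefer runnable commands over prose.

Build it with ollama create devops-helper -f Modelfile, then call it like any other model.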

4. Latency Matters

Local inference on the same network eliminates API round-trip latency. For real-time applications or AI-powered Slack bots, the 10-50ms first-token time on local Ollama beats Bedrock’s 50-200ms.

5. You Already Have Hardware

Running Ollama on an existing server, homelab, or spare machine costs only electricity. The marginal cost of adding AI inference to existing infrastructure is near zero.

The Hybrid Approach

The best setup uses both:

Daily Tasks (high volume, lower complexity)
  → Ollama (Llama 3.1 8B, DeepSeek R1)
  → Cost: fixed, minimal

Complex Tasks (architecture, security review, incidents)
  → AWS Bedrock (Claude Sonnet/Opus)
  → Cost: per-token, only when needed

Sensitive Data (client code, internal docs)
  → Ollama (always)
  → Cost: fixed, data stays local

Implementation with a Simple Router

import json

import boto3
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")


def query_ai(prompt: str, complexity: str = "standard", sensitive: bool = False) -> str:
    """Route a prompt to local Ollama or AWS Bedrock based on sensitivity and complexity."""
    # Sensitive data always goes to local Ollama -- it never leaves the server
    if sensitive:
        return query_ollama(prompt, model="llama3.1:8b")

    # Complex tasks go to Bedrock for frontier-model quality
    if complexity == "complex":
        return query_bedrock(prompt, model_id="anthropic.claude-sonnet-4-20250514-v1:0")

    # Everything else goes to the cheapest option: local Ollama
    return query_ollama(prompt, model="llama3.1:8b")


def query_ollama(prompt: str, model: str) -> str:
    """Call the local Ollama generate endpoint and return the full response text."""
    response = requests.post(
        OLLAMA_URL,
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=300,  # local generation on CPU can be slow; don't hang forever
    )
    response.raise_for_status()
    return response.json()["response"]


def query_bedrock(prompt: str, model_id: str) -> str:
    """Call Claude on Bedrock via InvokeModel and return the first text block."""
    body = json.dumps({
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 4096,
        "messages": [{"role": "user", "content": prompt}],
    })
    response = bedrock.invoke_model(modelId=model_id, body=body)
    result = json.loads(response["body"].read())
    return result["content"][0]["text"]

This router sends sensitive data to Ollama, complex queries to Bedrock, and everything else to the cheapest option. The cost savings compound over months.
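
In practice the call sites stay trivial:

# Routine, high-volume work stays local
summary = query_ai("Summarize this deploy log: ...")

# Incident analysis is worth frontier-model quality
postmortem = query_ai("What failure modes does this design have? ...", complexity="complex")

# Client code never leaves the server, regardless of cost
review = query_ai("Review this internal auth module: ...", sensitive=True)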

Bedrock Setup (5 Minutes)

# Enable model access in AWS Console:
# Bedrock → Model Access → Request Access → Select models

# Test with the AWS CLI (v2 requires --cli-binary-format to accept raw JSON in --body)
aws bedrock-runtime invoke-model \
  --model-id anthropic.claude-sonnet-4-20250514-v1:0 \
  --content-type application/json \
  --cli-binary-format raw-in-base64-out \
  --body '{"anthropic_version":"bedrock-2023-05-31","max_tokens":256,"messages":[{"role":"user","content":"What is a VPC?"}]}' \
  output.json

jq '.content[0].text' output.json

Bedrock VPC Endpoint (for Private Access)

Keep Bedrock traffic off the public internet:

# Interface endpoint keeps Bedrock API calls on the AWS backbone
resource "aws_vpc_endpoint" "bedrock" {
  vpc_id              = aws_vpc.main.id
  service_name        = "com.amazonaws.us-east-1.bedrock-runtime"
  vpc_endpoint_type   = "Interface"
  subnet_ids          = aws_subnet.private[*].id
  security_group_ids  = [aws_security_group.bedrock_endpoint.id] # allow 443 from your VPC
  private_dns_enabled = true
}

Performance Comparison

Tested with identical prompts — “Generate a Terraform VPC module with 3 public and 3 private subnets”:

| Metric | Bedrock (Claude Sonnet) | Bedrock (Llama 70B) | Ollama (Llama 8B) | Ollama (DeepSeek R1 14B) |
|---|---|---|---|---|
| First token | 180ms | 150ms | 25ms | 35ms |
| Total response | 8.2 sec | 6.1 sec | 12.4 sec | 18.6 sec |
| Output quality | Excellent | Good | Good | Very Good |
| Tokens generated | ~2100 | ~1800 | ~1600 | ~2000 |

Key insight: Bedrock has higher first-token latency (network round trip) but faster total generation (enterprise GPUs). Ollama has near-instant first token but slower generation on CPU. With a local GPU, Ollama matches Bedrock speed.

Security Comparison

| Security Aspect | AWS Bedrock | Self-Hosted Ollama |
|---|---|---|
| Data in transit | TLS 1.2+ (AWS managed) | Your TLS config |
| Data at rest | KMS encryption | Your encryption |
| Access control | IAM policies + roles | Your auth layer |
| Audit logging | CloudTrail automatic | Your logging |
| Network isolation | VPC endpoints | Your network |
| Data residency | AWS region choice | Physical location |
| Prompt logging | Opt-out available | You control |

Both can be made secure. Bedrock is secure by default. Ollama requires you to build security around it.

Key Takeaways

  • Bedrock wins on quality (frontier models), compliance (pre-built), and zero-ops (managed)
  • Ollama wins on cost (fixed), privacy (complete), latency (local), and flexibility (any model)
  • The hybrid approach gives the best of both — Ollama for volume, Bedrock for complexity
  • Break-even against Bedrock Claude is roughly 215 sessions/month on a dedicated server
  • Sensitive data should always go to self-hosted, regardless of cost
  • Use Bedrock VPC endpoints to keep API traffic private within AWS
  • For teams under 3 people with light usage, Bedrock-only is simpler and cost-effective

FAQ

Can I use the same code for both Bedrock and Ollama?

Nearly. Ollama natively exposes an OpenAI-compatible /v1/chat/completions endpoint. Bedrock requires the AWS SDK, but the message format is similar, and the router pattern shown above abstracts the difference with minimal code.
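
For example, the standard openai Python client can point straight at a local Ollama server (assuming the openai package is installed and Ollama is running on its default port):

from openai import OpenAI

# Ollama ignores the API key, but the client requires one to be set
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

reply = client.chat.completions.create(
    model="llama3.1:8b",
    messages=[{"role": "user", "content": "What is a VPC?"}],
)
print(reply.choices[0].message.content)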

Does Bedrock support fine-tuning?

Yes, for select models (Llama, Titan). You upload training data to S3, configure a fine-tuning job, and Bedrock handles the training infrastructure. It is more limited than fine-tuning locally but requires zero GPU management.
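
For reference, a rough boto3 sketch of kicking off a customization job; the bucket, role, and job names are placeholders, and the available hyperparameters vary by base model:

import boto3

# Fine-tuning goes through the "bedrock" control-plane client, not "bedrock-runtime"
bedrock = boto3.client("bedrock", region_name="us-east-1")

bedrock.create_model_customization_job(
    jobName="llama-devops-ft-001",                           # placeholder
    customModelName="llama-3-1-8b-devops",                   # placeholder
    roleArn="arn:aws:iam::123456789012:role/BedrockFtRole",  # placeholder IAM role
    baseModelIdentifier="meta.llama3-1-8b-instruct-v1:0",
    trainingDataConfig={"s3Uri": "s3://my-bucket/train.jsonl"},  # placeholder bucket
    outputDataConfig={"s3Uri": "s3://my-bucket/output/"},        # placeholder bucket
    hyperParameters={"epochCount": "2"},  # names/values depend on the base model
)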

Is Bedrock HIPAA compliant?

Yes. AWS Bedrock is HIPAA-eligible when used with a signed BAA (Business Associate Agreement). Enable CloudTrail logging, use VPC endpoints, and opt out of model improvement data sharing. For healthcare AI workloads on AWS, Bedrock is the compliant path.

Can Ollama handle multiple concurrent users?

Yes, but performance depends on hardware. On a 32GB RAM machine with an RTX 4070, Ollama handles 3-5 concurrent requests with Llama 3.1 8B at acceptable latency. For 10+ concurrent users, you need either a high-end GPU (A100, H100) or multiple Ollama instances behind a load balancer.

Should I run Ollama on AWS EC2 instead of on-premise?

You can, but it partly defeats the purpose. EC2 GPU instances are expensive ($450-734/month for g5.xlarge). If data privacy is not the primary concern, Bedrock is cheaper and simpler than running Ollama on EC2. Self-hosted Ollama makes the most sense on your own hardware or cheap dedicated servers (Hetzner, OVH).

Conclusion

The Bedrock vs Ollama decision is not about which is better — it is about which combination fits your constraints. Most production setups benefit from both: Ollama handling the high-volume daily work at fixed cost, Bedrock providing frontier model quality when complexity demands it.

Start with your primary constraint — budget, privacy, or quality — and let that guide the split. The hybrid approach with the router pattern gives you flexibility to adjust as your needs evolve.

Need help architecting your AI infrastructure on AWS? View our AWS Infrastructure Setup service

Read next: How to Run DeepSeek R1 Locally with Ollama

Written by
SysOpX
Battle-tested DevOps & AWS engineering guides