Two Approaches to Production AI
AWS Bedrock gives you access to Claude, Llama, Mistral, and other models through a managed API. No infrastructure to manage, no models to download, no GPUs to provision. You pay per token.
Ollama runs models on your own hardware — a server, a VM, or a Proxmox homelab. You manage everything. You pay for hardware and electricity.
Both work in production. The right choice depends on your team size, usage volume, compliance requirements, and budget. After running both for over a year across client projects, here is when each one wins.
Bedrock and Ollama solve the same problem from opposite directions — managed vs self-hosted
Feature Comparison
| Feature | AWS Bedrock | Self-Hosted Ollama |
|---|---|---|
| Setup time | 10 minutes | 1-4 hours |
| Model selection | Claude, Llama, Mistral, Titan, Cohere | Any GGUF model on HuggingFace |
| Scaling | Automatic | Manual (add servers) |
| Latency | 50-200ms first token | 10-50ms first token (local) |
| Data privacy | AWS region, encrypted | Complete (never leaves your server) |
| Cost model | Per token | Fixed (hardware + electricity) |
| GPU management | None | You manage |
| Fine-tuning | Supported (limited models) | Full control |
| SLA | 99.9% | Depends on your infra |
| Compliance | SOC2, HIPAA, FedRAMP | Whatever you implement |
Cost Comparison: Real Numbers
Based on actual usage from a team of 3 DevOps engineers running 120 AI sessions per month. For detailed token-level pricing, see the full AI cost calculator.
AWS Bedrock Monthly Costs
```
Claude Sonnet via Bedrock:
  Input:  1.5M tokens x $3.00/1M  = $4.50
  Output: 2.5M tokens x $15.00/1M = $37.50
  Total:  $42.00/month

Llama 3.1 70B via Bedrock:
  Input:  1.5M tokens x $2.65/1M = $3.98
  Output: 2.5M tokens x $3.50/1M = $8.75
  Total:  $12.73/month

Provisioned Throughput (for consistent latency):
  1 model unit: ~$1,800/month
```
Self-Hosted Ollama Monthly Costs
```
Existing server (homelab/spare hardware):
  Electricity: $10-20/month
  Total: $10-20/month

Dedicated server (Hetzner AX52):
  Server rental: $75/month (AMD Ryzen 9, 64GB, no GPU)
  Runs: Llama 3.1 8B, DeepSeek R1 14B comfortably
  Total: $75/month

AWS EC2 GPU instance (g5.xlarge):
  On-demand: $1.006/hour x 730 hours = $734/month
  Spot: ~$300-400/month
  Reserved 1-year: ~$450/month
```
The Break-Even Math
| Usage Level | Bedrock (Llama) | Bedrock (Claude) | Ollama (Hetzner) | Ollama (Homelab) |
|---|---|---|---|---|
| Light (50 sessions) | $5 | $18 | $75 | $15 |
| Medium (120 sessions) | $13 | $42 | $75 | $15 |
| Heavy (500 sessions) | $53 | $175 | $75 | $18 |
| Team of 10 (2000 sessions) | $212 | $700 | $75 | $20 |
Break-even: at roughly $0.35 per Claude Sonnet session (from the table above), Ollama on a $75/month dedicated server becomes cheaper than Bedrock Claude at about 215 sessions/month. On existing hardware, it is cheaper from day one for any usage above trivial.
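The break-even point is easy to sanity-check in a few lines of Python. The per-session cost comes from the table above and the $75/month is the Hetzner server price; both are assumptions to swap for your own numbers:

```python
import math

# Figures from the cost tables above (adjust to your own usage)
claude_cost_per_session = 42.00 / 120  # Bedrock Claude: $42/month at 120 sessions
server_cost_per_month = 75.00          # Hetzner AX52 dedicated server

# Sessions per month at which fixed-cost hardware beats per-token pricing
break_even = math.ceil(server_cost_per_month / claude_cost_per_session)
print(break_even)  # → 215
```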
When to Use AWS Bedrock
1. You Need the Best Models
Bedrock gives you access to Claude Opus, Claude Sonnet, and other frontier models that generally outperform open-weight alternatives at complex reasoning, architecture review, and nuanced analysis. If model quality matters more than cost, Bedrock wins.
2. You Cannot Manage Infrastructure
If your team is 1-2 people and you are already stretched managing production systems, adding an AI inference server to your maintenance burden is not worth it. Bedrock is zero-ops.
3. You Need Enterprise Compliance
Bedrock integrates with AWS IAM, CloudTrail, VPC endpoints, and KMS encryption. For organizations that need SOC2, HIPAA, or FedRAMP compliance on their AI workloads, this is pre-built. Building equivalent compliance around self-hosted Ollama takes significant effort.
4. Traffic Is Bursty
If your AI usage spikes during incidents or deployments and drops to near-zero otherwise, pay-per-token pricing is more efficient than running a server 24/7.
5. You Want Multi-Model Flexibility
Bedrock lets you switch between Claude, Llama, Mistral, and Cohere with a config change. Testing which model works best for your use case is trivial. With Ollama, each model needs to be downloaded and fit in memory.
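In practice that flexibility is just a config lookup. The sketch below maps task types to Bedrock model IDs; the IDs shown are illustrative, so confirm the exact strings for your region in the Bedrock console:

```python
# Illustrative mapping of task type to Bedrock model ID.
# Switching models is a one-line config change here, with no
# inference infrastructure to redeploy or re-download.
MODELS = {
    "reasoning": "anthropic.claude-sonnet-4-20250514-v1:0",
    "bulk": "meta.llama3-1-70b-instruct-v1:0",
    "default": "mistral.mistral-large-2402-v1:0",
}

def model_for(task: str) -> str:
    """Return the Bedrock model ID for a task type, falling back to default."""
    return MODELS.get(task, MODELS["default"])

print(model_for("reasoning"))  # → anthropic.claude-sonnet-4-20250514-v1:0
```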
Bedrock integrates directly with AWS security — IAM, CloudTrail, VPC endpoints, KMS
When to Use Self-Hosted Ollama
1. Data Cannot Leave Your Infrastructure
For defense contractors, healthcare systems, financial services, and compliance-sensitive environments, even AWS-hosted data may not meet requirements. Ollama on your own servers means data physically never leaves your control.
2. High-Volume or Predictable Usage
If your team runs 500+ AI sessions per month consistently, fixed-cost infrastructure is dramatically cheaper than per-token pricing. The cost curve flattens while API costs scale linearly.
3. You Want Full Model Control
Custom Modelfiles, quantization choices, context window adjustments, and the ability to run any open-source model the day it releases. Bedrock’s model catalog is curated — Ollama’s is the entire open-source ecosystem.
4. Latency Matters
Local inference on the same network eliminates API round-trip latency. For real-time applications or AI-powered Slack bots, the 10-50ms first-token time on local Ollama beats Bedrock’s 50-200ms.
5. You Already Have Hardware
Running Ollama on an existing server, homelab, or spare machine costs only electricity. The marginal cost of adding AI inference to existing infrastructure is near zero.
The Hybrid Architecture (Recommended)
The best setup uses both:
```
Daily Tasks (high volume, lower complexity)
  → Ollama (Llama 3.1 8B, DeepSeek R1)
  → Cost: fixed, minimal

Complex Tasks (architecture, security review, incidents)
  → AWS Bedrock (Claude Sonnet/Opus)
  → Cost: per-token, only when needed

Sensitive Data (client code, internal docs)
  → Ollama (always)
  → Cost: fixed, data stays local
```
Implementation with a Simple Router
```python
import requests
import boto3
import json

OLLAMA_URL = "http://localhost:11434/api/generate"
bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

def query_ai(prompt: str, complexity: str = "standard", sensitive: bool = False):
    # Sensitive data always goes to local Ollama
    if sensitive:
        return query_ollama(prompt, model="llama3.1:8b")
    # Complex tasks go to Bedrock
    if complexity == "complex":
        return query_bedrock(prompt, model_id="anthropic.claude-sonnet-4-20250514-v1:0")
    # Everything else goes to local Ollama
    return query_ollama(prompt, model="llama3.1:8b")

def query_ollama(prompt, model):
    response = requests.post(OLLAMA_URL, json={
        "model": model,
        "prompt": prompt,
        "stream": False,
    }, timeout=120)
    response.raise_for_status()
    return response.json()["response"]

def query_bedrock(prompt, model_id):
    body = json.dumps({
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 4096,
        "messages": [{"role": "user", "content": prompt}],
    })
    response = bedrock.invoke_model(modelId=model_id, body=body)
    result = json.loads(response["body"].read())
    return result["content"][0]["text"]
```
This router sends sensitive data to Ollama, complex queries to Bedrock, and everything else to the cheapest option. The cost savings compound over months.
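The routing decision itself can be pulled out and unit-tested without touching either backend. A sketch of the same priority order (sensitivity first, then complexity):

```python
def choose_backend(complexity: str = "standard", sensitive: bool = False) -> str:
    """Mirror of the router's decision order: sensitivity wins, then complexity."""
    if sensitive:
        return "ollama"   # data never leaves local infrastructure
    if complexity == "complex":
        return "bedrock"  # frontier model quality, per-token cost
    return "ollama"       # cheap fixed-cost default for everything else

# Sensitive + complex still stays local: privacy outranks quality
print(choose_backend(complexity="complex", sensitive=True))  # → ollama
```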
Bedrock Setup (5 Minutes)
```bash
# Enable model access in the AWS Console:
# Bedrock → Model Access → Request Access → Select models

# Test with the AWS CLI (v2 needs --cli-binary-format for a raw JSON body)
aws bedrock-runtime invoke-model \
  --model-id anthropic.claude-sonnet-4-20250514-v1:0 \
  --content-type application/json \
  --cli-binary-format raw-in-base64-out \
  --body '{"anthropic_version":"bedrock-2023-05-31","max_tokens":256,"messages":[{"role":"user","content":"What is a VPC?"}]}' \
  output.json

jq -r '.content[0].text' output.json
```
Bedrock VPC Endpoint (for Private Access)
Keep Bedrock traffic off the public internet:
```hcl
resource "aws_vpc_endpoint" "bedrock" {
  vpc_id              = aws_vpc.main.id
  service_name        = "com.amazonaws.us-east-1.bedrock-runtime"
  vpc_endpoint_type   = "Interface"
  subnet_ids          = aws_subnet.private[*].id
  security_group_ids  = [aws_security_group.bedrock_endpoint.id]
  private_dns_enabled = true
}
```
Performance Comparison
Tested with identical prompts — “Generate a Terraform VPC module with 3 public and 3 private subnets”:
| Metric | Bedrock (Claude Sonnet) | Bedrock (Llama 70B) | Ollama (Llama 8B) | Ollama (DeepSeek R1 14B) |
|---|---|---|---|---|
| First token | 180ms | 150ms | 25ms | 35ms |
| Total response | 8.2 sec | 6.1 sec | 12.4 sec | 18.6 sec |
| Output quality | Excellent | Good | Good | Very Good |
| Tokens generated | ~2100 | ~1800 | ~1600 | ~2000 |
Key insight: Bedrock has higher first-token latency (network round trip) but faster total generation (enterprise GPUs). Ollama has a near-instant first token but slower generation on CPU. With a local GPU, Ollama can approach Bedrock's generation speed.
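To reproduce the first-token numbers on your own hardware, the timing logic is small. It is sketched here with an injectable token stream so it runs offline; point it at Ollama's streaming output ("stream": true) to measure the real thing:

```python
import time

def first_token_latency(token_stream):
    """Seconds from call to the first yielded token, or None for an empty stream."""
    start = time.perf_counter()
    for _token in token_stream:
        return time.perf_counter() - start
    return None

# Offline demo with a fake stream; swap in the line iterator from
# requests.post(OLLAMA_URL, json={..., "stream": True}, stream=True)
latency = first_token_latency(iter(["Hello", " world"]))
print(latency is not None and latency >= 0)  # → True
```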
Security Comparison
| Security Aspect | AWS Bedrock | Self-Hosted Ollama |
|---|---|---|
| Data in transit | TLS 1.2+ (AWS managed) | Your TLS config |
| Data at rest | KMS encryption | Your encryption |
| Access control | IAM policies + roles | Your auth layer |
| Audit logging | CloudTrail automatic | Your logging |
| Network isolation | VPC endpoints | Your network |
| Data residency | AWS region choice | Physical location |
| Prompt logging | Opt-out available | You control |
Both can be made secure. Bedrock is secure by default. Ollama requires you to build security around it.
Key Takeaways
- Bedrock wins on quality (frontier models), compliance (pre-built), and zero-ops (managed)
- Ollama wins on cost (fixed), privacy (complete), latency (local), and flexibility (any model)
- The hybrid approach gives the best of both — Ollama for volume, Bedrock for complexity
- Break-even against Bedrock Claude is roughly 215 sessions/month on a dedicated server
- Sensitive data should always go to self-hosted, regardless of cost
- Use Bedrock VPC endpoints to keep API traffic private within AWS
- For teams under 3 people with light usage, Bedrock-only is simpler and cost-effective
FAQ
Can I use the same code for both Bedrock and Ollama?
Nearly. Both support OpenAI-compatible API formats. Ollama natively exposes /v1/chat/completions. Bedrock requires the AWS SDK but the message format is similar. The router pattern shown above abstracts the difference with minimal code.
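As an example, the request body Ollama's OpenAI-compatible endpoint accepts is plain chat-completions JSON. The sketch below only builds and inspects the payload, since actually sending it assumes a local Ollama is running:

```python
import json

OLLAMA_OPENAI_URL = "http://localhost:11434/v1/chat/completions"

# Standard chat-completions payload; Ollama accepts the same shape the
# OpenAI SDK produces, so client code ports with a base_url change.
payload = {
    "model": "llama3.1:8b",
    "messages": [{"role": "user", "content": "What is a VPC?"}],
    "stream": False,
}
body = json.dumps(payload)

# With Ollama running:
#   requests.post(OLLAMA_OPENAI_URL, data=body,
#                 headers={"Content-Type": "application/json"})
print(json.loads(body)["model"])  # → llama3.1:8b
```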
Does Bedrock support fine-tuning?
Yes, for select models (Llama, Titan). You upload training data to S3, configure a fine-tuning job, and Bedrock handles the training infrastructure. It is more limited than fine-tuning locally but requires zero GPU management.
Is Bedrock HIPAA compliant?
Yes. AWS Bedrock is HIPAA-eligible when used with a signed BAA (Business Associate Agreement). Enable CloudTrail logging, use VPC endpoints, and opt out of model improvement data sharing. For healthcare AI workloads on AWS, Bedrock is the compliant path.
Can Ollama handle multiple concurrent users?
Yes, but performance depends on hardware. On a 32GB RAM machine with an RTX 4070, Ollama handles 3-5 concurrent requests with Llama 3.1 8B at acceptable latency. For 10+ concurrent users, you need either a high-end GPU (A100, H100) or multiple Ollama instances behind a load balancer.
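Concurrency is tunable through environment variables in recent Ollama releases (OLLAMA_NUM_PARALLEL and OLLAMA_MAX_LOADED_MODELS; verify support in your installed version). A configuration sketch:

```bash
# Assumes a recent Ollama release that reads these variables
export OLLAMA_NUM_PARALLEL=4        # concurrent requests per loaded model
export OLLAMA_MAX_LOADED_MODELS=2   # models kept in memory at once
ollama serve
```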
Should I run Ollama on AWS EC2 instead of on-premise?
You can, but it partly defeats the purpose. EC2 GPU instances are expensive ($450-734/month for g5.xlarge). If data privacy is not the primary concern, Bedrock is cheaper and simpler than running Ollama on EC2. Self-hosted Ollama makes the most sense on your own hardware or cheap dedicated servers (Hetzner, OVH).
Conclusion
The Bedrock vs Ollama decision is not about which is better — it is about which combination fits your constraints. Most production setups benefit from both: Ollama handling the high-volume daily work at fixed cost, Bedrock providing frontier model quality when complexity demands it.
Start with your primary constraint — budget, privacy, or quality — and let that guide the split. The hybrid approach with the router pattern gives you flexibility to adjust as your needs evolve.
Need help architecting your AI infrastructure on AWS? View our AWS Infrastructure Setup service
Read next: How to Run DeepSeek R1 Locally with Ollama