Two Approaches to Production AI
AWS Bedrock gives you access to Claude, Llama, Mistral, and other models through a managed API. No infrastructure to manage, no models to download, no GPUs to provision. You pay per token.
Ollama runs models on your own hardware — a server, a VM, or a Proxmox homelab. You manage everything. You pay for hardware and electricity.
Both work in production. The right choice depends on your team size, usage volume, compliance requirements, and budget. After running both for over a year across client projects, here is when each one wins.
Bedrock and Ollama solve the same problem from opposite directions — managed vs self-hosted
Feature Comparison
| Feature | AWS Bedrock | Self-Hosted Ollama |
|---|---|---|
| Setup time | 10 minutes | 1-4 hours |
| Model selection | Claude, Llama, Mistral, Titan, Cohere | Any GGUF model on HuggingFace |
| Scaling | Automatic | Manual (add servers) |
| Latency | 50-200ms first token | 10-50ms first token (local) |
| Data privacy | AWS region, encrypted | Complete (never leaves your server) |
| Cost model | Per token | Fixed (hardware + electricity) |
| GPU management | None | You manage |
| Fine-tuning | Supported (limited models) | Full control |
| SLA | 99.9% | Depends on your infra |
| Compliance | SOC2, HIPAA, FedRAMP | Whatever you implement |
Cost Comparison: Real Numbers
Based on actual usage from a team of 3 DevOps engineers running 120 AI sessions per month. For detailed token-level pricing, see the full AI cost calculator.
AWS Bedrock Monthly Costs
```
Claude Sonnet via Bedrock:
  Input:  1.5M tokens x $3.00/1M  = $4.50
  Output: 2.5M tokens x $15.00/1M = $37.50
  Total:  $42.00/month

Llama 3.1 70B via Bedrock:
  Input:  1.5M tokens x $2.65/1M = $3.98
  Output: 2.5M tokens x $3.50/1M = $8.75
  Total:  $12.73/month

Provisioned Throughput (for consistent latency):
  1 model unit: ~$1,800/month
```
Self-Hosted Ollama Monthly Costs
```
Existing server (homelab/spare hardware):
  Electricity: $10-20/month
  Total: $10-20/month

Dedicated server (Hetzner AX52):
  Server rental: $75/month (AMD Ryzen 9, 64GB, no GPU)
  Runs: Llama 3.1 8B, DeepSeek R1 14B comfortably
  Total: $75/month

AWS EC2 GPU instance (g5.xlarge):
  On-demand: $1.006/hour x 730 hours = $734/month
  Spot: ~$300-400/month
  Reserved 1-year: ~$450/month
```
The Break-Even Math
| Usage Level | Bedrock (Llama) | Bedrock (Claude) | Ollama (Hetzner) | Ollama (Homelab) |
|---|---|---|---|---|
| Light (50 sessions) | $5 | $18 | $75 | $15 |
| Medium (120 sessions) | $13 | $42 | $75 | $15 |
| Heavy (500 sessions) | $53 | $175 | $75 | $18 |
| Team of 10 (2000 sessions) | $212 | $700 | $75 | $20 |
Break-even: at roughly $0.35 per Claude Sonnet session (from the table above), Ollama on a $75/month dedicated server becomes cheaper than Bedrock Claude at about 215 sessions/month. On existing hardware, it is cheaper from day one for any usage above trivial.
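The break-even point is easy to sanity-check in a few lines of Python. The per-session cost comes from the table above and the $75/month is the Hetzner server price; both are assumptions to swap for your own numbers:

```python
import math

# Figures from the cost tables above (adjust to your own usage)
claude_cost_per_session = 42.00 / 120  # Bedrock Claude: $42/month at 120 sessions
server_cost_per_month = 75.00          # Hetzner AX52 dedicated server

# Sessions per month at which fixed-cost hardware beats per-token pricing
break_even = math.ceil(server_cost_per_month / claude_cost_per_session)
print(break_even)  # → 215
```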
When to Use AWS Bedrock
1. You Need the Best Models
Bedrock gives you access to Claude Opus, Claude Sonnet, and other frontier models that generally outperform open-weight alternatives at complex reasoning, architecture review, and nuanced analysis. If model quality matters more than cost, Bedrock wins.
2. You Cannot Manage Infrastructure
If your team is 1-2 people and you are already stretched managing production systems, adding an AI inference server to your maintenance burden is not worth it. Bedrock is zero-ops.
3. You Need Enterprise Compliance
Bedrock integrates with AWS IAM, CloudTrail, VPC endpoints, and KMS encryption. For organizations that need SOC2, HIPAA, or FedRAMP compliance on their AI workloads, this is pre-built. Building equivalent compliance around self-hosted Ollama takes significant effort.
4. Traffic Is Bursty
If your AI usage spikes during incidents or deployments and drops to near-zero otherwise, pay-per-token pricing is more efficient than running a server 24/7.
5. You Want Multi-Model Flexibility
Bedrock lets you switch between Claude, Llama, Mistral, and Cohere with a config change. Testing which model works best for your use case is trivial. With Ollama, each model needs to be downloaded and fit in memory.
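In practice that flexibility is just a config lookup. The sketch below maps task types to Bedrock model IDs; the IDs shown are illustrative, so confirm the exact strings for your region in the Bedrock console:

```python
# Illustrative mapping of task type to Bedrock model ID.
# Switching models is a one-line config change here, with no
# inference infrastructure to redeploy or re-download.
MODELS = {
    "reasoning": "anthropic.claude-sonnet-4-20250514-v1:0",
    "bulk": "meta.llama3-1-70b-instruct-v1:0",
    "default": "mistral.mistral-large-2402-v1:0",
}

def model_for(task: str) -> str:
    """Return the Bedrock model ID for a task type, falling back to default."""
    return MODELS.get(task, MODELS["default"])

print(model_for("reasoning"))  # → anthropic.claude-sonnet-4-20250514-v1:0
```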
Bedrock integrates directly with AWS security — IAM, CloudTrail, VPC endpoints, KMS
When to Use Self-Hosted Ollama
1. Data Cannot Leave Your Infrastructure
For defense contractors, healthcare systems, financial services, and compliance-sensitive environments, even AWS-hosted data may not meet requirements. Ollama on your own servers means data physically never leaves your control.
2. High-Volume or Predictable Usage
If your team runs 500+ AI sessions per month consistently, fixed-cost infrastructure is dramatically cheaper than per-token pricing. The cost curve flattens while API costs scale linearly.
3. You Want Full Model Control
Custom Modelfiles, quantization choices, context window adjustments, and the ability to run any open-source model the day it releases. Bedrock’s model catalog is curated — Ollama’s is the entire open-source ecosystem.
4. Latency Matters
Local inference on the same network eliminates API round-trip latency. For real-time applications or AI-powered Slack bots, the 10-50ms first-token time on local Ollama beats Bedrock’s 50-200ms.
5. You Already Have Hardware
Running Ollama on an existing server, homelab, or spare machine costs only electricity. The marginal cost of adding AI inference to existing infrastructure is near zero.
The Hybrid Architecture (Recommended)
The best setup uses both:
```
Daily Tasks (high volume, lower complexity)
  → Ollama (Llama 3.1 8B, DeepSeek R1)
  → Cost: fixed, minimal

Complex Tasks (architecture, security review, incidents)
  → AWS Bedrock (Claude Sonnet/Opus)
  → Cost: per-token, only when needed

Sensitive Data (client code, internal docs)
  → Ollama (always)
  → Cost: fixed, data stays local
```
Implementation with a Simple Router
```python
import requests
import boto3
import json

OLLAMA_URL = "http://localhost:11434/api/generate"
bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

def query_ai(prompt: str, complexity: str = "standard", sensitive: bool = False):
    # Sensitive data always goes to local Ollama
    if sensitive:
        return query_ollama(prompt, model="llama3.1:8b")
    # Complex tasks go to Bedrock
    if complexity == "complex":
        return query_bedrock(prompt, model_id="anthropic.claude-sonnet-4-20250514-v1:0")
    # Everything else goes to local Ollama
    return query_ollama(prompt, model="llama3.1:8b")

def query_ollama(prompt, model):
    response = requests.post(OLLAMA_URL, json={
        "model": model,
        "prompt": prompt,
        "stream": False,
    }, timeout=120)
    response.raise_for_status()
    return response.json()["response"]

def query_bedrock(prompt, model_id):
    body = json.dumps({
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 4096,
        "messages": [{"role": "user", "content": prompt}],
    })
    response = bedrock.invoke_model(modelId=model_id, body=body)
    result = json.loads(response["body"].read())
    return result["content"][0]["text"]
```
This router sends sensitive data to Ollama, complex queries to Bedrock, and everything else to the cheapest option. The cost savings compound over months.
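The routing decision itself can be pulled out and unit-tested without touching either backend. A sketch of the same priority order (sensitivity first, then complexity):

```python
def choose_backend(complexity: str = "standard", sensitive: bool = False) -> str:
    """Mirror of the router's decision order: sensitivity wins, then complexity."""
    if sensitive:
        return "ollama"   # data never leaves local infrastructure
    if complexity == "complex":
        return "bedrock"  # frontier model quality, per-token cost
    return "ollama"       # cheap fixed-cost default for everything else

# Sensitive + complex still stays local: privacy outranks quality
print(choose_backend(complexity="complex", sensitive=True))  # → ollama
```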
Bedrock Setup (5 Minutes)
```bash
# Enable model access in the AWS Console:
# Bedrock → Model Access → Request Access → Select models

# Test with the AWS CLI (v2 needs --cli-binary-format for a raw JSON body)
aws bedrock-runtime invoke-model \
  --model-id anthropic.claude-sonnet-4-20250514-v1:0 \
  --content-type application/json \
  --cli-binary-format raw-in-base64-out \
  --body '{"anthropic_version":"bedrock-2023-05-31","max_tokens":256,"messages":[{"role":"user","content":"What is a VPC?"}]}' \
  output.json

jq -r '.content[0].text' output.json
```
Bedrock VPC Endpoint (for Private Access)
Keep Bedrock traffic off the public internet:
```hcl
resource "aws_vpc_endpoint" "bedrock" {
  vpc_id              = aws_vpc.main.id
  service_name        = "com.amazonaws.us-east-1.bedrock-runtime"
  vpc_endpoint_type   = "Interface"
  subnet_ids          = aws_subnet.private[*].id
  security_group_ids  = [aws_security_group.bedrock_endpoint.id]
  private_dns_enabled = true
}
```
Performance Comparison
Tested with identical prompts — “Generate a Terraform VPC module with 3 public and 3 private subnets”:
| Metric | Bedrock (Claude Sonnet) | Bedrock (Llama 70B) | Ollama (Llama 8B) | Ollama (DeepSeek R1 14B) |
|---|---|---|---|---|
| First token | 180ms | 150ms | 25ms | 35ms |
| Total response | 8.2 sec | 6.1 sec | 12.4 sec | 18.6 sec |
| Output quality | Excellent | Good | Good | Very Good |
| Tokens generated | ~2100 | ~1800 | ~1600 | ~2000 |
Key insight: Bedrock has higher first-token latency (network round trip) but faster total generation (enterprise GPUs). Ollama has a near-instant first token but slower generation on CPU. With a local GPU, Ollama can approach Bedrock's generation speed.
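To reproduce the first-token numbers on your own hardware, the timing logic is small. It is sketched here with an injectable token stream so it runs offline; point it at Ollama's streaming output ("stream": true) to measure the real thing:

```python
import time

def first_token_latency(token_stream):
    """Seconds from call to the first yielded token, or None for an empty stream."""
    start = time.perf_counter()
    for _token in token_stream:
        return time.perf_counter() - start
    return None

# Offline demo with a fake stream; swap in the line iterator from
# requests.post(OLLAMA_URL, json={..., "stream": True}, stream=True)
latency = first_token_latency(iter(["Hello", " world"]))
print(latency is not None and latency >= 0)  # → True
```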
Security Comparison
| Security Aspect | AWS Bedrock | Self-Hosted Ollama |
|---|---|---|
| Data in transit | TLS 1.2+ (AWS managed) | Your TLS config |
| Data at rest | KMS encryption | Your encryption |
| Access control | IAM policies + roles | Your auth layer |
| Audit logging | CloudTrail automatic | Your logging |
| Network isolation | VPC endpoints | Your network |
| Data residency | AWS region choice | Physical location |
| Prompt logging | Opt-out available | You control |
Both can be made secure. Bedrock is secure by default. Ollama requires you to build security around it.
Key Takeaways
- Bedrock wins on quality (frontier models), compliance (pre-built), and zero-ops (managed)
- Ollama wins on cost (fixed), privacy (complete), latency (local), and flexibility (any model)
- The hybrid approach gives the best of both — Ollama for volume, Bedrock for complexity
- Break-even against Bedrock Claude is roughly 215 sessions/month on a dedicated server
- Sensitive data should always go to self-hosted, regardless of cost
- Use Bedrock VPC endpoints to keep API traffic private within AWS
- For teams under 3 people with light usage, Bedrock-only is simpler and cost-effective
FAQ
Can I use the same code for both Bedrock and Ollama?
Nearly. Both support OpenAI-compatible API formats. Ollama natively exposes /v1/chat/completions. Bedrock requires the AWS SDK but the message format is similar. The router pattern shown above abstracts the difference with minimal code.
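As an example, the request body Ollama's OpenAI-compatible endpoint accepts is plain chat-completions JSON. The sketch below only builds and inspects the payload, since actually sending it assumes a local Ollama is running:

```python
import json

OLLAMA_OPENAI_URL = "http://localhost:11434/v1/chat/completions"

# Standard chat-completions payload; Ollama accepts the same shape the
# OpenAI SDK produces, so client code ports with a base_url change.
payload = {
    "model": "llama3.1:8b",
    "messages": [{"role": "user", "content": "What is a VPC?"}],
    "stream": False,
}
body = json.dumps(payload)

# With Ollama running:
#   requests.post(OLLAMA_OPENAI_URL, data=body,
#                 headers={"Content-Type": "application/json"})
print(json.loads(body)["model"])  # → llama3.1:8b
```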
Does Bedrock support fine-tuning?
Yes, for select models (Llama, Titan). You upload training data to S3, configure a fine-tuning job, and Bedrock handles the training infrastructure. It is more limited than fine-tuning locally but requires zero GPU management.
Is Bedrock HIPAA compliant?
Yes. AWS Bedrock is HIPAA-eligible when used with a signed BAA (Business Associate Agreement). Enable CloudTrail logging, use VPC endpoints, and opt out of model improvement data sharing. For healthcare AI workloads on AWS, Bedrock is the compliant path.
Can Ollama handle multiple concurrent users?
Yes, but performance depends on hardware. On a 32GB RAM machine with an RTX 4070, Ollama handles 3-5 concurrent requests with Llama 3.1 8B at acceptable latency. For 10+ concurrent users, you need either a high-end GPU (A100, H100) or multiple Ollama instances behind a load balancer.
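Concurrency is tunable through environment variables in recent Ollama releases (OLLAMA_NUM_PARALLEL and OLLAMA_MAX_LOADED_MODELS; verify support in your installed version). A configuration sketch:

```bash
# Assumes a recent Ollama release that reads these variables
export OLLAMA_NUM_PARALLEL=4        # concurrent requests per loaded model
export OLLAMA_MAX_LOADED_MODELS=2   # models kept in memory at once
ollama serve
```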
Should I run Ollama on AWS EC2 instead of on-premise?
You can, but it partly defeats the purpose. EC2 GPU instances are expensive ($450-734/month for g5.xlarge). If data privacy is not the primary concern, Bedrock is cheaper and simpler than running Ollama on EC2. Self-hosted Ollama makes the most sense on your own hardware or cheap dedicated servers (Hetzner, OVH).
Conclusion
The Bedrock vs Ollama decision is not about which is better — it is about which combination fits your constraints. Most production setups benefit from both: Ollama handling the high-volume daily work at fixed cost, Bedrock providing frontier model quality when complexity demands it.
Start with your primary constraint — budget, privacy, or quality — and let that guide the split. The hybrid approach with the router pattern gives you flexibility to adjust as your needs evolve.
Need help architecting your AI infrastructure on AWS? View our AWS Infrastructure Setup service
Read next: How to Run DeepSeek R1 Locally with Ollama