## Why Serverless AI Makes Sense
Most AI workloads are bursty. A DevOps team runs 10 queries during an incident, zero queries overnight, and a few during code reviews. Paying for a GPU instance running 24/7 to serve these sporadic requests is wasteful.
AWS Lambda + Bedrock changes the math: you pay only when a request executes. No idle costs. No server management. No capacity planning. The infrastructure scales from zero to thousands of concurrent requests automatically.
For teams that need AI capabilities without the infrastructure overhead, this is the simplest production-ready architecture available.
*Serverless AI eliminates idle costs: you pay only for the requests you actually make.*
## Architecture

```text
Client Request
      |
      v
API Gateway (REST or HTTP API)
      |
      v
AWS Lambda (Python runtime)
      |
      v
Amazon Bedrock (Claude / Llama / Mistral)
      |
      v
Response back to client
```
No EC2 instances. No containers. No GPU management. The entire stack is managed by AWS.
## Step 1: Create the Lambda Function

### Project Structure

```text
lambda-ai-api/
|- handler.py
|- requirements.txt
|- template.yaml   (SAM)
```

### The Lambda Handler
```python
# handler.py
import json
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

# Model configurations
MODELS = {
    "fast": {
        "id": "anthropic.claude-haiku-4-5-20251001-v1:0",
        "max_tokens": 1024
    },
    "standard": {
        "id": "anthropic.claude-sonnet-4-20250514-v1:0",
        "max_tokens": 4096
    },
    "reasoning": {
        "id": "meta.llama3-1-70b-instruct-v1:0",
        "max_tokens": 4096
    }
}

def lambda_handler(event, context):
    try:
        body = json.loads(event.get("body") or "{}")
        prompt = body.get("prompt")
        model_tier = body.get("model", "standard")
        system_prompt = body.get("system", "You are a helpful DevOps assistant.")

        if not prompt:
            return response(400, {"error": "prompt is required"})

        model_config = MODELS.get(model_tier, MODELS["standard"])

        # Call Bedrock. Note: this request body uses the Anthropic Messages
        # format, which only the Claude models accept. The "reasoning" tier
        # (Llama) needs Llama's own request format, or use the model-agnostic
        # Converse API instead of invoke_model.
        result = bedrock.invoke_model(
            modelId=model_config["id"],
            contentType="application/json",
            body=json.dumps({
                "anthropic_version": "bedrock-2023-05-31",
                "max_tokens": model_config["max_tokens"],
                "system": system_prompt,
                "messages": [
                    {"role": "user", "content": prompt}
                ]
            })
        )

        response_body = json.loads(result["body"].read())
        answer = response_body["content"][0]["text"]
        usage = response_body.get("usage", {})

        return response(200, {
            "answer": answer,
            "model": model_tier,
            "tokens": {
                "input": usage.get("input_tokens", 0),
                "output": usage.get("output_tokens", 0)
            }
        })
    except Exception as e:
        return response(500, {"error": str(e)})

def response(status_code, body):
    return {
        "statusCode": status_code,
        "headers": {
            "Content-Type": "application/json",
            "Access-Control-Allow-Origin": "*"
        },
        "body": json.dumps(body)
    }
```
## Step 2: IAM Role for Lambda

The Lambda function needs permission to invoke Bedrock models:
```hcl
resource "aws_iam_role" "lambda_ai" {
  name = "lambda-ai-api-role"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Action = "sts:AssumeRole"
      Effect = "Allow"
      Principal = {
        Service = "lambda.amazonaws.com"
      }
    }]
  })
}

resource "aws_iam_role_policy" "bedrock_access" {
  name = "bedrock-invoke"
  role = aws_iam_role.lambda_ai.id

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect = "Allow"
        Action = [
          "bedrock:InvokeModel",
          "bedrock:InvokeModelWithResponseStream"
        ]
        Resource = [
          "arn:aws:bedrock:us-east-1::foundation-model/anthropic.*",
          "arn:aws:bedrock:us-east-1::foundation-model/meta.*"
        ]
      },
      {
        Effect = "Allow"
        Action = [
          "logs:CreateLogGroup",
          "logs:CreateLogStream",
          "logs:PutLogEvents"
        ]
        Resource = "arn:aws:logs:*:*:*"
      }
    ]
  })
}
```
## Step 3: Deploy with SAM

```yaml
# template.yaml
AWSTemplateFormatVersion: '2010-09-09'
Transform: AWS::Serverless-2016-10-31

Globals:
  Function:
    Timeout: 120
    MemorySize: 512
    Runtime: python3.12

Resources:
  AIFunction:
    Type: AWS::Serverless::Function
    Properties:
      CodeUri: .
      Handler: handler.lambda_handler
      Role: !GetAtt LambdaRole.Arn
      Events:
        ApiEvent:
          Type: HttpApi
          Properties:
            Path: /ai
            Method: post

  LambdaRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal:
              Service: lambda.amazonaws.com
            Action: sts:AssumeRole
      Policies:
        - PolicyName: bedrock-access
          PolicyDocument:
            Version: '2012-10-17'
            Statement:
              - Effect: Allow
                Action:
                  - bedrock:InvokeModel
                Resource: '*'
              - Effect: Allow
                Action:
                  - logs:CreateLogGroup
                  - logs:CreateLogStream
                  - logs:PutLogEvents
                Resource: '*'

Outputs:
  ApiEndpoint:
    Value: !Sub "https://${ServerlessHttpApi}.execute-api.${AWS::Region}.amazonaws.com/ai"
```
Deploy:

```bash
sam build
sam deploy --guided
```
## Step 4: Test the API

```bash
# Standard query (Claude Sonnet)
curl -X POST https://your-api-id.execute-api.us-east-1.amazonaws.com/ai \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Write a Terraform module for an S3 bucket with versioning and encryption",
    "model": "standard"
  }'

# Fast query (Claude Haiku)
curl -X POST https://your-api-id.execute-api.us-east-1.amazonaws.com/ai \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "What is the kubectl command to check pod logs?",
    "model": "fast"
  }'
```
*The serverless AI API handles everything from quick lookups to complex code generation.*
## Step 5: Add Authentication

### API Key Authentication

API keys are a REST API feature, so this event uses the `Api` type rather than `HttpApi` (HTTP APIs do not support API keys):

```yaml
# Add to template.yaml
Resources:
  AIFunction:
    Properties:
      Events:
        ApiEvent:
          Type: Api
          Properties:
            Path: /ai
            Method: post
            Auth:
              ApiKeyRequired: true
```
### Cognito Authentication (for User-Based Access)

```hcl
resource "aws_cognito_user_pool" "ai_api" {
  name = "ai-api-users"

  password_policy {
    minimum_length    = 12
    require_lowercase = true
    require_numbers   = true
    require_symbols   = true
    require_uppercase = true
  }
}

# Assumes aws_apigatewayv2_api.ai_api and
# aws_cognito_user_pool_client.ai_api are defined elsewhere
resource "aws_apigatewayv2_authorizer" "cognito" {
  api_id           = aws_apigatewayv2_api.ai_api.id
  authorizer_type  = "JWT"
  identity_sources = ["$request.header.Authorization"]
  name             = "cognito-auth"

  jwt_configuration {
    audience = [aws_cognito_user_pool_client.ai_api.id]
    issuer   = "https://${aws_cognito_user_pool.ai_api.endpoint}"
  }
}
```
## Cost Analysis

Lambda + Bedrock pricing for a team of 5 engineers:

### Per-Request Cost Breakdown
| Component | Cost per Request |
|---|---|
| Lambda (512MB, 10 sec avg) | $0.000083 |
| API Gateway | $0.000001 |
| Bedrock (Claude Sonnet, avg 3K in / 2K out) | $0.039 |
| Total per request | ~$0.04 |
### Monthly Cost by Usage
| Monthly Requests | Lambda | API Gateway | Bedrock (Sonnet) | Total |
|---|---|---|---|---|
| 100 | $0.01 | $0.00 | $3.90 | $3.91 |
| 500 | $0.04 | $0.01 | $19.50 | $19.55 |
| 1,000 | $0.08 | $0.01 | $39.00 | $39.09 |
| 5,000 | $0.42 | $0.05 | $195.00 | $195.47 |
Key insight: Lambda and API Gateway costs are negligible. Bedrock token usage dominates the bill. The serverless architecture means you pay exactly proportional to usage — zero requests costs $0.
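The arithmetic behind these tables is easy to reproduce. A back-of-the-envelope sketch using the rates assumed above (Lambda at roughly $0.0000166667 per GB-second, Claude Sonnet at roughly $3 per million input tokens and $15 per million output tokens; re-check current AWS pricing before relying on the exact figures):

```python
# Back-of-the-envelope cost model for the tables above.
# Rates are assumptions based on public list prices.
LAMBDA_PER_GB_SECOND = 0.0000166667   # USD per GB-second
SONNET_INPUT_PER_MTOK = 3.0           # USD per million input tokens
SONNET_OUTPUT_PER_MTOK = 15.0         # USD per million output tokens

def lambda_cost(memory_mb: float, seconds: float) -> float:
    """Lambda compute cost for one invocation."""
    return (memory_mb / 1024) * seconds * LAMBDA_PER_GB_SECOND

def bedrock_cost(input_tokens: int, output_tokens: int) -> float:
    """Bedrock token cost for one Claude Sonnet request."""
    return (input_tokens / 1e6) * SONNET_INPUT_PER_MTOK \
         + (output_tokens / 1e6) * SONNET_OUTPUT_PER_MTOK

def per_request_cost() -> float:
    # 512 MB for 10 s, average 3K tokens in / 2K tokens out
    return lambda_cost(512, 10) + bedrock_cost(3000, 2000)

print(f"Lambda:  ${lambda_cost(512, 10):.6f}")      # ~$0.000083
print(f"Bedrock: ${bedrock_cost(3000, 2000):.3f}")  # $0.039
# Monthly figure excludes the ~$0.01 API Gateway charge in the table
print(f"Monthly at 1,000 requests: ${1000 * per_request_cost():.2f}")
```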
### Comparison with EC2 GPU
| Metric | Lambda + Bedrock | EC2 g5.xlarge + Ollama |
|---|---|---|
| 100 requests/month | $3.91 | $734 |
| 500 requests/month | $19.55 | $734 |
| 1,000 requests/month | $39.09 | $734 |
| Break-even | ~18,800 requests | Fixed |
Lambda + Bedrock is cheaper until approximately 18,800 requests per month. For most DevOps teams, that is far beyond normal usage.
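The break-even point falls out of the same numbers: divide the fixed EC2 bill by the per-request serverless cost. The $734/month g5.xlarge figure is region-dependent, so treat it as an estimate:

```python
# Break-even between pay-per-request (Lambda + Bedrock) and a fixed
# EC2 GPU instance, using the figures from the comparison table above.
PER_REQUEST = 0.0391   # USD: Lambda + API Gateway + Bedrock (Sonnet)
EC2_MONTHLY = 734.0    # USD: g5.xlarge running 24/7 (region-dependent)

break_even = EC2_MONTHLY / PER_REQUEST
print(f"Break-even: ~{break_even:,.0f} requests/month")
```

Below that volume, every request you *don't* make is money saved; a fixed instance costs $734 whether it serves one request or ten thousand.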
## Advanced Patterns

### Streaming Responses

For long AI responses, stream tokens instead of waiting for completion:
```python
def lambda_handler_stream(event, context):
    body = json.loads(event.get("body") or "{}")

    response = bedrock.invoke_model_with_response_stream(
        modelId="anthropic.claude-sonnet-4-20250514-v1:0",
        body=json.dumps({
            "anthropic_version": "bedrock-2023-05-31",
            "max_tokens": 4096,
            "messages": [{"role": "user", "content": body["prompt"]}]
        })
    )

    # Process the stream (requires a Lambda Function URL with
    # response streaming enabled)
    for stream_event in response["body"]:
        chunk = json.loads(stream_event["chunk"]["bytes"])
        if chunk["type"] == "content_block_delta":
            yield chunk["delta"]["text"]
```
### Caching Frequent Queries

Add DynamoDB caching to avoid repeat Bedrock calls:

```python
import hashlib
import time

import boto3

dynamodb = boto3.resource("dynamodb")
cache_table = dynamodb.Table("ai-cache")  # table with TTL enabled on "ttl"

def get_cached_or_query(prompt, model):
    cache_key = hashlib.sha256(f"{model}:{prompt}".encode()).hexdigest()

    # Check cache
    cached = cache_table.get_item(Key={"id": cache_key}).get("Item")
    if cached:
        return cached["answer"]

    # Query Bedrock (the invoke_model call from the handler above)
    answer = query_bedrock(prompt, model)

    # Cache for 24 hours; DynamoDB TTL expires the item
    cache_table.put_item(Item={
        "id": cache_key,
        "answer": answer,
        "ttl": int(time.time()) + 86400
    })
    return answer
```
### Rate Limiting

Prevent runaway costs with API Gateway throttling:

```yaml
Resources:
  AIApi:
    Type: AWS::Serverless::HttpApi
    Properties:
      DefaultRouteSettings:
        ThrottlingBurstLimit: 10
        ThrottlingRateLimit: 5
```
This limits the API to 5 requests per second with a burst of 10. Adjust based on your team size and budget.
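When the throttle kicks in, API Gateway returns HTTP 429, so clients should retry with exponential backoff rather than fail. A minimal, library-agnostic sketch; the `retry_with_backoff` helper is an illustrative name, and the `code == 429` check should be adapted to however your HTTP client reports status:

```python
import time

def retry_with_backoff(call, max_attempts=5, base_delay=1.0):
    """Retry `call` on throttling errors with exponential backoff.

    `call` should raise an exception carrying a `code` attribute of 429
    when throttled (urllib.error.HTTPError does this natively).
    """
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception as exc:
            throttled = getattr(exc, "code", None) == 429
            if not throttled or attempt == max_attempts - 1:
                raise
            # Sleep 1s, 2s, 4s, 8s ... between attempts
            time.sleep(base_delay * (2 ** attempt))
```

With `urllib`, for example, a throttled request surfaces as `urllib.error.HTTPError` with `code == 429`, so `retry_with_backoff(lambda: ask(prompt))` works unchanged.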
## Key Takeaways

- Lambda + Bedrock is the cheapest AI architecture for teams under roughly 18,800 requests/month
- Zero idle costs — pay only when requests execute
- Auto-scales from 0 to thousands of concurrent requests
- Lambda costs are negligible — Bedrock tokens dominate the bill
- Add DynamoDB caching to reduce repeat queries and costs
- Use model tiers (Haiku for fast, Sonnet for standard) to optimize spend
- Set API Gateway throttling to prevent cost overruns
- The entire stack deploys in 10 minutes with SAM
## FAQ

### What is the Lambda timeout for AI requests?
Lambda supports up to 15 minutes timeout. Most Bedrock requests complete in 5-30 seconds depending on model and output length. Set timeout to 120 seconds for safety. If responses consistently approach the timeout, the prompt is too complex or the output too long.
### Can Lambda handle streaming AI responses?
Yes, using Lambda Function URLs with response streaming. Standard API Gateway does not support streaming — it buffers the full response. For streaming, use Lambda Function URLs directly or CloudFront + Lambda@Edge.
### Is Lambda cold start a problem for AI APIs?
Lambda cold starts add 1-3 seconds on the first request after idle. Since Bedrock API calls themselves take 3-15 seconds, the cold start is a small percentage of total latency. For latency-sensitive production APIs, use Provisioned Concurrency ($0.015/GB-hour) to keep functions warm.
### Can I use my own models instead of Bedrock?
Not directly with this architecture. Lambda does not have GPU access. For custom models, use EC2 GPU instances or SageMaker endpoints. You can replace the Bedrock call with a SageMaker endpoint invocation using the same Lambda architecture.
### How do I monitor costs?
Enable Cost Explorer tags on Lambda and Bedrock. Set AWS Budget alerts at your monthly limit. The API handler returns token counts in the response — log these to CloudWatch for usage tracking. Review weekly alongside your AWS cost optimization process.
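For the token logging, CloudWatch Embedded Metric Format (EMF) is a convenient mechanism: print a specially shaped JSON line from the handler and CloudWatch turns it into real metrics with no extra API calls. A minimal sketch; the `ServerlessAI` namespace, the `Model` dimension, and the helper name are illustrative choices, not anything the stack above defines:

```python
import json
import time

def emit_token_metrics(model_tier: str, input_tokens: int, output_tokens: int) -> str:
    """Build and print a CloudWatch Embedded Metric Format log line.

    When printed from a Lambda handler, CloudWatch Logs ingests the line
    and automatically creates InputTokens/OutputTokens metrics under the
    (illustrative) "ServerlessAI" namespace, split by model tier.
    """
    record = {
        "_aws": {
            "Timestamp": int(time.time() * 1000),
            "CloudWatchMetrics": [{
                "Namespace": "ServerlessAI",
                "Dimensions": [["Model"]],
                "Metrics": [
                    {"Name": "InputTokens", "Unit": "Count"},
                    {"Name": "OutputTokens", "Unit": "Count"},
                ],
            }],
        },
        "Model": model_tier,
        "InputTokens": input_tokens,
        "OutputTokens": output_tokens,
    }
    line = json.dumps(record)
    print(line)  # stdout goes to CloudWatch Logs in Lambda
    return line
```

In the handler, call it right after parsing the usage block: `emit_token_metrics(model_tier, usage.get("input_tokens", 0), usage.get("output_tokens", 0))`.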
## Conclusion
Serverless AI is the simplest way to add AI capabilities to your tools and workflows. No GPUs to manage, no instances to monitor, no capacity to plan. Deploy the Lambda function, hit the API endpoint, get AI responses.
For teams that need AI occasionally — during incidents, code reviews, or documentation — this architecture costs single-digit dollars per month. It scales to thousands of requests without changes. And when nobody is using it, it costs exactly zero.
Need help building serverless AI APIs on AWS? View our AWS Infrastructure Setup service
Read next: AWS Bedrock vs Self-Hosted Ollama: When to Use Each