How to Fine-Tune a Small LLM on Your AWS Infrastructure Logs (SageMaker Guide)
Why Fine-Tune Instead of Prompting
General-purpose models like Claude and GPT-4 understand infrastructure concepts. But they do not understand YOUR infrastructure — your naming conventions, your alert patterns, your specific failure modes, your runbook terminology.
Fine-tuning a small model on your own data creates an AI that speaks your team’s language. When it sees payment-api-prod-503-spike, it knows that means the payment service behind the ALB in us-east-1, not a generic HTTP error explanation.
The trade-off: fine-tuning requires effort upfront. But the result is a model that handles your routine questions better than any general-purpose model — at a fraction of the inference cost.
Fine-tuning teaches a small model the patterns specific to your infrastructure
What You Need
- AWS account with SageMaker access
- 500-2000 training examples (question-answer pairs from your infrastructure)
- A base model (Llama 3.1 8B or Mistral 7B recommended)
- 2-4 hours for the fine-tuning job
- Budget: roughly $5-20 for a single LoRA training run on ml.g5.2xlarge
Step 1: Collect Training Data
The quality of your fine-tuned model depends entirely on training data quality. Collect from these sources:
CloudWatch Logs and Alerts
# Export recent CloudWatch Logs Insights results
# (GNU date shown; on macOS use: date -v-30d +%s)
aws logs start-query \
  --log-group-name "/aws/ecs/payment-api" \
  --start-time $(date -d '30 days ago' +%s) \
  --end-time $(date +%s) \
  --query-string 'fields @timestamp, @message | filter @message like /ERROR|WARN|Exception/ | limit 1000'

# start-query returns a queryId; fetch the output with:
# aws logs get-query-results --query-id <queryId>
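Once `get-query-results` returns rows, they can be turned into draft question-answer pairs for human review. A minimal sketch, assuming the nested field/value shape that call returns; the instruction template is an illustration, and the empty response stub is deliberately left for an engineer to fill in:

```python
# Turn CloudWatch Logs Insights result rows into draft training examples.
# Each row is a list of {"field": ..., "value": ...} dicts, mirroring the
# shape returned by `aws logs get-query-results`.

def results_to_drafts(results):
    drafts = []
    for row in results:
        fields = {f["field"]: f["value"] for f in row}
        message = fields.get("@message", "")
        if not message:
            continue  # skip rows without a log message
        drafts.append({
            "instruction": f"What does this log line from payment-api mean? {message}",
            "response": "",  # filled in manually during review
        })
    return drafts

sample = [
    [{"field": "@timestamp", "value": "2024-05-01 12:00:00"},
     {"field": "@message", "value": "ERROR: upstream connect timeout"}],
]
drafts = results_to_drafts(sample)
print(drafts[0]["instruction"])
```

The empty `response` field is the point: the model should learn your team's explanation of each pattern, not an auto-generated one.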
Incident Postmortems
Extract question-answer pairs from your existing postmortems:
{
  "instruction": "The payment API is returning 503 errors and CloudWatch shows ECS task count dropped from 4 to 1. What is the likely cause?",
  "response": "The ECS service likely failed a deployment. Check the ECS service events for task failures. Common causes: health check failing on the new container version, insufficient memory causing OOM kills, or missing environment variables in the new task definition. Rollback with: aws ecs update-service --cluster prod --service payment-api --force-new-deployment"
}
Runbook Documentation
Convert runbooks into QA format:
{
  "instruction": "How do I failover the RDS database to the standby in us-west-2?",
  "response": "1. Verify standby status: aws rds describe-db-instances --db-instance-identifier prod-postgres --query 'DBInstances[0].StatusInfos'. 2. Initiate failover: aws rds reboot-db-instance --db-instance-identifier prod-postgres --force-failover. 3. Monitor: watch the RDS events console, failover takes 60-120 seconds. 4. Verify application reconnection in CloudWatch application logs. 5. Update the incident channel with failover completion time."
}
Alert-to-Action Mappings
{
  "instruction": "Alert: CPU utilization above 90% on payment-api-prod for 15 minutes. What should I do?",
  "response": "This is likely a traffic spike or a resource leak. Steps: 1. Check if there is an ongoing traffic spike in the ALB request count metric. 2. If traffic is normal, check for memory leaks or stuck threads in the application logs. 3. Immediate relief: increase ECS desired count from 4 to 8 with aws ecs update-service --cluster prod --service payment-api --desired-count 8. 4. If CPU does not drop after scaling, the issue is per-container — check application profiling."
}
Target: 500-2000 Examples
| Data Source | Typical Examples | Quality |
|---|---|---|
| Incident postmortems | 50-100 | Highest (real scenarios) |
| Runbook QA pairs | 100-300 | High (verified procedures) |
| Alert-to-action mappings | 100-200 | High (tested responses) |
| Log pattern explanations | 200-500 | Medium (may need review) |
| General infra QA | 100-500 | Medium |
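Whatever the source, duplicates and half-finished pairs drag quality down. A quick cleaning pass is worth running before formatting; a minimal sketch, assuming the collected examples are a list of `{"instruction", "response"}` dicts:

```python
# Drop empty and duplicate instruction/response pairs before training.
def clean_examples(examples):
    seen = set()
    cleaned = []
    for ex in examples:
        instruction = ex.get("instruction", "").strip()
        response = ex.get("response", "").strip()
        if not instruction or not response:
            continue  # skip incomplete pairs
        key = instruction.lower()
        if key in seen:
            continue  # keep only the first answer to each question
        seen.add(key)
        cleaned.append({"instruction": instruction, "response": response})
    return cleaned

raw = [
    {"instruction": "How do I failover RDS?", "response": "Use --force-failover."},
    {"instruction": "How do I failover RDS?", "response": "Duplicate entry."},
    {"instruction": "Incomplete one", "response": ""},
]
print(len(clean_examples(raw)))  # 1
```

Keeping only the first answer per question is a simplification; when two answers conflict, have the team pick the correct one rather than letting the dedupe decide.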
Step 2: Format the Dataset
SageMaker expects JSONL format. Create a script to prepare your data:
# prepare_dataset.py
import json
import random

def format_for_training(examples):
    formatted = []
    for ex in examples:
        formatted.append({
            "messages": [
                {
                    "role": "system",
                    "content": "You are an infrastructure operations assistant for our production AWS environment. Answer based on our specific infrastructure, runbooks, and incident history."
                },
                {
                    "role": "user",
                    "content": ex["instruction"]
                },
                {
                    "role": "assistant",
                    "content": ex["response"]
                }
            ]
        })
    return formatted

# Load your collected examples
with open("raw_examples.json") as f:
    examples = json.load(f)

formatted = format_for_training(examples)

# Split 90/10 train/validation
random.shuffle(formatted)
split = int(len(formatted) * 0.9)
train = formatted[:split]
val = formatted[split:]

# Write JSONL files
with open("train.jsonl", "w") as f:
    for item in train:
        f.write(json.dumps(item) + "\n")

with open("val.jsonl", "w") as f:
    for item in val:
        f.write(json.dumps(item) + "\n")

print(f"Training: {len(train)} examples")
print(f"Validation: {len(val)} examples")
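Before uploading, it pays to verify that every line parses and follows the system/user/assistant chat schema, since one malformed line can fail the whole training job. A small check, sketched against an in-memory example (run it over your actual `train.jsonl` and `val.jsonl`):

```python
import json

REQUIRED_ROLES = ["system", "user", "assistant"]

def validate_jsonl_line(line):
    """True if the line parses and has non-empty messages in the expected role order."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        return False
    messages = record.get("messages", [])
    roles_ok = [m.get("role") for m in messages] == REQUIRED_ROLES
    return roles_ok and all(m.get("content") for m in messages)

good = json.dumps({"messages": [
    {"role": "system", "content": "You are an assistant."},
    {"role": "user", "content": "Why is the ALB returning 502s?"},
    {"role": "assistant", "content": "Check target health first."},
]})
print(validate_jsonl_line(good))        # True
print(validate_jsonl_line("not json"))  # False
```

To validate a file, iterate over its lines and report the line numbers where the check returns False.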
Upload to S3:
aws s3 cp train.jsonl s3://your-bucket/fine-tune/train.jsonl
aws s3 cp val.jsonl s3://your-bucket/fine-tune/val.jsonl
Step 3: Fine-Tune with SageMaker
Using the SageMaker JumpStart API
import sagemaker
from sagemaker.jumpstart.estimator import JumpStartEstimator

role = sagemaker.get_execution_role()

estimator = JumpStartEstimator(
    model_id="meta-textgeneration-llama-3-1-8b-instruct",
    model_version="*",
    role=role,
    instance_type="ml.g5.2xlarge",
    instance_count=1,
    # Llama models are gated; training fails without EULA acceptance
    environment={"accept_eula": "true"},
)

# Fine-tuning options are set as hyperparameters, not environment variables
estimator.set_hyperparameters(
    instruction_tuned="True",
    epoch="3",
    learning_rate="0.0002",
    lora_r="16",
    lora_alpha="32",
    per_device_train_batch_size="4",
    max_input_length="2048",
)

estimator.fit({
    "training": "s3://your-bucket/fine-tune/train.jsonl",
    "validation": "s3://your-bucket/fine-tune/val.jsonl"
})
Training Time and Cost
| Instance | Model | Training Time (1000 examples) | Cost |
|---|---|---|---|
| ml.g5.2xlarge | Llama 3.1 8B (LoRA) | 1-2 hours | $2.40-4.80 |
| ml.g5.12xlarge | Llama 3.1 8B (LoRA) | 30-60 min | $8.50-17.00 |
| ml.g5.2xlarge | Mistral 7B (LoRA) | 1-2 hours | $2.40-4.80 |
| ml.p4d.24xlarge | Llama 3.1 70B (LoRA) | 3-6 hours | $98-196 |
LoRA (Low-Rank Adaptation) is the key. Instead of updating all model weights (expensive, slow), LoRA freezes the base model and trains small low-rank adapter matrices alongside it. In practice this recovers most of full fine-tuning's quality at a small fraction of the compute and cost.
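The size gap is easy to see with rough numbers. A back-of-the-envelope sketch: `r=16` matches the `lora_r` hyperparameter used in the training job, but the layer count and Llama-style projection widths are illustrative assumptions, so treat the output as an order-of-magnitude estimate, not an exact count:

```python
# LoRA replaces updates to a d x k weight matrix W with two low-rank
# matrices A (d x r) and B (r x k): r * (d + k) trainable parameters.
def lora_params(d, k, r):
    return r * (d + k)

r = 16        # lora_r from the training job
layers = 32   # assumed Llama-style 8B depth
hidden = 4096 # assumed attention projection width
kv = 1024     # assumed grouped-query k/v projection width

per_layer = (
    lora_params(hidden, hidden, r) * 2  # q_proj and o_proj
    + lora_params(hidden, kv, r) * 2    # k_proj and v_proj
)
trainable = per_layer * layers
print(f"trainable adapter params: {trainable:,}")
print(f"fraction of an 8B model: {trainable / 8e9:.4%}")
```

Around 14 million trainable parameters against 8 billion frozen ones is why a LoRA run fits on a single ml.g5.2xlarge while full fine-tuning does not.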
Step 4: Deploy the Fine-Tuned Model
Deploy to SageMaker Endpoint
predictor = estimator.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.xlarge",
    endpoint_name="infra-assistant-v1"
)

# Test
response = predictor.predict({
    "inputs": "The payment API health check is failing after the latest deployment. What should I check?",
    "parameters": {
        "max_new_tokens": 512,
        "temperature": 0.3
    }
})
print(response[0]["generated_text"])
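An always-on endpoint has a real carrying cost, which is worth estimating before you leave it running. A quick sketch, assuming roughly $1.41/hour for ml.g5.xlarge real-time inference (an assumed rate; check current SageMaker pricing for your region):

```python
# Rough monthly cost of a 24/7 SageMaker real-time endpoint.
hourly = 1.41              # assumed ml.g5.xlarge rate; verify for your region
hours_per_month = 24 * 30  # approximate

monthly = hourly * hours_per_month
print(f"24/7 endpoint: ~${monthly:,.0f}/month")
```

At roughly a thousand dollars a month, a lightly used internal assistant is often cheaper to self-host, which is what the Ollama export below is for.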
Export to Ollama (Self-Hosted)
For running on your own infrastructure instead of paying for a SageMaker endpoint:
# Download the fine-tuned model artifacts from S3
aws s3 cp s3://your-bucket/fine-tune/output/model/ ./fine-tuned-model/ --recursive

# Convert to GGUF format using llama.cpp's conversion script
# (if the artifacts are a LoRA adapter, merge it into the base model first)
git clone https://github.com/ggerganov/llama.cpp && cd llama.cpp
python convert_hf_to_gguf.py ../fine-tuned-model/ --outtype f16 --outfile ../model-f16.gguf

# Build the quantize tool, then quantize for local inference
cmake -B build && cmake --build build --target llama-quantize
./build/bin/llama-quantize ../model-f16.gguf ../infra-assistant-q4.gguf q4_K_M
# Create Ollama Modelfile
cat << 'EOF' > Modelfile
FROM ./infra-assistant-q4.gguf
SYSTEM "You are an infrastructure operations assistant for our production AWS environment."
PARAMETER temperature 0.3
PARAMETER num_ctx 4096
EOF
ollama create infra-assistant -f Modelfile
ollama run infra-assistant
Now your fine-tuned model runs locally on Ollama with no per-token or endpoint charges — the only ongoing cost is the hardware it runs on.
Step 5: Evaluate and Iterate
Test Against Known Scenarios
Create a test set of 50 questions with expected answers from your team:
test_cases = [
    {
        "question": "ECS service payment-api shows 0 running tasks but desired is 4. What happened?",
        "expected_keywords": ["deployment", "health check", "task definition", "rollback"]
    },
    {
        "question": "RDS CPU at 95% but application queries look normal. What else could it be?",
        "expected_keywords": ["vacuum", "autovacuum", "replication lag", "long-running query"]
    }
]
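One low-tech way to grade answers against these cases is keyword coverage: the fraction of expected terms that appear in the model's answer. A sketch of that idea — a fast first filter, not a substitute for human review:

```python
# Score an answer by how many expected keywords it mentions.
def keyword_coverage(answer, expected_keywords):
    """Fraction of expected keywords that appear in the answer (case-insensitive)."""
    answer = answer.lower()
    return sum(kw.lower() in answer for kw in expected_keywords) / len(expected_keywords)

answer = ("The deployment likely failed its health check; inspect the task "
          "definition and consider a rollback.")
print(keyword_coverage(answer, ["deployment", "health check", "task definition", "rollback"]))  # 1.0
```

Loop it over `test_cases` for both the base and fine-tuned models, and have a human spot-check any answer that scores high but reads wrong — substring matching rewards mentioning the right words, not using them correctly.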
Measure Accuracy
Compare your fine-tuned model against the base model:
| Question Type | Base Llama 3.1 8B | Fine-Tuned Model |
|---|---|---|
| Generic AWS questions | 85% accurate | 85% accurate |
| Your specific infrastructure | 30% accurate | 80% accurate |
| Alert-to-action responses | 20% accurate | 75% accurate |
| Runbook procedures | 10% accurate | 85% accurate |
The fine-tuned model is dramatically better on YOUR infrastructure questions while maintaining general knowledge.
When to Re-Train
- After major infrastructure changes (new services, architecture shifts)
- Every 3-6 months to incorporate new incident learnings
- When accuracy drops on recent scenarios
- After adding new runbooks or procedures
Each re-training run costs $5-20 with LoRA. This is cheap enough to do monthly.
LoRA fine-tuning costs $5-20 per run — cheap enough to iterate frequently
Real Results
After fine-tuning on 1,200 examples from a production e-learning platform:
Before (base Llama 3.1 8B):

Q: "The Tutor MFE is showing a white screen after deployment"
A: "This could be a JavaScript error. Check the browser console for errors…" (Generic, not helpful)

After (fine-tuned):

Q: "The Tutor MFE is showing a white screen after deployment"
A: "Check the MFE build output in the Tutor Docker logs with tutor local logs --follow mfe. Common causes: 1. Environment variable MFE_CONFIG_API_URL not set correctly after the domain change. 2. The caddy container needs a restart after the MFE rebuild: tutor local restart caddy. 3. If using a custom theme, verify the theme compiled successfully in the build step." (Specific, actionable, correct)
The difference is not subtle. The fine-tuned model knows your platform.
Key Takeaways
- Fine-tuning creates a model that understands YOUR infrastructure, not generic cloud concepts
- 500-2000 training examples is sufficient for significant improvement
- LoRA fine-tuning costs $5-20 per run on SageMaker — cheap enough to iterate
- Export to Ollama for zero ongoing inference costs on your own hardware
- Incident postmortems and runbooks are the highest-quality training data sources
- Re-train every 3-6 months to incorporate new learnings
- Fine-tuned 8B models can outperform general-purpose 70B models on your specific domain
FAQ
How many training examples do I need?
500 is the minimum for noticeable improvement. 1000-2000 is the sweet spot. Beyond 2000, improvements diminish unless you are covering new domains. Quality matters more than quantity — 500 excellent examples from real incidents beat 5000 synthetic examples.
Will fine-tuning make the model forget general knowledge?
With LoRA, no. LoRA trains a small adapter on top of the base model, preserving general knowledge while adding domain-specific capabilities. Full fine-tuning can cause catastrophic forgetting, which is why LoRA is the recommended approach.
Can I fine-tune on Bedrock instead of SageMaker?
Yes. Bedrock supports fine-tuning for select models (Llama, Titan). Upload training data to S3, configure a fine-tuning job in the Bedrock console, and deploy the customized model. It is simpler than SageMaker but offers less control over hyperparameters.
Is my training data secure on SageMaker?
Yes. Training data stays in your S3 bucket and is processed on dedicated SageMaker instances within your VPC. AWS does not use customer training data to improve its own models. For additional security, use SageMaker VPC mode to keep all traffic private.
Can I share the fine-tuned model across teams?
Yes. Export the LoRA adapter weights (typically 50-200MB) and share via S3 or your artifact registry. Each team member can load the adapter on top of the base model locally. The adapter is much smaller than the full model, making distribution easy.
Conclusion
Fine-tuning is not for every team. If your AI usage is light and general-purpose, Bedrock or Ollama with base models is sufficient.
But if your team answers the same infrastructure questions repeatedly, if your on-call engineers need instant access to tribal knowledge, and if your runbooks contain hundreds of procedures that a base model cannot know — fine-tuning is worth the investment.
The setup takes a day. The training costs under $20. The result is an AI that speaks your infrastructure’s language.
Need help building a custom AI model for your infrastructure? View our Local AI Deployment service
Read next: Build an AI Slack Bot That Answers From Your Runbooks