How to Fine-Tune a Small LLM on Your AWS Infrastructure Logs (SageMaker Guide)
Why Fine-Tune Instead of Prompting
General-purpose models like Claude and GPT-4 understand infrastructure concepts. But they do not understand YOUR infrastructure — your naming conventions, your alert patterns, your specific failure modes, your runbook terminology.
Fine-tuning a small model on your own data creates an AI that speaks your team’s language. When it sees payment-api-prod-503-spike, it knows that means the payment service behind the ALB in us-east-1, not a generic HTTP error explanation.
The trade-off: fine-tuning requires effort upfront. But the result is a model that handles your routine questions better than any general-purpose model — at a fraction of the inference cost.
Fine-tuning teaches a small model the patterns specific to your infrastructure
What You Need
- AWS account with SageMaker access
- 500-2000 training examples (question-answer pairs from your infrastructure)
- A base model (Llama 3.1 8B or Mistral 7B recommended)
- 2-4 hours for the fine-tuning job
- Budget: roughly $5-20 for a single LoRA training run on ml.g5.2xlarge
Step 1: Collect Training Data
The quality of your fine-tuned model depends entirely on training data quality. Collect from these sources:
CloudWatch Logs and Alerts
# Export recent CloudWatch Logs Insights results
# (GNU date shown; on macOS use: date -v-30d +%s)
aws logs start-query \
  --log-group-name "/aws/ecs/payment-api" \
  --start-time $(date -d '30 days ago' +%s) \
  --end-time $(date +%s) \
  --query-string 'fields @timestamp, @message | filter @message like /ERROR|WARN|Exception/ | limit 1000'

# start-query returns a queryId; fetch the output with:
# aws logs get-query-results --query-id <queryId>
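Once `get-query-results` returns rows, they can be turned into draft question-answer pairs for human review. A minimal sketch, assuming the nested field/value shape that call returns; the instruction template is an illustration, and the empty response stub is deliberately left for an engineer to fill in:

```python
# Turn CloudWatch Logs Insights result rows into draft training examples.
# Each row is a list of {"field": ..., "value": ...} dicts, mirroring the
# shape returned by `aws logs get-query-results`.

def results_to_drafts(results):
    drafts = []
    for row in results:
        fields = {f["field"]: f["value"] for f in row}
        message = fields.get("@message", "")
        if not message:
            continue  # skip rows without a log message
        drafts.append({
            "instruction": f"What does this log line from payment-api mean? {message}",
            "response": "",  # filled in manually during review
        })
    return drafts

sample = [
    [{"field": "@timestamp", "value": "2024-05-01 12:00:00"},
     {"field": "@message", "value": "ERROR: upstream connect timeout"}],
]
drafts = results_to_drafts(sample)
print(drafts[0]["instruction"])
```

The empty `response` field is the point: the model should learn your team's explanation of each pattern, not an auto-generated one.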
Incident Postmortems
Extract question-answer pairs from your existing postmortems:
{
  "instruction": "The payment API is returning 503 errors and CloudWatch shows ECS task count dropped from 4 to 1. What is the likely cause?",
  "response": "The ECS service likely failed a deployment. Check the ECS service events for task failures. Common causes: health check failing on the new container version, insufficient memory causing OOM kills, or missing environment variables in the new task definition. Rollback with: aws ecs update-service --cluster prod --service payment-api --force-new-deployment"
}
Runbook Documentation
Convert runbooks into QA format:
{
  "instruction": "How do I failover the RDS database to the standby in us-west-2?",
  "response": "1. Verify standby status: aws rds describe-db-instances --db-instance-identifier prod-postgres --query 'DBInstances[0].StatusInfos'. 2. Initiate failover: aws rds reboot-db-instance --db-instance-identifier prod-postgres --force-failover. 3. Monitor: watch the RDS events console, failover takes 60-120 seconds. 4. Verify application reconnection in CloudWatch application logs. 5. Update the incident channel with failover completion time."
}
Alert-to-Action Mappings
{
  "instruction": "Alert: CPU utilization above 90% on payment-api-prod for 15 minutes. What should I do?",
  "response": "This is likely a traffic spike or a resource leak. Steps: 1. Check if there is an ongoing traffic spike in the ALB request count metric. 2. If traffic is normal, check for memory leaks or stuck threads in the application logs. 3. Immediate relief: increase ECS desired count from 4 to 8 with aws ecs update-service --cluster prod --service payment-api --desired-count 8. 4. If CPU does not drop after scaling, the issue is per-container — check application profiling."
}
Target: 500-2000 Examples
| Data Source | Typical Examples | Quality |
|---|---|---|
| Incident postmortems | 50-100 | Highest (real scenarios) |
| Runbook QA pairs | 100-300 | High (verified procedures) |
| Alert-to-action mappings | 100-200 | High (tested responses) |
| Log pattern explanations | 200-500 | Medium (may need review) |
| General infra QA | 100-500 | Medium |
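Whatever the source, duplicates and half-finished pairs drag quality down. A quick cleaning pass is worth running before formatting; a minimal sketch, assuming the collected examples are a list of `{"instruction", "response"}` dicts:

```python
# Drop empty and duplicate instruction/response pairs before training.
def clean_examples(examples):
    seen = set()
    cleaned = []
    for ex in examples:
        instruction = ex.get("instruction", "").strip()
        response = ex.get("response", "").strip()
        if not instruction or not response:
            continue  # skip incomplete pairs
        key = instruction.lower()
        if key in seen:
            continue  # keep only the first answer to each question
        seen.add(key)
        cleaned.append({"instruction": instruction, "response": response})
    return cleaned

raw = [
    {"instruction": "How do I failover RDS?", "response": "Use --force-failover."},
    {"instruction": "How do I failover RDS?", "response": "Duplicate entry."},
    {"instruction": "Incomplete one", "response": ""},
]
print(len(clean_examples(raw)))  # 1
```

Keeping only the first answer per question is a simplification; when two answers conflict, have the team pick the correct one rather than letting the dedupe decide.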
Step 2: Format the Dataset
SageMaker expects JSONL format. Create a script to prepare your data:
# prepare_dataset.py
import json
import random

def format_for_training(examples):
    formatted = []
    for ex in examples:
        formatted.append({
            "messages": [
                {
                    "role": "system",
                    "content": "You are an infrastructure operations assistant for our production AWS environment. Answer based on our specific infrastructure, runbooks, and incident history."
                },
                {
                    "role": "user",
                    "content": ex["instruction"]
                },
                {
                    "role": "assistant",
                    "content": ex["response"]
                }
            ]
        })
    return formatted

# Load your collected examples
with open("raw_examples.json") as f:
    examples = json.load(f)

formatted = format_for_training(examples)

# Split 90/10 train/validation
random.shuffle(formatted)
split = int(len(formatted) * 0.9)
train = formatted[:split]
val = formatted[split:]

# Write JSONL files
with open("train.jsonl", "w") as f:
    for item in train:
        f.write(json.dumps(item) + "\n")

with open("val.jsonl", "w") as f:
    for item in val:
        f.write(json.dumps(item) + "\n")

print(f"Training: {len(train)} examples")
print(f"Validation: {len(val)} examples")
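Before uploading, it pays to verify that every line parses and follows the system/user/assistant chat schema, since one malformed line can fail the whole training job. A small check, sketched against an in-memory example (run it over your actual `train.jsonl` and `val.jsonl`):

```python
import json

REQUIRED_ROLES = ["system", "user", "assistant"]

def validate_jsonl_line(line):
    """True if the line parses and has non-empty messages in the expected role order."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        return False
    messages = record.get("messages", [])
    roles_ok = [m.get("role") for m in messages] == REQUIRED_ROLES
    return roles_ok and all(m.get("content") for m in messages)

good = json.dumps({"messages": [
    {"role": "system", "content": "You are an assistant."},
    {"role": "user", "content": "Why is the ALB returning 502s?"},
    {"role": "assistant", "content": "Check target health first."},
]})
print(validate_jsonl_line(good))        # True
print(validate_jsonl_line("not json"))  # False
```

To validate a file, iterate over its lines and report the line numbers where the check returns False.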
Upload to S3:
aws s3 cp train.jsonl s3://your-bucket/fine-tune/train.jsonl
aws s3 cp val.jsonl s3://your-bucket/fine-tune/val.jsonl
Step 3: Fine-Tune with SageMaker
Using the SageMaker JumpStart API
import sagemaker
from sagemaker.jumpstart.estimator import JumpStartEstimator

role = sagemaker.get_execution_role()

estimator = JumpStartEstimator(
    model_id="meta-textgeneration-llama-3-1-8b-instruct",
    model_version="*",
    role=role,
    instance_type="ml.g5.2xlarge",
    instance_count=1,
    # Llama models are gated; training fails without EULA acceptance
    environment={"accept_eula": "true"},
)

# Fine-tuning options are set as hyperparameters, not environment variables
estimator.set_hyperparameters(
    instruction_tuned="True",
    epoch="3",
    learning_rate="0.0002",
    lora_r="16",
    lora_alpha="32",
    per_device_train_batch_size="4",
    max_input_length="2048",
)

estimator.fit({
    "training": "s3://your-bucket/fine-tune/train.jsonl",
    "validation": "s3://your-bucket/fine-tune/val.jsonl"
})
Training Time and Cost
| Instance | Model | Training Time (1000 examples) | Cost |
|---|---|---|---|
| ml.g5.2xlarge | Llama 3.1 8B (LoRA) | 1-2 hours | $2.40-4.80 |
| ml.g5.12xlarge | Llama 3.1 8B (LoRA) | 30-60 min | $8.50-17.00 |
| ml.g5.2xlarge | Mistral 7B (LoRA) | 1-2 hours | $2.40-4.80 |
| ml.p4d.24xlarge | Llama 3.1 70B (LoRA) | 3-6 hours | $98-196 |
LoRA (Low-Rank Adaptation) is the key. Instead of updating all model weights (expensive, slow), LoRA freezes the base model and trains small low-rank adapter matrices alongside it. In practice this recovers most of full fine-tuning's quality at a small fraction of the compute and cost.
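The size gap is easy to see with rough numbers. A back-of-the-envelope sketch: `r=16` matches the `lora_r` hyperparameter used in the training job, but the layer count and Llama-style projection widths are illustrative assumptions, so treat the output as an order-of-magnitude estimate, not an exact count:

```python
# LoRA replaces updates to a d x k weight matrix W with two low-rank
# matrices A (d x r) and B (r x k): r * (d + k) trainable parameters.
def lora_params(d, k, r):
    return r * (d + k)

r = 16        # lora_r from the training job
layers = 32   # assumed Llama-style 8B depth
hidden = 4096 # assumed attention projection width
kv = 1024     # assumed grouped-query k/v projection width

per_layer = (
    lora_params(hidden, hidden, r) * 2  # q_proj and o_proj
    + lora_params(hidden, kv, r) * 2    # k_proj and v_proj
)
trainable = per_layer * layers
print(f"trainable adapter params: {trainable:,}")
print(f"fraction of an 8B model: {trainable / 8e9:.4%}")
```

Around 14 million trainable parameters against 8 billion frozen ones is why a LoRA run fits on a single ml.g5.2xlarge while full fine-tuning does not.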
Step 4: Deploy the Fine-Tuned Model
Deploy to SageMaker Endpoint
predictor = estimator.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.xlarge",
    endpoint_name="infra-assistant-v1"
)

# Test
response = predictor.predict({
    "inputs": "The payment API health check is failing after the latest deployment. What should I check?",
    "parameters": {
        "max_new_tokens": 512,
        "temperature": 0.3
    }
})
print(response[0]["generated_text"])
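An always-on endpoint has a real carrying cost, which is worth estimating before you leave it running. A quick sketch, assuming roughly $1.41/hour for ml.g5.xlarge real-time inference (an assumed rate; check current SageMaker pricing for your region):

```python
# Rough monthly cost of a 24/7 SageMaker real-time endpoint.
hourly = 1.41              # assumed ml.g5.xlarge rate; verify for your region
hours_per_month = 24 * 30  # approximate

monthly = hourly * hours_per_month
print(f"24/7 endpoint: ~${monthly:,.0f}/month")
```

At roughly a thousand dollars a month, a lightly used internal assistant is often cheaper to self-host, which is what the Ollama export below is for.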
Export to Ollama (Self-Hosted)
For running on your own infrastructure instead of paying for a SageMaker endpoint:
# Download the fine-tuned model artifacts from S3
aws s3 cp s3://your-bucket/fine-tune/output/model/ ./fine-tuned-model/ --recursive

# Convert to GGUF format using llama.cpp's conversion script
# (if the artifacts are a LoRA adapter, merge it into the base model first)
git clone https://github.com/ggerganov/llama.cpp && cd llama.cpp
python convert_hf_to_gguf.py ../fine-tuned-model/ --outtype f16 --outfile ../model-f16.gguf

# Build the quantize tool, then quantize for local inference
cmake -B build && cmake --build build --target llama-quantize
./build/bin/llama-quantize ../model-f16.gguf ../infra-assistant-q4.gguf q4_K_M
# Create Ollama Modelfile
cat << 'EOF' > Modelfile
FROM ./infra-assistant-q4.gguf
SYSTEM "You are an infrastructure operations assistant for our production AWS environment."
PARAMETER temperature 0.3
PARAMETER num_ctx 4096
EOF
ollama create infra-assistant -f Modelfile
ollama run infra-assistant
Now your fine-tuned model runs locally on Ollama with no per-token or endpoint charges — the only ongoing cost is the hardware it runs on.
Step 5: Evaluate and Iterate
Test Against Known Scenarios
Create a test set of 50 questions with expected answers from your team:
test_cases = [
    {
        "question": "ECS service payment-api shows 0 running tasks but desired is 4. What happened?",
        "expected_keywords": ["deployment", "health check", "task definition", "rollback"]
    },
    {
        "question": "RDS CPU at 95% but application queries look normal. What else could it be?",
        "expected_keywords": ["vacuum", "autovacuum", "replication lag", "long-running query"]
    }
]
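One low-tech way to grade answers against these cases is keyword coverage: the fraction of expected terms that appear in the model's answer. A sketch of that idea — a fast first filter, not a substitute for human review:

```python
# Score an answer by how many expected keywords it mentions.
def keyword_coverage(answer, expected_keywords):
    """Fraction of expected keywords that appear in the answer (case-insensitive)."""
    answer = answer.lower()
    return sum(kw.lower() in answer for kw in expected_keywords) / len(expected_keywords)

answer = ("The deployment likely failed its health check; inspect the task "
          "definition and consider a rollback.")
print(keyword_coverage(answer, ["deployment", "health check", "task definition", "rollback"]))  # 1.0
```

Loop it over `test_cases` for both the base and fine-tuned models, and have a human spot-check any answer that scores high but reads wrong — substring matching rewards mentioning the right words, not using them correctly.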
Measure Accuracy
Compare your fine-tuned model against the base model:
| Question Type | Base Llama 3.1 8B | Fine-Tuned Model |
|---|---|---|
| Generic AWS questions | 85% accurate | 85% accurate |
| Your specific infrastructure | 30% accurate | 80% accurate |
| Alert-to-action responses | 20% accurate | 75% accurate |
| Runbook procedures | 10% accurate | 85% accurate |
The fine-tuned model is dramatically better on YOUR infrastructure questions while maintaining general knowledge.
When to Re-Train
- After major infrastructure changes (new services, architecture shifts)
- Every 3-6 months to incorporate new incident learnings
- When accuracy drops on recent scenarios
- After adding new runbooks or procedures
Each re-training run costs $5-20 with LoRA. This is cheap enough to do monthly.
LoRA fine-tuning costs $5-20 per run — cheap enough to iterate frequently
Real Results
After fine-tuning on 1,200 examples from a production e-learning platform:
Before (base Llama 3.1 8B):

Q: "The Tutor MFE is showing a white screen after deployment"
A: "This could be a JavaScript error. Check the browser console for errors…" (Generic, not helpful)

After (fine-tuned):

Q: "The Tutor MFE is showing a white screen after deployment"
A: "Check the MFE build output in the Tutor Docker logs with tutor local logs --follow mfe. Common causes: 1. Environment variable MFE_CONFIG_API_URL not set correctly after the domain change. 2. The caddy container needs a restart after the MFE rebuild: tutor local restart caddy. 3. If using a custom theme, verify the theme compiled successfully in the build step." (Specific, actionable, correct)
The difference is not subtle. The fine-tuned model knows your platform.
Key Takeaways
- Fine-tuning creates a model that understands YOUR infrastructure, not generic cloud concepts
- 500-2000 training examples is sufficient for significant improvement
- LoRA fine-tuning costs $5-20 per run on SageMaker — cheap enough to iterate
- Export to Ollama for zero ongoing inference costs on your own hardware
- Incident postmortems and runbooks are the highest-quality training data sources
- Re-train every 3-6 months to incorporate new learnings
- Fine-tuned 8B models can outperform general-purpose 70B models on your specific domain
FAQ
How many training examples do I need?
500 is the minimum for noticeable improvement. 1000-2000 is the sweet spot. Beyond 2000, improvements diminish unless you are covering new domains. Quality matters more than quantity — 500 excellent examples from real incidents beat 5000 synthetic examples.
Will fine-tuning make the model forget general knowledge?
With LoRA, no. LoRA trains a small adapter on top of the base model, preserving general knowledge while adding domain-specific capabilities. Full fine-tuning can cause catastrophic forgetting, which is why LoRA is the recommended approach.
Can I fine-tune on Bedrock instead of SageMaker?
Yes. Bedrock supports fine-tuning for select models (Llama, Titan). Upload training data to S3, configure a fine-tuning job in the Bedrock console, and deploy the customized model. It is simpler than SageMaker but offers less control over hyperparameters.
Is my training data secure on SageMaker?
Yes. Training data stays in your S3 bucket and is processed on dedicated SageMaker instances within your VPC. AWS does not use customer training data to improve its own models. For additional security, use SageMaker VPC mode to keep all traffic private.
Can I share the fine-tuned model across teams?
Yes. Export the LoRA adapter weights (typically 50-200MB) and share via S3 or your artifact registry. Each team member can load the adapter on top of the base model locally. The adapter is much smaller than the full model, making distribution easy.
Conclusion
Fine-tuning is not for every team. If your AI usage is light and general-purpose, Bedrock or Ollama with base models is sufficient.
But if your team answers the same infrastructure questions repeatedly, if your on-call engineers need instant access to tribal knowledge, and if your runbooks contain hundreds of procedures that a base model cannot know — fine-tuning is worth the investment.
The setup takes a day. The training costs under $20. The result is an AI that speaks your infrastructure’s language.
Need help building a custom AI model for your infrastructure? View our Local AI Deployment service
Read next: Build an AI Slack Bot That Answers From Your Runbooks