AI in Infrastructure Work: What Actually Helps
AI coding assistants have been mainstream for two years. Opinions among DevOps engineers range from “I use it for everything” to “I don’t trust it with production code.” Both positions are partially right.
The honest assessment: AI tools genuinely accelerate specific tasks in infrastructure work. They also produce confidently wrong output for other tasks, and using that output in production without understanding it is how incidents happen.
This guide covers where AI tools provide real value in DevOps and SRE work, where they fall short, and how to integrate them without adding risk.
AI tools accelerate infrastructure work — but understanding the output before applying it is non-negotiable
Where AI Tools Genuinely Help
Boilerplate and Scaffolding
Infrastructure code has a lot of repetitive structure. A Terraform module for an EC2 instance with security groups, IAM role, and CloudWatch alarms follows a predictable pattern. Writing this from scratch is tedious. Asking an AI tool to scaffold it is fast, and the output is usually a correct starting point that needs review rather than a complete rewrite.
The same applies to:
- Kubernetes manifests (Deployments, Services, ConfigMaps)
- Ansible tasks for common operations (installing packages, configuring services)
- GitHub Actions or GitLab CI pipeline stages
- Dockerfile scaffolding for standard application patterns
The key distinction: scaffolding generates a starting point. It does not generate production-ready code. Reviewing and understanding every line before applying it is mandatory.
Documentation and Runbooks
Writing runbooks is important work that engineers frequently defer because it takes time. AI tools are good at generating first drafts from bullet points.
Give it the key steps and it produces readable prose:
Prompt: Write a runbook section for rotating RDS database credentials.
Steps: 1) Create new password in Secrets Manager, 2) Update RDS instance,
3) Update application environment variables, 4) Verify connectivity,
5) Delete old password
The output needs review and customization for the specific environment, but a first draft in two minutes versus thirty is a genuine productivity gain. The same applies to post-incident reports, architecture decision records, and technical documentation.
Explaining Unfamiliar Code and Error Messages
When debugging an unfamiliar service or a cryptic error message, AI tools are useful for getting oriented quickly.
Paste a complex regex, an iptables rule, or an AWS error code and get a plain-language explanation. This does not replace understanding the underlying system — but it accelerates the path to understanding it.
# Example: paste an error like this to get an explanation
Error: InvalidParameterCombination:
The parameter 'MultiAZ' cannot be used with the parameter 'AvailabilityZone'
AI tools explain these consistently well because common error messages appear frequently in their training data.
Writing and Debugging Shell Scripts
For bash scripting, AI tools produce reliable output for standard operations — parsing log files, batch processing files, conditional logic, AWS CLI commands. The quality drops on scripts with complex edge cases or unusual system interactions.
Pattern: write the script with AI assistance, test it in a development environment, review the logic before running it on production systems. Never pipe AI-generated scripts directly to bash without reading them first.
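The kind of script AI tools handle well looks like this — a minimal, hypothetical example of log parsing with awk (the sample log lines are embedded here for illustration; a real script would read a file path):

```shell
#!/usr/bin/env bash
# Hypothetical example of the log-parsing scripts AI tools generate
# reliably: count HTTP 5xx responses per status code in a
# combined-format access log. Sample lines are embedded for illustration.
set -euo pipefail

log=$(cat <<'EOF'
10.0.0.1 - - [01/Jan/2025:00:00:01 +0000] "GET / HTTP/1.1" 200 512
10.0.0.2 - - [01/Jan/2025:00:00:02 +0000] "GET /api HTTP/1.1" 502 0
10.0.0.3 - - [01/Jan/2025:00:00:03 +0000] "GET /api HTTP/1.1" 502 0
10.0.0.4 - - [01/Jan/2025:00:00:04 +0000] "POST /x HTTP/1.1" 503 0
EOF
)

# Field 9 in combined log format is the HTTP status code.
result=$(printf '%s\n' "$log" \
  | awk '$9 ~ /^5[0-9][0-9]$/ { n[$9]++ }
         END { for (c in n) printf "%s %d\n", c, n[c] }' \
  | sort)

printf '%s\n' "$result"
```

Even for a script this simple, the review step matters: confirm the field index matches your actual log format before trusting the counts.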
Always test AI-generated scripts in a dev environment — confident-sounding output can still be wrong
Where AI Tools Fail in Infrastructure Work
Anything Requiring Current Knowledge
AI models have training cutoffs. AWS adds new services, changes API behavior, and deprecates old options regularly. An AI tool asked about AWS features from 2024–2025 may have outdated or incorrect information.
Examples where this causes problems:
- Specific IAM policy syntax that changed
- New Kubernetes admission controller requirements
- Updated Terraform provider argument names
- Recent CVEs and their remediation
Rule: For anything security-related, version-specific, or recently introduced — check the official documentation. Do not trust AI output as authoritative on current AWS behavior.
Multi-System Interactions and State
AI tools reason about individual components well but struggle with interactions between systems. An AI can write a correct Terraform resource definition for an RDS instance and a correct security group, but may not correctly wire the security group rules to account for specific VPC routing and NACLs in an existing environment.
The more context a task requires about the existing environment — current subnet configuration, existing security group rules, running services, network topology — the less reliable AI-generated output becomes.
Rule: For tasks that require understanding of existing infrastructure state, write the code manually or provide explicit context about the environment in the prompt.
Production Troubleshooting Under Pressure
During an active incident, AI tools can help explain concepts or suggest diagnostic commands. They cannot replace the judgment of an engineer who understands the specific system architecture, what changed recently, and what the monitoring data actually means.
The danger during incidents: an AI tool confidently suggests a fix that sounds plausible but is wrong for the specific situation. Applying it under time pressure without understanding it can make a bad situation worse.
Rule: During incidents, use AI tools for quick reference lookups (how to read a specific metric, what a flag does) — not for generating remediation steps. Human judgment on production systems during incidents is not optional.
Practical Integration Patterns
Local AI for Sensitive Work
For tasks involving internal code, client infrastructure details, or sensitive configuration, sending queries to cloud AI APIs is a data privacy concern. Running a local Ollama server on a homelab provides comparable assistance without data leaving the network.
For infrastructure work on client accounts, this distinction matters. Many organizations have policies against sending internal code to third-party AI services.
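A minimal sketch of what querying a local Ollama server looks like — the host, model name, and example prompt are assumptions to adjust for your setup, and the actual request is left commented out since it requires a running server:

```shell
#!/usr/bin/env bash
# Sketch: query a local Ollama server so prompts about internal
# infrastructure never leave the network. OLLAMA_HOST and MODEL
# defaults are assumptions; override them for your environment.
set -euo pipefail

OLLAMA_HOST="${OLLAMA_HOST:-http://localhost:11434}"
MODEL="${MODEL:-llama3}"

prompt="Explain this security group rule: ingress tcp 5432 from 10.0.0.0/8"

# Ollama's generate endpoint takes a JSON body with model and prompt.
payload=$(printf '{"model":"%s","prompt":"%s","stream":false}' \
  "$MODEL" "$prompt")

printf '%s\n' "$payload"
# Requires a running Ollama server:
# curl -s "$OLLAMA_HOST/api/generate" -d "$payload"
```

The same pattern works from editor integrations that support custom endpoints, which keeps the workflow familiar while satisfying data-handling policies.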
AI-Assisted Code Review Checklist
Before applying any AI-generated infrastructure code:
- Read every line — do not apply code you cannot explain
- Check resource naming matches your conventions
- Verify IAM permissions are minimal (principle of least privilege)
- Confirm region and account context is correct
- Test in a non-production environment first
- For Terraform: run terraform plan and review the diff before apply
- For Kubernetes: apply to a staging namespace first
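Parts of this checklist can be automated. A minimal sketch of flagging one common issue — wildcard IAM permissions — in generated Terraform before human review (the policy snippet is embedded for illustration; a real check would scan your *.tf files):

```shell
#!/usr/bin/env bash
# Sketch: flag wildcard IAM actions/resources in AI-generated Terraform
# so they get extra scrutiny during review. The generated snippet is
# embedded here for illustration only.
set -euo pipefail

generated=$(cat <<'EOF'
resource "aws_iam_policy" "example" {
  policy = jsonencode({
    Statement = [{
      Effect   = "Allow"
      Action   = "s3:*"
      Resource = "*"
    }]
  })
}
EOF
)

# Wildcard actions or resources deserve a closer look before apply.
findings=$(printf '%s\n' "$generated" | grep -nE '"\*"|:\*"' || true)

if [ -n "$findings" ]; then
  echo "review needed: wildcard permissions found"
  printf '%s\n' "$findings"
fi
```

A check like this does not replace reading the code — it only makes sure the riskiest lines are impossible to miss.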
Prompt Engineering for Infrastructure Tasks
Specific prompts produce better output than generic ones. Context about the environment, constraints, and specific requirements dramatically improves output quality.
Less effective:
Write Terraform for an S3 bucket
More effective:
Write a Terraform resource for an S3 bucket with:
- Server-side encryption using KMS (key ARN will be provided as variable)
- Versioning enabled
- Public access blocked
- Lifecycle rule: move to IA after 30 days, Glacier after 90 days
- Access logging to a separate bucket named var.log_bucket
- Tags: Environment and Team from variables
The second prompt gets output that is 80% ready to use. The first gets generic output that needs significant customization.
The Right Mental Model
AI tools are junior engineers with broad knowledge, no context about the specific environment, and no ability to detect when they are wrong. They produce plausible output — not necessarily correct output.
The productivity gain is real when used correctly: faster scaffolding, better documentation, quicker orientation in unfamiliar territory. The risk is real when used incorrectly: applying generated infrastructure code without understanding it.
The mental model that works: AI generates a first draft, the engineer reviews and takes ownership. The engineer is responsible for what gets applied to production, regardless of what generated it.
This applies to all infrastructure tooling — the AWS cost optimization work and disaster recovery planning still require human judgment. AI tools can help write the scripts, but the strategy and the decisions are the engineer’s responsibility.
Key Takeaways
- AI tools genuinely accelerate scaffolding, documentation, and explaining unfamiliar code
- They fail at tasks requiring current knowledge, environment context, and multi-system reasoning
- Never apply AI-generated infrastructure code without reading and understanding it
- Use local AI (Ollama) for tasks involving sensitive or internal data
- Specific, contextual prompts produce significantly better infrastructure output
- The engineer is responsible for production systems — AI is a tool, not a decision-maker
FAQ
Which AI tool is best for DevOps and infrastructure work?
For code completion in an IDE, GitHub Copilot and Cursor are the most commonly used. For conversational assistance and longer tasks, Claude and GPT-4 produce strong results for infrastructure work. For tasks with sensitive data, local models via Ollama are the right choice. Use whichever fits your workflow — the underlying models are more similar than the marketing suggests.
Can AI tools write production Terraform safely?
AI tools can write syntactically correct Terraform that follows common patterns. Whether it is production-safe depends entirely on review quality. The code needs to be read and understood before use. The bigger risk is not syntax errors — it is subtle IAM permission issues, missing security configurations, or incorrect resource relationships that look plausible but have problems.
Will AI tools replace DevOps engineers?
Infrastructure work requires judgment about tradeoffs, understanding of existing systems, incident response under pressure, and ownership of production systems. These are not tasks that current AI tools perform reliably. The more likely outcome is that AI tools make individual engineers more productive, similar to how IDEs and automation changed the work without replacing engineers.
How do I use AI tools for Kubernetes YAML generation?
Kubernetes YAML is well-represented in AI training data and AI tools generate it reliably for standard workloads (Deployments, Services, ConfigMaps, Ingress). Always validate generated manifests with kubectl apply --dry-run=client before applying. For complex configurations (custom admission webhooks, operator CRDs), manually written YAML with official documentation is more reliable.
What are the security risks of using AI coding tools?
The main risks: sending internal code or configuration to third-party APIs (data privacy), applying generated code that contains subtle security misconfigurations (IAM policies that are too permissive, missing encryption settings), and using outdated security practices from the model’s training data. Mitigations: use local AI for sensitive work, review all generated code for security implications, and verify security-critical configurations against current AWS documentation.
Conclusion
AI tools are genuine productivity tools for infrastructure work — not replacements for engineering judgment. The engineers who use them most effectively treat them as fast research assistants and first-draft generators, not as authoritative sources of correct infrastructure code.
Use them for the tasks where they help. Be skeptical for the tasks where they fail. Always understand what gets applied to production.
Read next: Ollama + Proxmox: Build a Private AI Homelab on Old Hardware →
Looking for help with your cloud infrastructure? View our DevOps services →