# Why Run AI Models Locally Instead of Using APIs
Every time a request goes to OpenAI, Anthropic, or Google’s APIs, data leaves the machine. For personal projects and experimentation, this is usually fine. For anything touching client data, internal tooling, or proprietary code — it is a problem.
Local AI models solve this. Ollama makes running large language models on local hardware surprisingly straightforward. On a Proxmox homelab — the same setup used for a Kubernetes lab — running private AI models costs nothing beyond electricity and provides complete data control.
This guide walks through setting up Ollama on a Proxmox VM, choosing models that run without a GPU, and using the local inference server as an API backend for tools and scripts.
*Ollama handles model downloads, quantization, and serving — run a full LLM in minutes*
## Hardware Reality: What Actually Runs Without a GPU
The common assumption is that running LLMs requires a high-end GPU. For production-scale inference, that is true. For homelab use, scripting automation, and personal tools, CPU-only inference on smaller quantized models is completely viable.
What determines performance without a GPU:
- **RAM is the constraint, not CPU cores.** Models are loaded into memory. A 7B parameter model in Q4 quantization uses ~4GB RAM. A 13B model uses ~8GB. A 70B model needs 40GB+ — not practical without a GPU.
- **Per-core speed and memory bandwidth matter more than core count.** Token generation is largely sequential; extra cores help only up to a point, and a CPU with fast cores and fast memory beats many slow cores for LLM inference.
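The RAM figures here follow a simple rule of thumb: footprint ≈ parameters × (bits per weight ÷ 8), plus roughly 20% for the KV cache and runtime buffers. A quick sketch of that arithmetic (the 20% overhead factor is a rough assumption, not an Ollama figure):

```bash
# Rough RAM estimate for a quantized model.
# params_b: parameter count in billions; bits: quantization width (4 for Q4, 8 for Q8).
# Adds ~20% for KV cache and runtime buffers -- a heuristic, not an exact figure.
estimate_ram_gb() {
  awk -v p="$1" -v b="$2" 'BEGIN { printf "%.1f\n", p * b / 8 * 1.2 }'
}

estimate_ram_gb 7 4    # Q4 7B model: prints 4.2 (GB), in line with the table below
estimate_ram_gb 70 4   # Q4 70B model: prints 42.0 (GB)
```

The same arithmetic explains why the 70B tier falls off a cliff for CPU-only homelabs: the weights alone outgrow typical host RAM.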
Practical hardware targets for CPU-only Ollama:
| Model | Parameters | RAM Required | Suitable Hardware |
|---|---|---|---|
| llama3.2:3b | 3B | ~2GB | Any modern machine with 8GB RAM |
| llama3.1:8b | 8B | ~5GB | 16GB RAM minimum |
| mistral:7b | 7B | ~4.5GB | 16GB RAM minimum |
| deepseek-r1:8b | 8B | ~5GB | 16GB RAM minimum |
| llama3.1:70b | 70B | ~40GB | Requires high-end hardware or GPU |
For a Proxmox homelab with 32GB RAM, an 8B model with 8GB allocated to the Ollama VM runs comfortably while other VMs (including Kubernetes nodes) continue running.
## Step 1 — Create the Proxmox VM
Create a dedicated VM for Ollama in the Proxmox web interface:
| Setting | Value |
|---|---|
| Name | ollama-server |
| OS | Ubuntu 22.04 LTS |
| CPU | 4 vCPUs |
| RAM | 8–16GB |
| Disk | 60GB SSD (models take 4–8GB each) |
| Network | vmbr0 bridge |
Allocate more disk if planning to run multiple models. A 7B and an 8B model together use ~10GB, plus the OS.
Install Ubuntu 22.04 LTS on the VM. Use a static IP to make it easy to reach from other VMs and the local network.
## Step 2 — Install Ollama
SSH into the new VM and install Ollama:
```bash
curl -fsSL https://ollama.com/install.sh | sh
```
The installer sets up Ollama as a systemd service that starts automatically on boot.
Verify the service is running:
```bash
systemctl status ollama
```
By default, Ollama binds to `127.0.0.1:11434`. To make it accessible from the local network (other machines, other VMs), configure it to listen on all interfaces:
```bash
# Create systemd override
sudo mkdir -p /etc/systemd/system/ollama.service.d
sudo tee /etc/systemd/system/ollama.service.d/override.conf <<EOF
[Service]
Environment="OLLAMA_HOST=0.0.0.0"
EOF
sudo systemctl daemon-reload
sudo systemctl restart ollama
```
Now Ollama listens on `0.0.0.0:11434` and is reachable from any machine on the local network at `http://VM_IP:11434`.
## Step 3 — Download and Run Models
Pull a model:
```bash
ollama pull llama3.1:8b
```
This downloads the model (4–5GB) to `/usr/share/ollama/.ollama/models/`. The download is cached, so subsequent starts do not re-download.
Run the model interactively:
```bash
ollama run llama3.1:8b
```
This opens a chat interface directly in the terminal. Type a message and press Enter to get a response.
For quick one-shot queries:
```bash
echo "Explain Kubernetes readiness probes in 2 sentences" | ollama run llama3.1:8b
```
For DevOps use cases, some useful models:
```bash
# General purpose, strong instruction following
ollama pull llama3.1:8b

# Strong at code and technical tasks
ollama pull deepseek-r1:8b

# Fast, smaller model for quick tasks
ollama pull llama3.2:3b

# Optimized for coding
ollama pull codellama:7b
```
## Step 4 — Use the REST API
Ollama exposes two HTTP APIs: its own native REST API (under `/api`) and an OpenAI-compatible one (under `/v1`). The latter is the most useful feature for automation: any tool that speaks the OpenAI API can be pointed at a local Ollama server instead.
Basic completion request:
```bash
curl http://192.168.1.110:11434/api/generate \
  -d '{
    "model": "llama3.1:8b",
    "prompt": "Write a bash script to check disk usage on all mounted volumes",
    "stream": false
  }'
```
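For scripting, the request body can be built by a small helper. A naive sketch follows; it does not JSON-escape the prompt, so it only suits simple prompts, and the function name is illustrative rather than an Ollama convention:

```bash
# Build a JSON body for Ollama's /api/generate endpoint.
# Naive: does not escape quotes or newlines inside the prompt.
build_generate_payload() {
  local model=$1; shift
  printf '{"model":"%s","prompt":"%s","stream":false}' "$model" "$*"
}

# Usage against the server from the example above:
# curl http://192.168.1.110:11434/api/generate \
#   -d "$(build_generate_payload llama3.1:8b "List three uses for tmux")"
build_generate_payload llama3.1:8b "hello"
```

For anything beyond trivial prompts, building the body with `jq` or a scripting language avoids the escaping problem entirely.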
OpenAI-compatible chat endpoint:
```bash
curl http://192.168.1.110:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.1:8b",
    "messages": [
      {"role": "user", "content": "What is a Kubernetes liveness probe?"}
    ]
  }'
```
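The response follows the OpenAI chat completions shape, so the assistant's text sits at `choices[0].message.content`. A small extraction helper, using `python3` since `jq` may not be installed (the function name is illustrative):

```bash
# Pull the assistant's reply out of an OpenAI-format chat completion response.
extract_reply() {
  python3 -c 'import json,sys; print(json.load(sys.stdin)["choices"][0]["message"]["content"])'
}

# Usage: curl http://192.168.1.110:11434/v1/chat/completions ... | extract_reply
echo '{"choices": [{"message": {"content": "A liveness probe restarts unhealthy containers."}}]}' | extract_reply
```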
The OpenAI-compatible endpoint means tools like Continue (VS Code extension), Open WebUI, and many others work out of the box with the local Ollama server.
*A single Proxmox host can run Kubernetes nodes, an Ollama AI server, and other services simultaneously*
## Step 5 — Add Open WebUI for a Chat Interface
Open WebUI is an open-source chat interface for Ollama with a design similar to ChatGPT. It runs as a Docker container and connects to the Ollama API.
```bash
# Install Docker on the Ollama VM or a separate VM, then:
docker run -d \
  --name open-webui \
  -p 3000:8080 \
  -e OLLAMA_BASE_URL=http://192.168.1.110:11434 \
  -v open-webui:/app/backend/data \
  --restart always \
  ghcr.io/open-webui/open-webui:main
```
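If Docker Compose is preferred for keeping homelab services declarative, the same container can be expressed as a compose file. This is a direct transliteration of the `docker run` flags above (the IP is carried over from the earlier examples):

```yaml
services:
  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    ports:
      - "3000:8080"
    environment:
      - OLLAMA_BASE_URL=http://192.168.1.110:11434
    volumes:
      - open-webui:/app/backend/data
    restart: always

volumes:
  open-webui:
```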
Access the interface at `http://VM_IP:3000`. Create an admin account on first launch.
From the web UI, switch between models, manage conversations, and configure system prompts — all running entirely on local hardware with no external API calls.
## Practical DevOps Use Cases
With a local LLM running 24/7 on the homelab, it can power automation that would otherwise incur API costs:
**Log analysis:** Pipe log output to the model to summarize errors or identify patterns:

```bash
journalctl -u nginx --since "1 hour ago" | \
  ollama run llama3.1:8b "Summarize any errors or warnings in these logs"
```
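Long log streams can overflow a small model's context window. One workaround is capping the piped input before it reaches the model; this sketch uses a rough 4-characters-per-token heuristic, and both the helper name and the heuristic are assumptions rather than Ollama behavior:

```bash
# Cap stdin to roughly max_tokens worth of text (~4 chars per token heuristic).
truncate_for_context() {
  local max_tokens=${1:-2000}
  head -c "$(( max_tokens * 4 ))"
}

# journalctl -u nginx --since "1 hour ago" | truncate_for_context 2000 | \
#   ollama run llama3.1:8b "Summarize any errors or warnings in these logs"
printf 'abcdefgh' | truncate_for_context 1   # keeps the first 4 characters
```

Using `tail` instead of `head -c` would keep the newest lines rather than the oldest; which end of the stream matters depends on the logs.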
**Bash script generation:** Describe what the script should do and get a working draft:

```bash
echo "Write a bash script that backs up all PostgreSQL databases to S3 with date-stamped filenames" | \
  ollama run codellama:7b
```
**Documentation drafts:** Generate README sections, runbook drafts, or incident summaries from bullet points.
**Code review:** Paste a function and ask for a security or logic review:

```bash
cat suspicious_function.py | ollama run llama3.1:8b "Review this code for security issues"
```
None of these requests leave the homelab network. No API costs. No data privacy concerns.
## Key Takeaways
- Ollama makes local LLM inference simple — one install command, one pull, done
- CPU-only inference is practical for 7–8B parameter models with 16GB RAM
- Running Ollama on a dedicated Proxmox VM keeps it always available and isolated
- The OpenAI-compatible API means most AI tools work without modification
- Local AI eliminates API costs and keeps sensitive data on your own hardware
- Open WebUI provides a polished chat interface with zero external dependencies
## FAQ
### Do I need a GPU to run Ollama?
No. Ollama runs on CPU-only hardware. A GPU speeds up inference dramatically (roughly 10–50x, depending on the card), but smaller models (3B–8B parameters) run acceptably on a fast CPU. Expect 5–20 tokens per second on a modern CPU, which is enough for scripting, automation, and interactive personal use.
### What is the difference between quantized and full-precision models?
Quantization reduces the precision of model weights from 32-bit or 16-bit floating point to 4-bit or 8-bit integers. This dramatically reduces memory usage (a 7B full-precision model needs ~28GB; Q4 quantization reduces it to ~4GB) with modest quality loss. Ollama defaults to Q4 quantization for most models, which is the right choice for CPU inference.
### Can I run Ollama on a Raspberry Pi or ARM hardware?
Yes. Ollama supports ARM64 architecture. A Raspberry Pi 5 with 8GB RAM can run 3B parameter models at a few tokens per second — slow but functional. For faster inference on ARM, look at Apple Silicon Macs, which run Ollama significantly faster than comparable x86 hardware due to unified memory architecture.
### How do I update models in Ollama?
Pull the model again to get the latest version:
```bash
ollama pull llama3.1:8b
```
If a newer version is available, it replaces the old one. To see all installed models, run `ollama list`.
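To refresh everything at once, the model names can be parsed out of `ollama list` and re-pulled. This sketch assumes the first whitespace-separated column of `ollama list` output (after the header row) is the model name:

```bash
# Print the model-name column from `ollama list` output, skipping the header row.
installed_models() {
  awk 'NR > 1 { print $1 }'
}

# Re-pull every installed model:
# ollama list | installed_models | xargs -n1 ollama pull

# Demonstration with sample `ollama list`-style output:
printf 'NAME            ID    SIZE\nllama3.1:8b     abc   4.9 GB\nmistral:7b      def   4.1 GB\n' | installed_models
```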
### Is data stored locally when using Ollama?
Yes. All model weights, conversation context, and inference happen entirely on the local machine. No data is sent to any external server. This is the core privacy advantage of local AI — useful for processing internal logs, code, or any sensitive information.
## Conclusion
A local AI homelab with Ollama and Proxmox gives full LLM capability at zero API cost, with complete data privacy. The setup takes less than an hour and provides a useful tool for scripting, documentation, and experimentation that keeps running indefinitely.
Start with a small model, use the API for automation, and add Open WebUI for a better interactive experience. The homelab that runs the Kubernetes cluster can run this simultaneously with minimal overhead.
Read next: How to Set Up a Free Kubernetes Cluster on Proxmox →
Interested in AI infrastructure setup for your team? View our DevOps services →