Running Ollama on Unraid for Local AI Inference
Running AI models locally means no API costs, no data leaving your network, and no rate limits. Ollama makes this straightforward on Unraid, even without a GPU. Here’s how to set it up and what to expect from CPU-only inference.
Installing Ollama on Unraid
Ollama runs as a Docker container. In the Unraid Docker tab, add a new container with these settings:
Repository: ollama/ollama
Network: bridge
Port: default API port (mapped host → container)
Path: /mnt/user/appdata/ollama → /root/.ollama
Start the container. That’s the entire installation.
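If you prefer the command line (or want to see what the template resolves to), a roughly equivalent docker run looks like the sketch below. The container side of the port mapping is 11434, Ollama's default API port; <ollama-port> stands in for whichever host port you map to it, matching the placeholder used throughout this post.

docker run -d --name ollama \
  -p <ollama-port>:11434 \
  -v /mnt/user/appdata/ollama:/root/.ollama \
  ollama/ollama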
Pulling Models
Exec into the container and pull models:
docker exec -it ollama ollama pull qwen2.5:7b
docker exec -it ollama ollama pull qwen2.5-coder:7b
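Verify the pulls completed and see what ended up on disk:

docker exec -it ollama ollama list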
Model sizes on disk:
| Model | Parameters | Disk | RAM Required |
|---|---|---|---|
| qwen2.5:7b | 7B | 4.7 GB | ~6 GB |
| qwen2.5-coder:7b | 7B | 4.7 GB | ~6 GB |
| llama3.1:8b | 8B | 4.9 GB | ~6 GB |
| mistral:7b | 7B | 4.1 GB | ~6 GB |
| phi3:3.8b | 3.8B | 2.3 GB | ~4 GB |
| gemma2:9b | 9B | 5.4 GB | ~7 GB |
For CPU-only inference on Unraid, 7B parameter models are the sweet spot. They fit comfortably in RAM and produce usable output quality. Larger models (13B+) work but response times stretch into minutes.
API Usage
Ollama exposes a REST API on its default port. The generate endpoint handles single prompts:
curl http://localhost:<ollama-port>/api/generate -d '{
"model": "qwen2.5:7b",
"prompt": "Explain YAML anchors in 3 sentences",
"stream": false
}'
The chat endpoint handles multi-turn conversations:
curl http://localhost:<ollama-port>/api/chat -d '{
"model": "qwen2.5:7b",
"messages": [
{"role": "user", "content": "What is a Docker volume?"}
],
"stream": false
}'
Set stream: true (the default) for real-time token streaming. Set stream: false to get the complete response in one JSON object.
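With streaming enabled, the API returns newline-delimited JSON: one object per chunk, each carrying a partial response field, with a final object marked done: true. A minimal sketch for printing tokens as they arrive:

curl -sN http://localhost:<ollama-port>/api/generate -d '{
  "model": "qwen2.5:7b",
  "prompt": "Explain YAML anchors in 3 sentences"
}' | python3 -c '
import sys, json
for line in sys.stdin:
    line = line.strip()
    if not line:
        continue                  # skip blank lines
    chunk = json.loads(line)
    print(chunk.get("response", ""), end="", flush=True)  # partial token text
print()
'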
CPU Performance
On dual EPYC 7282 processors (32 cores total), a 7B model generates roughly 8-12 tokens per second. A typical response (200 tokens) takes 15-25 seconds. This is fine for:
- Quick summaries and explanations
- Code review comments
- Generating boilerplate text
- Answering simple questions
It’s too slow for:
- Interactive chat (noticeable lag between messages)
- Long-form content generation (multi-paragraph output takes minutes)
- Batch processing large datasets
- Anything requiring real-time responses
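To measure throughput on your own hardware rather than taking our numbers on faith, run a prompt with --verbose, which prints timing statistics (including prompt and eval token rates) after the response:

docker exec -it ollama ollama run qwen2.5:7b --verbose "Explain YAML anchors in 3 sentences"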
Wrapper Script
We use a shell wrapper that logs token usage and handles the API call:
#!/bin/bash
# Usage: ollama.sh "<prompt>" [model]
PROMPT="$1"
MODEL="${2:-qwen2.5:7b}"
# Build the JSON payload with python3 so quotes and newlines in the prompt are escaped safely
PAYLOAD=$(python3 -c 'import sys, json; print(json.dumps({"model": sys.argv[1], "prompt": sys.argv[2], "stream": False}))' "$MODEL" "$PROMPT")
RESPONSE=$(curl -s http://your-server-ip:<ollama-port>/api/generate -d "$PAYLOAD")
# Print the generated text; report token usage (prompt_eval_count / eval_count) on stderr
echo "$RESPONSE" | python3 -c "
import sys, json
data = json.load(sys.stdin)
print(data.get('response', 'No response'))
print('[tokens] prompt:', data.get('prompt_eval_count', '?'), 'completion:', data.get('eval_count', '?'), file=sys.stderr)
"
Usage:
bash ollama.sh "Summarize this error message: ..."
bash ollama.sh "Review this YAML" qwen2.5-coder:7b
The code model (qwen2.5-coder:7b) performs better on programming tasks — syntax awareness, code structure, language-specific patterns. The general model (qwen2.5:7b) is better for natural language tasks.
Model Selection
After testing several models for our use cases:
Best general-purpose (7B): qwen2.5:7b — Strong instruction following, good at summarization and analysis. Handles both English and structured output well.
Best for code: qwen2.5-coder:7b — Same family but fine-tuned on code. Better at understanding programming concepts, generating code snippets, and explaining technical errors.
Best small model: phi3:3.8b — Half the size, surprisingly capable for simple tasks. Good fallback when RAM is tight.
Skip: llama3.1:8b is popular but we found Qwen 2.5 more reliable for structured output and instruction following at the 7B size.
Integration with Claude Code
We use Ollama as the cheapest tier in a three-tier model system:
- Ollama (free) — Summaries, simple questions, status checks
- Haiku (cheap) — File searches, moderate analysis
- Opus (expensive) — Complex reasoning, code writing
The Claude Code CLAUDE.md instructions enforce this hierarchy. Before using an expensive model, workers check if Ollama can handle the task. For straightforward questions like “summarize this config file” or “what does this error mean,” Ollama is sufficient and costs nothing.
Network Access
A bare-metal Ollama install listens only on localhost by default. In the Docker setup, what matters is the port mapping: to reach the API from other containers or devices on your network, make sure the host side of the mapping binds to 0.0.0.0 (not 127.0.0.1).
Once accessible on the network, any device on your LAN (or Tailscale network) can hit the API:
curl http://your-server-ip:<ollama-port>/api/generate -d '{
"model": "qwen2.5:7b",
"prompt": "Hello",
"stream": false
}'
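For a quick reachability check that doesn't wait on a model to load, hit the tags endpoint, which simply lists the installed models:

curl http://your-server-ip:<ollama-port>/api/tags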
Resource Management
Ollama loads models into RAM on first request and keeps them loaded for 5 minutes of inactivity (configurable via OLLAMA_KEEP_ALIVE). On an Unraid server running 25+ Docker containers, memory management matters.
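To change that window, set OLLAMA_KEEP_ALIVE on the container, either as a template variable or via Extra Parameters. The 30m below is just an illustrative value; -1 keeps the model loaded indefinitely, 0 unloads it immediately after each request:

Extra Parameters: -e OLLAMA_KEEP_ALIVE=30m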
Set a memory limit on the Ollama container to prevent it from consuming all available RAM:
Extra Parameters: --memory=16g
16GB is comfortable for running one 7B model at a time. If you need to run multiple models simultaneously, increase accordingly.
Monitor with:
docker stats ollama --no-stream
GPU Acceleration
If you add a GPU to your Unraid server, Ollama automatically detects and uses it. NVIDIA GPUs with the Unraid NVIDIA plugin work out of the box:
Extra Parameters: --gpus all
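To confirm the container actually sees the card (assuming the NVIDIA runtime passed it through correctly), run nvidia-smi inside it:

docker exec -it ollama nvidia-smi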
With a GPU, performance jumps dramatically:
| Setup | Tokens/sec (7B) |
|---|---|
| CPU only (EPYC 7282 x2) | 8-12 |
| NVIDIA V100 32GB | 80-120 |
| NVIDIA A6000 48GB | 100-150 |
GPU inference also unlocks larger models (13B, 30B, 70B) with acceptable response times.
What We Use It For
In practice, our Ollama instance handles:
- Quick summaries of config files, log output, and documentation
- Status checks where we need a brief answer to a factual question
- Draft text for non-critical documentation and comments
- Token estimation and basic text analysis
Everything more complex goes to Haiku or Opus. The key is knowing which tasks can tolerate lower quality output. Ollama at 7B isn’t writing production code, but it’s summarizing error logs for free.
Resources
Running Ollama on Unraid requires a capable server with enough RAM for model inference. Here’s what we use: