Running Ollama on Unraid for Local AI Inference
Running AI models locally means no API costs, no data leaving your network, and no rate limits. Ollama makes this straightforward on Unraid, even without a GPU. Here’s how to set it up and what to expect from CPU-only inference.
Installing Ollama on Unraid
Ollama runs as a Docker container. In the Unraid Docker tab, add a new container with these settings:
Repository: ollama/ollama
Network: bridge
Port: default API port (mapped host → container)
Path: /mnt/user/appdata/ollama → /root/.ollama
Start the container. That’s the entire installation.
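If you prefer the command line (or want to see what the template resolves to), a roughly equivalent docker run looks like the sketch below. The container side of the port mapping is 11434, Ollama's default API port; <ollama-port> stands in for whichever host port you map to it, matching the placeholder used throughout this post.

docker run -d --name ollama \
  -p <ollama-port>:11434 \
  -v /mnt/user/appdata/ollama:/root/.ollama \
  ollama/ollama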
Pulling Models
Exec into the container and pull models:
docker exec -it ollama ollama pull qwen2.5:7b
docker exec -it ollama ollama pull qwen2.5-coder:7b
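Verify the pulls completed and see what ended up on disk:

docker exec -it ollama ollama list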
Model sizes on disk:
| Model | Parameters | Disk | RAM Required |
|---|---|---|---|
| qwen2.5:7b | 7B | 4.7 GB | ~6 GB |
| qwen2.5-coder:7b | 7B | 4.7 GB | ~6 GB |
| llama3.1:8b | 8B | 4.9 GB | ~6 GB |
| mistral:7b | 7B | 4.1 GB | ~6 GB |
| phi3:3.8b | 3.8B | 2.3 GB | ~4 GB |
| gemma2:9b | 9B | 5.4 GB | ~7 GB |
For CPU-only inference on Unraid, 7B parameter models are the sweet spot. They fit comfortably in RAM and produce usable output quality. Larger models (13B+) work but response times stretch into minutes.
API Usage
Ollama exposes a REST API on its default port. The generate endpoint handles single prompts:
curl http://localhost:<ollama-port>/api/generate -d '{
"model": "qwen2.5:7b",
"prompt": "Explain YAML anchors in 3 sentences",
"stream": false
}'
The chat endpoint handles multi-turn conversations:
curl http://localhost:<ollama-port>/api/chat -d '{
"model": "qwen2.5:7b",
"messages": [
{"role": "user", "content": "What is a Docker volume?"}
],
"stream": false
}'
Set stream: true (the default) for real-time token streaming. Set stream: false to get the complete response in one JSON object.
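With streaming enabled, the API returns newline-delimited JSON: one object per chunk, each carrying a partial response field, with a final object marked done: true. A minimal sketch for printing tokens as they arrive:

curl -sN http://localhost:<ollama-port>/api/generate -d '{
  "model": "qwen2.5:7b",
  "prompt": "Explain YAML anchors in 3 sentences"
}' | python3 -c '
import sys, json
for line in sys.stdin:
    line = line.strip()
    if not line:
        continue                  # skip blank lines
    chunk = json.loads(line)
    print(chunk.get("response", ""), end="", flush=True)  # partial token text
print()
'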
CPU Performance
On dual EPYC 7282 processors (32 cores total), a 7B model generates roughly 8-12 tokens per second. A typical response (200 tokens) takes 15-25 seconds. This is fine for:
- Quick summaries and explanations
- Code review comments
- Generating boilerplate text
- Answering simple questions
It’s too slow for:
- Interactive chat (noticeable lag between messages)
- Long-form content generation (multi-paragraph output takes minutes)
- Batch processing large datasets
- Anything requiring real-time responses
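To measure throughput on your own hardware rather than taking our numbers on faith, run a prompt with --verbose, which prints timing statistics (including prompt and eval token rates) after the response:

docker exec -it ollama ollama run qwen2.5:7b --verbose "Explain YAML anchors in 3 sentences"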
Wrapper Script
We use a shell wrapper that logs token usage and handles the API call:
#!/bin/bash
# Usage: ollama.sh "<prompt>" [model]
PROMPT="$1"
MODEL="${2:-qwen2.5:7b}"
# Build the JSON payload with python3 so quotes and newlines in the prompt are escaped safely
PAYLOAD=$(python3 -c 'import sys, json; print(json.dumps({"model": sys.argv[1], "prompt": sys.argv[2], "stream": False}))' "$MODEL" "$PROMPT")
RESPONSE=$(curl -s http://your-server-ip:<ollama-port>/api/generate -d "$PAYLOAD")
# Print the generated text; report token usage (prompt_eval_count / eval_count) on stderr
echo "$RESPONSE" | python3 -c "
import sys, json
data = json.load(sys.stdin)
print(data.get('response', 'No response'))
print('[tokens] prompt:', data.get('prompt_eval_count', '?'), 'completion:', data.get('eval_count', '?'), file=sys.stderr)
"
Usage:
bash ollama.sh "Summarize this error message: ..."
bash ollama.sh "Review this YAML" qwen2.5-coder:7b
The code model (qwen2.5-coder:7b) performs better on programming tasks — syntax awareness, code structure, language-specific patterns. The general model (qwen2.5:7b) is better for natural language tasks.
Model Selection
After testing several models for our use cases:
Best general-purpose (7B): qwen2.5:7b — Strong instruction following, good at summarization and analysis. Handles both English and structured output well.
Best for code: qwen2.5-coder:7b — Same family but fine-tuned on code. Better at understanding programming concepts, generating code snippets, and explaining technical errors.
Best small model: phi3:3.8b — Half the size, surprisingly capable for simple tasks. Good fallback when RAM is tight.
Skip: llama3.1:8b is popular but we found Qwen 2.5 more reliable for structured output and instruction following at the 7B size.
Integration with Claude Code
We use Ollama as the cheapest tier in a three-tier model system:
- Ollama (free) — Summaries, simple questions, status checks
- Haiku (cheap) — File searches, moderate analysis
- Opus (expensive) — Complex reasoning, code writing
The Claude Code CLAUDE.md instructions enforce this hierarchy. Before using an expensive model, workers check if Ollama can handle the task. For straightforward questions like “summarize this config file” or “what does this error mean,” Ollama is sufficient and costs nothing.
Network Access
A bare-metal Ollama install listens only on localhost by default. In the Docker setup, what matters is the port mapping: to reach the API from other containers or devices on your network, make sure the host side of the mapping binds to 0.0.0.0 (not 127.0.0.1).
Once accessible on the network, any device on your LAN (or Tailscale network) can hit the API:
curl http://your-server-ip:<ollama-port>/api/generate -d '{
"model": "qwen2.5:7b",
"prompt": "Hello",
"stream": false
}'
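For a quick reachability check that doesn't wait on a model to load, hit the tags endpoint, which simply lists the installed models:

curl http://your-server-ip:<ollama-port>/api/tags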
Resource Management
Ollama loads models into RAM on first request and keeps them loaded for 5 minutes of inactivity (configurable via OLLAMA_KEEP_ALIVE). On an Unraid server running 25+ Docker containers, memory management matters.
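To change that window, set OLLAMA_KEEP_ALIVE on the container, either as a template variable or via Extra Parameters. The 30m below is just an illustrative value; -1 keeps the model loaded indefinitely, 0 unloads it immediately after each request:

Extra Parameters: -e OLLAMA_KEEP_ALIVE=30m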
Set a memory limit on the Ollama container to prevent it from consuming all available RAM:
Extra Parameters: --memory=16g
16GB is comfortable for running one 7B model at a time. If you need to run multiple models simultaneously, increase accordingly.
Monitor with:
docker stats ollama --no-stream
GPU Acceleration
If you add a GPU to your Unraid server, Ollama automatically detects and uses it. NVIDIA GPUs with the Unraid NVIDIA plugin work out of the box:
Extra Parameters: --gpus all
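To confirm the container actually sees the card (assuming the NVIDIA runtime passed it through correctly), run nvidia-smi inside it:

docker exec -it ollama nvidia-smi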
With a GPU, performance jumps dramatically:
| Setup | Tokens/sec (7B) |
|---|---|
| CPU only (EPYC 7282 x2) | 8-12 |
| NVIDIA V100 32GB | 80-120 |
| NVIDIA A6000 48GB | 100-150 |
GPU inference also unlocks larger models (13B, 30B, 70B) with acceptable response times.
What We Use It For
In practice, our Ollama instance handles:
- Quick summaries of config files, log output, and documentation
- Status checks where we need a brief answer to a factual question
- Draft text for non-critical documentation and comments
- Token estimation and basic text analysis
Everything more complex goes to Haiku or Opus. The key is knowing which tasks can tolerate lower quality output. Ollama at 7B isn’t writing production code, but it’s summarizing error logs for free.
Resources
Running Ollama on Unraid requires a capable server with enough RAM for model inference. Here’s what we use: