
Adding V100 GPUs to an Unraid Server for LLM Inference

By Charles

We’re adding 3x NVIDIA V100 PCIe 32GB GPUs to a Dell R7525 Unraid server. This post covers the planning process: what fits, what doesn’t, power considerations, and realistic expectations for inference workloads.

Why V100 in 2026?

The V100 is two generations old (Volta, 2017). Newer options exist — A100, H100, L40S — but the V100 PCIe 32GB hits a price-performance sweet spot for homelab use:

GPU               VRAM    Price (Used)   FP16 TFLOPS
V100 PCIe 32GB    32 GB   ~$759          31.4
A100 PCIe 80GB    80 GB   ~$8,000+       77.9
P40               24 GB   ~$125          0.5 (no FP16)
A6000             48 GB   ~$3,500+       38.7

At $759 per card, three V100s cost $2,277 for 96GB of VRAM. That’s enough to run quantized 70B-parameter models (FP16 weights alone would need roughly 140 GB; 4-bit quantization brings a 70B model down to around 40 GB), models that perform near GPT-4 quality for many tasks.

The P40 at $125 is tempting, but its FP16 throughput is severely crippled (FP16 runs at a small fraction of FP32 speed), making it impractical for modern LLM inference. The A100 is superior but costs 10x more per card.
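
To make the price-performance claim concrete, here is a quick sketch that computes cost per GB of VRAM from the used-market prices in the table above (prices fluctuate, so treat the output as illustrative):

# Rough cost-per-GB-of-VRAM comparison using the used prices listed above.
cards = {
    "V100 PCIe 32GB": {"vram_gb": 32, "price_usd": 759},
    "A100 PCIe 80GB": {"vram_gb": 80, "price_usd": 8000},
    "P40":            {"vram_gb": 24, "price_usd": 125},
    "A6000":          {"vram_gb": 48, "price_usd": 3500},
}

for name, card in cards.items():
    print(f"{name:<16} ${card['price_usd'] / card['vram_gb']:.2f}/GB of VRAM")

print(f"3x V100: ${3 * 759:,} total for {3 * 32} GB")

The P40 actually wins on dollars per GB of VRAM; the FP16 column is what rules it out.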

Power Budget

The R7525 has dual 1400W platinum power supplies in redundant mode, giving 1,400W usable. Current system draw is approximately 480W (dual EPYC 7282 + 12 HDDs + 2 SSDs). Each V100 PCIe draws up to 250W under full load.

Current system:     ~480W
3x V100 (max load): 750W
Total peak:         1,230W
PSU capacity:       1,400W
Headroom:            170W (12%)

This is tight but workable. In practice, sustained inference doesn’t hit peak GPU power — typical draw is 180-220W per card. Real-world total would be closer to 1,050-1,100W.
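
The same arithmetic as a small script, handy for re-running once real wattages are measured at the wall (the numbers below are this post’s estimates, not measurements):

# Power budget estimate for the R7525 with 3x V100 PCIe.
PSU_USABLE_W = 1400    # dual 1400W PSUs in redundant mode
BASELINE_W = 480       # dual EPYC 7282 + 12 HDDs + 2 SSDs (estimated)
GPU_MAX_W = 250        # V100 PCIe board power limit
GPU_TYPICAL_W = 200    # typical sustained inference draw (180-220W range)
NUM_GPUS = 3

peak = BASELINE_W + NUM_GPUS * GPU_MAX_W
typical = BASELINE_W + NUM_GPUS * GPU_TYPICAL_W
print(f"Peak:    {peak}W, headroom {PSU_USABLE_W - peak}W "
      f"({100 * (PSU_USABLE_W - peak) / PSU_USABLE_W:.0f}%)")
print(f"Typical: {typical}W, headroom {PSU_USABLE_W - typical}W")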

What Models Fit

With 2x V100 (64GB VRAM) using tensor parallelism:

Model            Parameters   VRAM (4-bit)   Fits?
Llama 3.1 70B    70B          ~40 GB         Yes
Qwen 2.5 72B     72B          ~42 GB         Yes
Mixtral 8x7B     46.7B        ~28 GB         Yes
Llama 3.1 8B     8B           ~5 GB          Yes (single GPU)
Llama 3.1 405B   405B         ~240 GB        No

The third GPU (32GB standalone) can run a separate model via Ollama — a 32B parameter model for local development while the production pair serves a 70B model.
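
The VRAM column in the table is a rough 4-bit figure. A back-of-the-envelope estimator shows where those numbers come from; the 20% overhead allowance for KV cache and runtime buffers is an assumption, not a measurement:

def estimate_vram_gb(params_billion: float, bits_per_param: float = 4,
                     overhead: float = 0.20) -> float:
    """Weights at the given precision plus a flat overhead factor for
    KV cache, activations, and runtime buffers."""
    weights_gb = params_billion * bits_per_param / 8   # Gparams * bytes/param
    return weights_gb * (1 + overhead)

for name, params in [("Llama 3.1 8B", 8), ("Mixtral 8x7B", 46.7),
                     ("Llama 3.1 70B", 70), ("Llama 3.1 405B", 405)]:
    print(f"{name:<15} 4-bit ~{estimate_vram_gb(params, 4):4.0f} GB   "
          f"FP16 ~{estimate_vram_gb(params, 16):4.0f} GB")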

The PCIe Bandwidth Problem

V100 PCIe cards communicate at roughly 32 GB/s (bidirectional) over the PCIe 3.0 x16 bus. The V100 SXM2 variant uses NVLink at 300 GB/s, nearly 10x faster. This matters for tensor parallelism, where GPUs exchange data constantly during inference.

In practice, this means:

  • Single-GPU workloads perform identically on PCIe vs SXM2
  • Multi-GPU tensor parallel workloads see 15-30% lower throughput on PCIe
  • Pipeline parallelism (splitting model layers across GPUs) is less affected

For our use case — serving a 70B model across 2 GPUs — the PCIe bandwidth is acceptable. We’re not training, just running inference where the bottleneck is usually memory capacity, not bandwidth.
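
As a very rough sketch of what tensor parallelism actually moves over the bus, assume Megatron-style TP with two all-reduces per transformer layer and Llama 3.1 70B dimensions (hidden size 8,192, 80 layers), ignoring kernel-launch and synchronization latency:

# Estimated tensor-parallel traffic per decoded token (batch size 1, FP16).
HIDDEN_SIZE = 8192
NUM_LAYERS = 80
BYTES_PER_VALUE = 2        # FP16 activations
ALLREDUCES_PER_LAYER = 2   # one after attention, one after the MLP

bytes_per_token = HIDDEN_SIZE * BYTES_PER_VALUE * ALLREDUCES_PER_LAYER * NUM_LAYERS

for name, gb_per_s in [("PCIe 3.0 x16", 16), ("NVLink (SXM2)", 150)]:  # per-direction GB/s
    ms = bytes_per_token / (gb_per_s * 1e9) * 1e3
    print(f"{name:<14} ~{bytes_per_token / 1e6:.1f} MB/token, ~{ms:.3f} ms transfer")

Per-token transfer time is tiny on either interconnect; the multi-GPU penalty on PCIe likely comes from synchronization overhead and larger batch sizes rather than raw transfer time, which is why it shows up as a modest percentage rather than a 10x slowdown.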

Tensor Parallelism Gotcha

Tensor parallelism works cleanly only when the GPU count divides the model’s attention heads evenly, which in practice means 2, 4, or 8 GPUs; TP=3 isn’t an option for most models. With 3 GPUs, the options are:

  1. TP=2 on GPUs 0+1 + GPU 2 standalone (recommended)
  2. TP=1 with PP=3 across all three GPUs: pipeline parallelism adds complexity for marginal gain
  3. All 3 independent — each runs a smaller model (up to 32B each)

Option 1 is the practical choice: a 70B model on two GPUs for production, plus a separate model on the third for development or a different use case.
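
The head-count rule is easy to check programmatically; here is a minimal sketch (the helper name is made up for illustration) using Llama 3.1 70B’s configuration of 64 attention heads and 8 KV heads:

def valid_tp_sizes(num_heads: int, num_kv_heads: int, max_gpus: int) -> list[int]:
    """Tensor-parallel sizes where both attention heads and KV heads
    split evenly across GPUs."""
    return [tp for tp in range(1, max_gpus + 1)
            if num_heads % tp == 0 and num_kv_heads % tp == 0]

print(valid_tp_sizes(num_heads=64, num_kv_heads=8, max_gpus=3))   # -> [1, 2]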

Serving Framework: vLLM

For production inference, vLLM is the standard choice:

  • Continuous batching (handles concurrent requests efficiently)
  • PagedAttention (optimizes VRAM usage)
  • OpenAI-compatible API (drop-in replacement)
  • Tensor parallelism support out of the box

A vLLM container with --tensor-parallel-size 2 on GPUs 0 and 1 serves an OpenAI-compatible API on port 8000. Any application that speaks the OpenAI API format works without modification.
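
For example, anything written against the openai Python package only needs a different base_url. The port matches vLLM’s default; the model name below is an assumption standing in for whatever was passed to --model:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",   # vLLM's OpenAI-compatible endpoint
    api_key="not-needed",                  # ignored unless vLLM is started with --api-key
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-70B-Instruct",   # must match the served model name
    messages=[{"role": "user", "content": "Summarize this server build in one sentence."}],
    max_tokens=128,
)
print(response.choices[0].message.content)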

Expected performance with V100 TP=2:

  • 70B model: ~30-50 tokens/sec
  • Time to first token: ~200-400ms
  • Concurrent users: 10-20 without degradation

Revenue Potential

Three V100 GPUs can generate passive income through GPU rental platforms:

Platform     Pricing Model            Expected Revenue
Vast.ai      Per-GPU hourly rental    $0.20-0.30/GPU/hr
Custom API   Per-token pricing        Variable
Dedicated    Monthly lease            $200-400/mo

At $0.20/GPU/hr with 80% utilization across 3 GPUs (3 × $0.20 × 0.80 × 720 hours/month):

  • Monthly gross: ~$346
  • Power cost: ~$65/mo
  • Net profit: ~$281/mo
  • Break-even on hardware: ~8 months

This assumes the GPUs are rented most of the time. Utilization below 50% extends break-even significantly.
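
The break-even math as a quick script, using this post’s assumed rate, utilization, and power cost:

# Rental break-even estimate; all inputs are assumptions from this post.
HARDWARE_COST = 3 * 759       # three used V100s at $759 each
RATE_PER_GPU_HR = 0.20
NUM_GPUS = 3
HOURS_PER_MONTH = 720
POWER_COST_MONTHLY = 65

for utilization in (0.80, 0.50, 0.30):
    gross = NUM_GPUS * RATE_PER_GPU_HR * utilization * HOURS_PER_MONTH
    net = gross - POWER_COST_MONTHLY
    months = HARDWARE_COST / net if net > 0 else float("inf")
    print(f"{utilization:.0%} utilization: gross ${gross:.0f}/mo, "
          f"net ${net:.0f}/mo, break-even {months:.1f} months")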

Bonus: GPU Transcoding

Beyond inference, the V100’s NVENC encoder handles video transcoding. Tdarr with GPU workers can transcode H.264 to H.265 at 10-50x CPU speed — relevant for a media server with 174TB of content.

CPU transcoding on EPYC 7282: ~5 files/day
GPU transcoding on V100:      ~100+ files/day

On a library with thousands of remux files, GPU transcoding recovers storage space significantly faster.
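
Tdarr assembles its own ffmpeg commands through plugins, but the underlying operation looks roughly like this sketch (the file paths and quality setting are placeholders):

# Minimal H.264 -> H.265 transcode using NVENC via ffmpeg.
import subprocess

def transcode_hevc_nvenc(src: str, dst: str, cq: int = 24) -> None:
    subprocess.run([
        "ffmpeg",
        "-hwaccel", "cuda",      # decode on the GPU where possible
        "-i", src,
        "-c:v", "hevc_nvenc",    # NVENC H.265 encoder
        "-cq", str(cq),          # constant-quality target
        "-c:a", "copy",          # pass audio through untouched
        dst,
    ], check=True)

transcode_hevc_nvenc("input_remux.mkv", "output_hevc.mkv")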

Pre-Installation Prep

Before the GPUs arrive:

  • Verify Unraid NVIDIA Driver plugin is compatible with current kernel
  • Download vLLM Docker image
  • Create GPU monitoring script (a minimal sketch follows this list)
  • Document current power draw for baseline comparison
  • Research Vast.ai host onboarding process
  • Pre-download target models (70B model is ~40GB)
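
A minimal version of that monitoring script, polling nvidia-smi every 30 seconds (the interval and output format are arbitrary choices):

# Poll nvidia-smi and print one line per GPU.
import subprocess
import time

QUERY = ("index,name,utilization.gpu,memory.used,memory.total,"
         "power.draw,temperature.gpu")

def sample() -> None:
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    for line in out.strip().splitlines():
        idx, name, util, used, total, power, temp = [f.strip() for f in line.split(",")]
        print(f"GPU{idx} {name}: {util}% util, {used}/{total} MiB, {power} W, {temp} C")

while True:
    sample()
    time.sleep(30)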

Having everything ready means the GPUs go from unboxed to earning revenue in hours, not days.


Resources

The hardware and software mentioned in this post:

  • NVIDIA V100 PCIe 32GB — The GPU we’re installing. Look for used/refurbished units around $750.
  • Unraid OS — Server OS with built-in Docker, VM support, and NVIDIA GPU passthrough via plugin.
  • vLLM — Production inference server with tensor parallelism and OpenAI-compatible API. Free and open-source.