
Adding V100 GPUs to an Unraid Server for LLM Inference

By Charles

We’re adding 3x NVIDIA V100 PCIe 32GB GPUs to a Dell R7525 Unraid server. This post covers the planning process: what fits, what doesn’t, power considerations, and realistic expectations for inference workloads.

Why V100 in 2026?

The V100 is two generations old (Volta, 2017). Newer options exist — A100, H100, L40S — but the V100 PCIe 32GB hits a price-performance sweet spot for homelab use:

GPU               VRAM    Price (Used)   FP16 TFLOPS
V100 PCIe 32GB    32 GB   ~$759          31.4
A100 PCIe 80GB    80 GB   ~$8,000+       77.9
P40               24 GB   ~$125          0.5 (no FP16)
A6000             48 GB   ~$3,500+       38.7

At $759 per card, three V100s cost $2,277 for 96GB of VRAM. That’s enough to run quantized 70B-parameter models (FP16 weights alone would need roughly 140 GB; 4-bit quantization brings a 70B model down to around 40 GB), models that perform near GPT-4 quality for many tasks.

The P40 at $125 is tempting, but its FP16 throughput is severely crippled (FP16 runs at a small fraction of FP32 speed), making it impractical for modern LLM inference. The A100 is superior but costs 10x more per card.
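
To make the price-performance claim concrete, here is a quick sketch that computes cost per GB of VRAM from the used-market prices in the table above (prices fluctuate, so treat the output as illustrative):

# Rough cost-per-GB-of-VRAM comparison using the used prices listed above.
cards = {
    "V100 PCIe 32GB": {"vram_gb": 32, "price_usd": 759},
    "A100 PCIe 80GB": {"vram_gb": 80, "price_usd": 8000},
    "P40":            {"vram_gb": 24, "price_usd": 125},
    "A6000":          {"vram_gb": 48, "price_usd": 3500},
}

for name, card in cards.items():
    print(f"{name:<16} ${card['price_usd'] / card['vram_gb']:.2f}/GB of VRAM")

print(f"3x V100: ${3 * 759:,} total for {3 * 32} GB")

The P40 actually wins on dollars per GB of VRAM; the FP16 column is what rules it out.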

Power Budget

The R7525 has dual 1400W platinum power supplies in redundant mode, giving 1,400W usable. Current system draw is approximately 480W (dual EPYC 7282 + 12 HDDs + 2 SSDs). Each V100 PCIe draws up to 250W under full load.

Current system:     ~480W
3x V100 (max load): 750W
Total peak:         1,230W
PSU capacity:       1,400W
Headroom:            170W (12%)

This is tight but workable. In practice, sustained inference doesn’t hit peak GPU power — typical draw is 180-220W per card. Real-world total would be closer to 1,050-1,100W.
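
The same arithmetic as a small script, handy for re-running once real wattages are measured at the wall (the numbers below are this post’s estimates, not measurements):

# Power budget estimate for the R7525 with 3x V100 PCIe.
PSU_USABLE_W = 1400    # dual 1400W PSUs in redundant mode
BASELINE_W = 480       # dual EPYC 7282 + 12 HDDs + 2 SSDs (estimated)
GPU_MAX_W = 250        # V100 PCIe board power limit
GPU_TYPICAL_W = 200    # typical sustained inference draw (180-220W range)
NUM_GPUS = 3

peak = BASELINE_W + NUM_GPUS * GPU_MAX_W
typical = BASELINE_W + NUM_GPUS * GPU_TYPICAL_W
print(f"Peak:    {peak}W, headroom {PSU_USABLE_W - peak}W "
      f"({100 * (PSU_USABLE_W - peak) / PSU_USABLE_W:.0f}%)")
print(f"Typical: {typical}W, headroom {PSU_USABLE_W - typical}W")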

What Models Fit

With 2x V100 (64GB VRAM) using tensor parallelism:

Model            Parameters   VRAM (4-bit)   Fits?
Llama 3.1 70B    70B          ~40 GB         Yes
Qwen 2.5 72B     72B          ~42 GB         Yes
Mixtral 8x7B     46.7B        ~28 GB         Yes
Llama 3.1 8B     8B           ~5 GB          Yes (single GPU)
Llama 3.1 405B   405B         ~240 GB        No

The third GPU (32GB standalone) can run a separate model via Ollama — a 32B parameter model for local development while the production pair serves a 70B model.
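
The VRAM column in the table is a rough 4-bit figure. A back-of-the-envelope estimator shows where those numbers come from; the 20% overhead allowance for KV cache and runtime buffers is an assumption, not a measurement:

def estimate_vram_gb(params_billion: float, bits_per_param: float = 4,
                     overhead: float = 0.20) -> float:
    """Weights at the given precision plus a flat overhead factor for
    KV cache, activations, and runtime buffers."""
    weights_gb = params_billion * bits_per_param / 8   # Gparams * bytes/param
    return weights_gb * (1 + overhead)

for name, params in [("Llama 3.1 8B", 8), ("Mixtral 8x7B", 46.7),
                     ("Llama 3.1 70B", 70), ("Llama 3.1 405B", 405)]:
    print(f"{name:<15} 4-bit ~{estimate_vram_gb(params, 4):4.0f} GB   "
          f"FP16 ~{estimate_vram_gb(params, 16):4.0f} GB")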

The PCIe Bandwidth Problem

V100 PCIe cards communicate at roughly 32 GB/s (bidirectional) over the PCIe 3.0 x16 bus. The V100 SXM2 variant uses NVLink at 300 GB/s, nearly 10x faster. This matters for tensor parallelism, where GPUs exchange data constantly during inference.

In practice, this means:

  • Single-GPU workloads perform identically on PCIe vs SXM2
  • Multi-GPU tensor parallel workloads see 15-30% lower throughput on PCIe
  • Pipeline parallelism (splitting model layers across GPUs) is less affected

For our use case — serving a 70B model across 2 GPUs — the PCIe bandwidth is acceptable. We’re not training, just running inference where the bottleneck is usually memory capacity, not bandwidth.
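
As a very rough sketch of what tensor parallelism actually moves over the bus, assume Megatron-style TP with two all-reduces per transformer layer and Llama 3.1 70B dimensions (hidden size 8,192, 80 layers), ignoring kernel-launch and synchronization latency:

# Estimated tensor-parallel traffic per decoded token (batch size 1, FP16).
HIDDEN_SIZE = 8192
NUM_LAYERS = 80
BYTES_PER_VALUE = 2        # FP16 activations
ALLREDUCES_PER_LAYER = 2   # one after attention, one after the MLP

bytes_per_token = HIDDEN_SIZE * BYTES_PER_VALUE * ALLREDUCES_PER_LAYER * NUM_LAYERS

for name, gb_per_s in [("PCIe 3.0 x16", 16), ("NVLink (SXM2)", 150)]:  # per-direction GB/s
    ms = bytes_per_token / (gb_per_s * 1e9) * 1e3
    print(f"{name:<14} ~{bytes_per_token / 1e6:.1f} MB/token, ~{ms:.3f} ms transfer")

Per-token transfer time is tiny on either interconnect; the multi-GPU penalty on PCIe likely comes from synchronization overhead and larger batch sizes rather than raw transfer time, which is why it shows up as a modest percentage rather than a 10x slowdown.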

Tensor Parallelism Gotcha

Tensor parallelism works cleanly only when the GPU count divides the model’s attention heads evenly, which in practice means 2, 4, or 8 GPUs; TP=3 isn’t an option for most models. With 3 GPUs, the options are:

  1. TP=2 on GPUs 0+1 + GPU 2 standalone (recommended)
  2. TP=1 with PP=3 across all three GPUs: pipeline parallelism adds complexity for marginal gain
  3. All 3 independent — each runs a smaller model (up to 32B each)

Option 1 is the practical choice: a 70B model on two GPUs for production, plus a separate model on the third for development or a different use case.
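
The head-count rule is easy to check programmatically; here is a minimal sketch (the helper name is made up for illustration) using Llama 3.1 70B’s configuration of 64 attention heads and 8 KV heads:

def valid_tp_sizes(num_heads: int, num_kv_heads: int, max_gpus: int) -> list[int]:
    """Tensor-parallel sizes where both attention heads and KV heads
    split evenly across GPUs."""
    return [tp for tp in range(1, max_gpus + 1)
            if num_heads % tp == 0 and num_kv_heads % tp == 0]

print(valid_tp_sizes(num_heads=64, num_kv_heads=8, max_gpus=3))   # -> [1, 2]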

Serving Framework: vLLM

For production inference, vLLM is the standard choice:

  • Continuous batching (handles concurrent requests efficiently)
  • PagedAttention (optimizes VRAM usage)
  • OpenAI-compatible API (drop-in replacement)
  • Tensor parallelism support out of the box

A vLLM container with --tensor-parallel-size 2 on GPUs 0 and 1 serves an OpenAI-compatible API on port 8000. Any application that speaks the OpenAI API format works without modification.
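
For example, anything written against the openai Python package only needs a different base_url. The port matches vLLM’s default; the model name below is an assumption standing in for whatever was passed to --model:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",   # vLLM's OpenAI-compatible endpoint
    api_key="not-needed",                  # ignored unless vLLM is started with --api-key
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-70B-Instruct",   # must match the served model name
    messages=[{"role": "user", "content": "Summarize this server build in one sentence."}],
    max_tokens=128,
)
print(response.choices[0].message.content)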

Expected performance with V100 TP=2:

  • 70B model: ~30-50 tokens/sec
  • Time to first token: ~200-400ms
  • Concurrent users: 10-20 without degradation

Revenue Potential

Three V100 GPUs can generate passive income through GPU rental platforms:

Platform     Pricing Model            Expected Revenue
Vast.ai      Per-GPU hourly rental    $0.20-0.30/GPU/hr
Custom API   Per-token pricing        Variable
Dedicated    Monthly lease            $200-400/mo

At $0.20/GPU/hr with 80% utilization across 3 GPUs (3 × $0.20 × 0.80 × 720 hours/month):

  • Monthly gross: ~$346
  • Power cost: ~$65/mo
  • Net profit: ~$281/mo
  • Break-even on hardware: ~8 months

This assumes the GPUs are rented most of the time. Utilization below 50% extends break-even significantly.
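
The break-even math as a quick script, using this post’s assumed rate, utilization, and power cost:

# Rental break-even estimate; all inputs are assumptions from this post.
HARDWARE_COST = 3 * 759       # three used V100s at $759 each
RATE_PER_GPU_HR = 0.20
NUM_GPUS = 3
HOURS_PER_MONTH = 720
POWER_COST_MONTHLY = 65

for utilization in (0.80, 0.50, 0.30):
    gross = NUM_GPUS * RATE_PER_GPU_HR * utilization * HOURS_PER_MONTH
    net = gross - POWER_COST_MONTHLY
    months = HARDWARE_COST / net if net > 0 else float("inf")
    print(f"{utilization:.0%} utilization: gross ${gross:.0f}/mo, "
          f"net ${net:.0f}/mo, break-even {months:.1f} months")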

Bonus: GPU Transcoding

Beyond inference, the V100’s NVENC encoder handles video transcoding. Tdarr with GPU workers can transcode H.264 to H.265 at 10-50x CPU speed — relevant for a media server with 174TB of content.

CPU transcoding on EPYC 7282: ~5 files/day
GPU transcoding on V100:      ~100+ files/day

On a library with thousands of remux files, GPU transcoding recovers storage space significantly faster.
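
Tdarr assembles its own ffmpeg commands through plugins, but the underlying operation looks roughly like this sketch (the file paths and quality setting are placeholders):

# Minimal H.264 -> H.265 transcode using NVENC via ffmpeg.
import subprocess

def transcode_hevc_nvenc(src: str, dst: str, cq: int = 24) -> None:
    subprocess.run([
        "ffmpeg",
        "-hwaccel", "cuda",      # decode on the GPU where possible
        "-i", src,
        "-c:v", "hevc_nvenc",    # NVENC H.265 encoder
        "-cq", str(cq),          # constant-quality target
        "-c:a", "copy",          # pass audio through untouched
        dst,
    ], check=True)

transcode_hevc_nvenc("input_remux.mkv", "output_hevc.mkv")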

Pre-Installation Prep

Before the GPUs arrive:

  • Verify Unraid NVIDIA Driver plugin is compatible with current kernel
  • Download vLLM Docker image
  • Create GPU monitoring script (a minimal sketch follows this list)
  • Document current power draw for baseline comparison
  • Research Vast.ai host onboarding process
  • Pre-download target models (70B model is ~40GB)
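
A minimal version of that monitoring script, polling nvidia-smi every 30 seconds (the interval and output format are arbitrary choices):

# Poll nvidia-smi and print one line per GPU.
import subprocess
import time

QUERY = ("index,name,utilization.gpu,memory.used,memory.total,"
         "power.draw,temperature.gpu")

def sample() -> None:
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    for line in out.strip().splitlines():
        idx, name, util, used, total, power, temp = [f.strip() for f in line.split(",")]
        print(f"GPU{idx} {name}: {util}% util, {used}/{total} MiB, {power} W, {temp} C")

while True:
    sample()
    time.sleep(30)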

Having everything ready means the GPUs go from unboxed to earning revenue in hours, not days.


Resources

The hardware and software mentioned in this post:

  • NVIDIA V100 PCIe 32GB — The GPU we’re installing. Look for used/refurbished units around $750.
  • Unraid OS — Server OS with built-in Docker, VM support, and NVIDIA GPU passthrough via plugin.
  • vLLM — Production inference server with tensor parallelism and OpenAI-compatible API. Free and open-source.