Local LLM
Active

Self-hosted AI inference with Ollama running on 3x NVIDIA Tesla V100 GPUs (96GB total VRAM). Serving an 80B-parameter model at ~42 tokens/sec for development, a personal assistant, and content generation, with zero cloud dependency.
Configuration

| Setting | Value |
|---|---|
| GPUs | 3x Tesla V100-PCIE-32GB |
| Total VRAM | 96GB |
| Speed | ~42 tok/s (80B model) |
| Access | LAN only (not exposed) |
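A quick back-of-the-envelope check shows why 96GB of VRAM is enough for the 80B model. This sketch uses approximate average bits-per-weight figures for llama.cpp quantization formats (an assumption, not measured values); real usage adds KV cache and compute buffers on top of the weights.

```python
# Approximate average bits per weight for common llama.cpp quants
# (assumed values for illustration; actual averages vary per model).
BITS_PER_WEIGHT = {"Q4_K_M": 4.8, "Q6_K": 6.56, "Q8_0": 8.5, "F16": 16.0}

def weight_gb(params_billion: float, quant: str) -> float:
    """Approximate weight footprint in decimal GB for a quantized model."""
    bits = BITS_PER_WEIGHT[quant] * params_billion * 1e9
    return bits / 8 / 1e9

print(round(weight_gb(80, "Q6_K")))  # ~66 GB of weights, fits in 96GB
```

At Q6_K, 80B parameters come to roughly 66GB of weights, leaving headroom on the three cards for context and buffers.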
Case Study: GPU-Accelerated Local AI Infrastructure
The Challenge
Run production AI inference for a personal assistant (Nova), 4 parallel AI coding workers, blog generation, and voice transcription — without recurring cloud API costs eating into a bootstrapped business budget.
The Solution
- ✓ Deployed 3x NVIDIA Tesla V100 GPUs, with Ollama serving an 80B MoE model sharded across all three
- ✓ Custom model import pipeline (sharded GGUF download, merge, Ollama import) for models too large for standard pull
- ✓ GPU-accelerated Whisper (large-v3-turbo) for voice transcription on dedicated GPU
- ✓ Integrated with Claude Code workers — Ollama handles ~90% of queries for free
- ✓ Nova personal assistant routes queries through 3 tiers: regex (free) → Ollama (free) → Claude API (paid)
- ✓ Weekly blog posts auto-generated via llama3.3:70b pipeline
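The three-tier routing described above can be sketched as follows. The handler patterns, model name, and endpoint are illustrative assumptions, not Nova's actual implementation; the Ollama call uses the standard `/api/generate` endpoint on the default port.

```python
import re
import json
import urllib.request

# Tier 1: regex handlers for trivial queries (free, instant).
# Patterns and replies here are hypothetical examples.
REGEX_HANDLERS = [
    (re.compile(r"^\s*(hi|hello)\b", re.I), "Hello! How can I help?"),
    (re.compile(r"\btime\b", re.I), "Checking the clock..."),
]

def ask_ollama(prompt: str, model: str = "qwen3-coder-next",
               host: str = "http://localhost:11434") -> str:
    """Tier 2: local Ollama /api/generate call (free, LAN only)."""
    body = json.dumps({"model": model, "prompt": prompt,
                       "stream": False}).encode()
    req = urllib.request.Request(f"{host}/api/generate", data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

def route(prompt: str, claude_fallback=None) -> str:
    # Tier 1: regex match, no model needed.
    for pattern, reply in REGEX_HANDLERS:
        if pattern.search(prompt):
            return reply
    # Tier 2: local Ollama; fall through to the paid API only on failure.
    try:
        return ask_ollama(prompt)
    except OSError:
        # Tier 3: paid Claude API (only if a fallback callable is supplied).
        if claude_fallback is not None:
            return claude_fallback(prompt)
        raise
```

Because the cheap tiers are tried first, paid API calls only happen when the regex layer misses and the local model is unavailable.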
The Results
Available Models
| Model | Parameters | Purpose | VRAM |
|---|---|---|---|
| qwen3-coder-next | 80B MoE | Primary model — coding, reasoning, tool calling, Nova assistant | ~66GB (Q6_K across 3 GPUs) |
| llama3.3:70b | 70B | Long-form content generation, weekly blog posts | ~57GB (Q6_K) |
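Models this large often ship as sharded GGUF files, which is what the custom import pipeline handles. A sketch of the pipeline is below; the shard naming convention and tool names (`huggingface-cli`, llama.cpp's `llama-gguf-split`, `ollama create`) reflect common usage, but the repo, file names, and Modelfile here are hypothetical.

```python
import subprocess

def shard_names(base: str, n: int) -> list[str]:
    """Expand a base name into llama.cpp's sharded-GGUF file names."""
    return [f"{base}-{i:05d}-of-{n:05d}.gguf" for i in range(1, n + 1)]

def import_model(repo: str, base: str, n_shards: int, tag: str) -> None:
    # 1. Download every shard from the model repo.
    for name in shard_names(base, n_shards):
        subprocess.run(["huggingface-cli", "download", repo, name],
                       check=True)
    # 2. Merge shards into a single GGUF (the gguf-split tool takes
    #    the first shard and locates the rest by name).
    merged = f"{base}.gguf"
    subprocess.run(["llama-gguf-split", "--merge",
                    shard_names(base, n_shards)[0], merged], check=True)
    # 3. Import into Ollama via a minimal Modelfile.
    with open("Modelfile", "w") as f:
        f.write(f"FROM ./{merged}\n")
    subprocess.run(["ollama", "create", tag, "-f", "Modelfile"],
                   check=True)
```

This replaces `ollama pull` for models whose quantizations are only published as multi-file GGUF shards.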
Benefits
Cost Reduction
Free inference for simple tasks that would otherwise use paid API calls. Saves money on summarization, status checks, and simple questions.
Privacy
Sensitive code and data never leaves the local network. No cloud provider sees your prompts or responses.
Speed
Local inference with no network latency. Responses start immediately without waiting for API round-trips.
Availability
Works offline and during API outages. Not dependent on external service availability.
Integrations
| Integration | Description |
|---|---|
| Claude Code Workers | Workers use Ollama for simple tasks before falling back to paid APIs (~90% handled locally) |
| Nova Assistant | Personal AI assistant uses Ollama as primary model with Claude API fallback |
| Ollama MCP Server | Model Context Protocol server for standardized LLM access |
| Blog Generation | Weekly blog posts auto-generated via llama3.3:70b pipeline |
| Whisper Transcription | GPU-accelerated speech-to-text for voice input via Nova |
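The transcription path can be sketched with the `openai-whisper` package, which ships a `large-v3-turbo` checkpoint. The model/device choice and SRT-style timestamp formatting below are illustrative assumptions about how Nova consumes the output, not the actual integration code.

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT-style HH:MM:SS,mmm timestamp."""
    ms = int(round(seconds * 1000))
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def transcribe(path: str) -> list[tuple[str, str]]:
    """GPU-accelerated transcription returning (timestamp, text) pairs."""
    # Lazy import so the timestamp helper works without whisper installed.
    import whisper  # pip install openai-whisper

    model = whisper.load_model("large-v3-turbo", device="cuda")
    result = model.transcribe(path)
    return [(srt_timestamp(seg["start"]), seg["text"].strip())
            for seg in result["segments"]]
```

Pinning the Whisper model to its own GPU, as described above, keeps transcription latency independent of whatever the inference GPUs are serving.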