ADR-0019: Local LLM Strategy (Hybrid Architecture)
Reviewed on 2026-03-06.
This accepted ADR remains useful as historical design lineage, but the current source of truth for local models and fallback behavior is:
- docs/configuration/MODEL_POLICY_CURRENT.md
- docs/architecture/CURRENT_ARCHITECTURE.md
- docs/FEATURE_MATRIX.md
Date: 2026-01-23
Status: Accepted
Author: Rovo (Architect Agent)
Context: Task-Voice completion, Phase P (Production Deployment - Local-First)
Implementation Status Reviewed: Historical implementation note retained; verify current state via canonical model-policy docs.
1. Context
Jorvis aims to be a local-first analytics platform. For local LLM inference, we need a strategy that:
- Maximizes performance on Apple Silicon (M4 Pro MAX)
- Maintains compatibility with Docker-based Open WebUI
- Provides fallback to cloud when local inference is unavailable
Hardware Context
- Host Machine: Apple M4 Pro MAX
- GPU Acceleration: Metal (Apple Silicon native)
- Ollama Version: v0.14.3 (native installation)
Problem with Dockerized Ollama on Mac
Running Ollama inside Docker on macOS often results in:
- CPU-only inference (no Metal acceleration)
- Significantly slower performance (10-20x slower than native)
- Higher memory overhead (Docker VM layer)
2. Decision
Selected: Hybrid Local Architecture
Architecture Overview
┌─────────────────────────────────────────────────────────────┐
│ Host (macOS M4 Pro MAX) │
│ │
│ ┌─────────────────────┐ ┌─────────────────────────────┐ │
│ │ Native Ollama │ │ Docker Engine │ │
│ │ (Port 11434) │ │ │ │
│ │ │ │ ┌───────────────────────┐ │ │
│ │ • Metal GPU ✅ │◄───┼──│ jorvis-local-webui │ │ │
│ │ • Full Performance │ │ │ (Open WebUI) │ │ │
│ │ │ │ │ │ │ │
│ │ Models: │ │ │ Connects via: │ │ │
│ │ • gemma3:12b │ │ │ host.docker.internal │ │ │
│ │ • (fallback GGUF) │ │ └───────────────────────┘ │ │
│ └─────────────────────┘ │ │ │
│ │ ┌───────────────────────┐ │ │
│ │ │ jorvis-api │ │ │
│ │ │ (NestJS Backend) │ │ │
│ │ └───────────────────────┘ │ │
│ └─────────────────────────────┘ │
└─────────────────────────────────────────────────────────────┘
│
│ Fallback (if Ollama unreachable)
▼
┌─────────────────────┐
│ Gemini Cloud API │
│ (gemini-2.5-flash)│
└─────────────────────┘
Components
| Component | Location | Connection |
|---|---|---|
| Open WebUI | Docker (jorvis-local-webui) | http://host.docker.internal:11434 |
| Ollama | Host (native) | localhost:11434 |
| jorvis-api | Docker | Can use both Ollama and Gemini |
3. Configuration
3.1 Open WebUI Settings
# docker-compose.local.yml
services:
  jorvis-local-webui:
    environment:
      - OLLAMA_BASE_URL=http://host.docker.internal:11434
      - ENABLE_OLLAMA_API=true
3.2 Primary Model
# Run on HOST (not in Docker)
ollama pull gemma3:12b
Model Details:
- Name: gemma3:12b
- Size: ~8GB VRAM
- Performance: Excellent on M4 Pro MAX with Metal
- Use Case: General chat, code assistance, analytics queries
3.3 Fallback Models (GGUF)
Available local GGUF models for specialized tasks:
| Model | Size | Use Case |
|---|---|---|
| qwen3-4b-text2sql-4bit.gguf | ~2.5GB | Text-to-SQL generation |
| Qwen3-4B-Instruct-2507-Q5_K_M.gguf | ~3GB | General instruction following |
| gemma-3-4b-it-null-space-abliterated.i1-Q5_K_M.gguf | ~3GB | Uncensored responses |
| prem-1b-sql.Q6_K.gguf | ~1GB | Lightweight SQL generation |
Location: $PROJECT_ROOT/artifacts/ (project artifacts directory)
Loading GGUF in Ollama:
# Create Modelfile (replace $PROJECT_ROOT with actual path)
echo "FROM $PROJECT_ROOT/artifacts/qwen3-4b-text2sql-4bit.gguf" > Modelfile
ollama create qwen3-text2sql -f Modelfile
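For repeatability, the Modelfile step can be scripted. The sketch below is a hypothetical wrapper around the commands above; the PARAMETER values (greedy decoding, 4k context) are illustrative tuning for SQL generation, not project policy, and PROJECT_ROOT falls back to the current directory for the example.

```shell
# Build a Modelfile for the text2sql GGUF with deterministic decoding.
# PROJECT_ROOT defaults to the current directory in this sketch.
MODEL_PATH="${PROJECT_ROOT:-$PWD}/artifacts/qwen3-4b-text2sql-4bit.gguf"
cat > Modelfile <<EOF
FROM ${MODEL_PATH}
PARAMETER temperature 0
PARAMETER num_ctx 4096
EOF
# Register it (requires the native Ollama daemon to be running):
# ollama create qwen3-text2sql -f Modelfile
```

`temperature 0` keeps SQL generation deterministic; both directives are standard Ollama Modelfile parameters.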
3.4 Cloud Fallback
If native Ollama is unreachable, fall back to Gemini Cloud:
// Connection check (TypeScript; assumes Node 18+ for global fetch)
async function getInferenceEndpoint(): Promise<string> {
  try {
    // fetch has no `timeout` option; abort the probe after 2s instead
    await fetch('http://localhost:11434/api/tags', { signal: AbortSignal.timeout(2000) });
    return 'http://localhost:11434'; // Use local Ollama
  } catch {
    return 'gemini'; // Fall back to Gemini Cloud
  }
}
4. Implementation
4.1 Host Setup (One-time)
# 1. Verify Ollama is running
ollama --version # Should show v0.14.3+
# 2. Pull primary model
ollama pull gemma3:12b
# 3. Test inference
ollama run gemma3:12b "Hello, what model are you?"
# 4. Verify Metal acceleration
# Check Activity Monitor → GPU History during inference
4.2 Docker Compose Update
# deploy/docker-compose.local.yml
services:
  jorvis-local-webui:
    image: ghcr.io/open-webui/open-webui:main
    environment:
      - OLLAMA_BASE_URL=http://host.docker.internal:11434
      - ENABLE_OLLAMA_API=true
      - DEFAULT_MODELS=gemma3:12b
    extra_hosts:
      - "host.docker.internal:host-gateway"
4.3 Verification Commands
# Test Ollama from Docker container
docker exec jorvis-local-webui curl http://host.docker.internal:11434/api/tags
# Test model availability
curl http://localhost:11434/api/tags | jq '.models[].name'
# Test inference
curl http://localhost:11434/api/generate -d '{
  "model": "gemma3:12b",
  "prompt": "What is 2+2?",
  "stream": false
}'
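Startup scripts can turn the tags check into an assertion. The helper below is a sketch (the `has_model` name is ours, not a project function): it reads an `/api/tags` payload on stdin and exits non-zero when the requested model is absent, which plays well with `set -e` startup scripts.

```shell
# Check an Ollama /api/tags JSON payload (stdin) for a model name ($1).
# jq -e exits non-zero when the select() produces no output.
has_model() {
  jq -e --arg m "$1" '.models[].name | select(. == $m)' >/dev/null
}

# Example usage (requires the native Ollama daemon):
# curl -s http://localhost:11434/api/tags | has_model gemma3:12b
```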
5. Consequences
Positive
- Full Metal GPU acceleration — Native performance on Apple Silicon
- 10-20x faster inference compared to Dockerized Ollama
- Lower memory overhead — No Docker VM layer for inference
- Flexibility — Easy to swap models, add GGUF files
- Offline capability — Works without internet (after model download)
Negative
- Requires host Ollama installation — Additional setup step
- Port conflict potential — If host port 11434 is used by another service
- Manual model management — Models not managed by Docker Compose
Mitigations
- Document setup in docs/operations/LOCAL_RUNTIME.md
- Add health check for Ollama in startup scripts
- Implement automatic fallback to Gemini Cloud
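A minimal sketch of such a startup health check, reusing the 2-second budget and `/api/tags` probe from section 3.4. The `choose_backend` name and the plain `ollama`/`gemini` result strings are illustrative, not existing project conventions.

```shell
# Decide which inference backend a startup script should configure.
# $1 (optional) is the Ollama base URL; defaults to the native daemon.
choose_backend() {
  local base="${1:-http://localhost:11434}"
  if curl -sf --max-time 2 "${base}/api/tags" >/dev/null 2>&1; then
    echo "ollama"   # native Ollama reachable: use local inference
  else
    echo "gemini"   # unreachable: fall back to Gemini Cloud
  fi
}
```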
6. Alternatives Considered
A. Dockerized Ollama
- Rejected: CPU-only on Mac, 10-20x slower
B. llama.cpp directly
- Rejected: More complex setup, Ollama provides better UX
C. Cloud-only (Gemini/OpenAI)
- Rejected: Violates local-first principle, requires internet
D. LM Studio
- Rejected: Less scriptable than Ollama, GUI-focused
7. References
- Ollama Documentation
- Open WebUI + Ollama Setup
- Apple Metal Performance Shaders
- ADR-0018: Voice Pipeline Local Architecture
- docs/operations/LOCAL_RUNTIME.md
8. Approval
Status: Accepted — Approved by George (User) 2026-01-23
This ADR was created by Rovo (Architect) for Phase P local-first deployment.