ADR-0019: Local LLM Strategy (Hybrid Architecture)

Reviewed on 2026-03-06.

This accepted ADR remains useful as historical design lineage, but the current local-model and fallback behavior is documented in:

  • docs/configuration/MODEL_POLICY_CURRENT.md
  • docs/architecture/CURRENT_ARCHITECTURE.md
  • docs/FEATURE_MATRIX.md

Date: 2026-01-23
Status: Accepted
Author: Rovo (Architect Agent)
Context: Task-Voice completion, Phase P (Production Deployment - Local-First)
Implementation Status Reviewed: Historical implementation note retained; verify current state via canonical model-policy docs.

1. Context

Jorvis aims to be a local-first analytics platform. For local LLM inference, we need a strategy that:

  • Maximizes performance on Apple Silicon (M4 Pro MAX)
  • Maintains compatibility with Docker-based Open WebUI
  • Provides fallback to cloud when local inference is unavailable

Hardware Context

  • Host Machine: Apple M4 Pro MAX
  • GPU Acceleration: Metal (Apple Silicon native)
  • Ollama Version: v0.14.3 (native installation)

Problem with Dockerized Ollama on Mac

Running Ollama inside Docker on macOS often results in:

  • CPU-only inference (no Metal acceleration)
  • Significantly slower performance (10-20x slower than native)
  • Higher memory overhead (Docker VM layer)

2. Decision

Selected: Hybrid Local Architecture

Architecture Overview

┌─────────────────────────────────────────────────────────────┐
│                     Host (macOS M4 Pro MAX)                 │
│                                                             │
│  ┌─────────────────────┐    ┌─────────────────────────────┐ │
│  │   Native Ollama     │    │       Docker Engine         │ │
│  │   (Port 11434)      │    │                             │ │
│  │                     │    │  ┌───────────────────────┐  │ │
│  │  • Metal GPU ✅     │◄───┼──│   jorvis-local-webui  │  │ │
│  │  • Full Performance │    │  │   (Open WebUI)        │  │ │
│  │                     │    │  │                       │  │ │
│  │  Models:            │    │  │  Connects via:        │  │ │
│  │  • gemma3:12b       │    │  │  host.docker.internal │  │ │
│  │  • (fallback GGUF)  │    │  └───────────────────────┘  │ │
│  └─────────────────────┘    │                             │ │
│                             │  ┌───────────────────────┐  │ │
│                             │  │   jorvis-api          │  │ │
│                             │  │   (NestJS Backend)    │  │ │
│                             │  └───────────────────────┘  │ │
│                             └─────────────────────────────┘ │
└─────────────────────────────────────────────────────────────┘
                              │ Fallback (if Ollama unreachable)
                    ┌─────────────────────┐
                    │   Gemini Cloud API  │
                    │   (gemini-2.5-flash)│
                    └─────────────────────┘

Components

Component    Location                       Connection
Open WebUI   Docker (jorvis-local-webui)    http://host.docker.internal:11434
Ollama       Host (native)                  localhost:11434
jorvis-api   Docker                         Can use both Ollama and Gemini
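Because the same service may run on the host or inside a container, the Ollama base URL has to be resolved per runtime. A minimal sketch of that resolution; `JORVIS_RUNTIME` is an illustrative variable name, not part of the documented configuration, while `OLLAMA_BASE_URL` matches the Open WebUI setting below:

```typescript
// Sketch: resolve the Ollama base URL depending on where the caller runs.
// JORVIS_RUNTIME is a hypothetical env var; OLLAMA_BASE_URL mirrors the
// docker-compose setting used for Open WebUI.
function resolveOllamaUrl(env: Record<string, string | undefined>): string {
  // An explicit override always wins.
  if (env.OLLAMA_BASE_URL) return env.OLLAMA_BASE_URL;
  // Inside a container, the host's native Ollama is reachable via the
  // special DNS name host.docker.internal (mapped by extra_hosts).
  if (env.JORVIS_RUNTIME === 'docker') return 'http://host.docker.internal:11434';
  // On the host itself, talk to the native daemon directly.
  return 'http://localhost:11434';
}
```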

3. Configuration

3.1 Open WebUI Settings

# docker-compose.local.yml
services:
  jorvis-local-webui:
    environment:
      - OLLAMA_BASE_URL=http://host.docker.internal:11434
      - ENABLE_OLLAMA_API=true

3.2 Primary Model

# Run on HOST (not in Docker)
ollama pull gemma3:12b

Model Details:

  • Name: gemma3:12b
  • Size: ~8GB VRAM
  • Performance: Excellent on M4 Pro MAX with Metal
  • Use Case: General chat, code assistance, analytics queries

3.3 Fallback Models (GGUF)

Available local GGUF models for specialized tasks:

Model                                                 Size     Use Case
qwen3-4b-text2sql-4bit.gguf                           ~2.5GB   Text-to-SQL generation
Qwen3-4B-Instruct-2507-Q5_K_M.gguf                    ~3GB     General instruction following
gemma-3-4b-it-null-space-abliterated.i1-Q5_K_M.gguf   ~3GB     Uncensored responses
prem-1b-sql.Q6_K.gguf                                 ~1GB     Lightweight SQL generation

Location: $PROJECT_ROOT/artifacts/ (project artifacts directory)

Loading GGUF in Ollama:

# Create Modelfile (replace $PROJECT_ROOT with actual path)
echo "FROM $PROJECT_ROOT/artifacts/qwen3-4b-text2sql-4bit.gguf" > Modelfile
ollama create qwen3-text2sql -f Modelfile

3.4 Cloud Fallback

If native Ollama is unreachable, fall back to Gemini Cloud:

// Connection check (runnable sketch, Node 18+)
async function getInferenceEndpoint(): Promise<string> {
  try {
    // fetch has no `timeout` option; use an AbortSignal instead
    await fetch('http://localhost:11434/api/tags', { signal: AbortSignal.timeout(2000) });
    return 'http://localhost:11434';  // Use local Ollama
  } catch {
    return 'gemini';  // Fall back to Gemini Cloud
  }
}
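A sketch of how the probe above could feed an actual request. The `/api/generate` body matches the curl example in section 4.3; `callGemini` is a hypothetical placeholder for the real Gemini client, which is not shown here:

```typescript
// Mirrors the JSON body used with Ollama's /api/generate endpoint.
interface GenerateRequest { model: string; prompt: string; stream: boolean; }

function buildGenerateRequest(model: string, prompt: string): GenerateRequest {
  return { model, prompt, stream: false };  // non-streaming: one JSON response
}

async function infer(prompt: string): Promise<string> {
  const local = 'http://localhost:11434';
  let useLocal = true;
  try {
    // Probe the local daemon with a short timeout (same check as above).
    await fetch(`${local}/api/tags`, { signal: AbortSignal.timeout(2000) });
  } catch {
    useLocal = false;
  }
  if (!useLocal) {
    return callGemini(prompt);  // hypothetical cloud fallback path
  }
  const res = await fetch(`${local}/api/generate`, {
    method: 'POST',
    body: JSON.stringify(buildGenerateRequest('gemma3:12b', prompt)),
  });
  const data = await res.json();
  return data.response;  // Ollama's non-streaming reply carries the text in `response`
}

async function callGemini(prompt: string): Promise<string> {
  // Placeholder only; the real Gemini client is wired up elsewhere.
  throw new Error('Gemini fallback not implemented in this sketch');
}
```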

4. Implementation

4.1 Host Setup (One-time)

# 1. Verify Ollama is running
ollama --version  # Should show v0.14.3+

# 2. Pull primary model
ollama pull gemma3:12b

# 3. Test inference
ollama run gemma3:12b "Hello, what model are you?"

# 4. Verify Metal acceleration
# Check Activity Monitor → GPU History during inference

4.2 Docker Compose Update

# deploy/docker-compose.local.yml
services:
  jorvis-local-webui:
    image: ghcr.io/open-webui/open-webui:main
    environment:
      - OLLAMA_BASE_URL=http://host.docker.internal:11434
      - ENABLE_OLLAMA_API=true
      - DEFAULT_MODELS=gemma3:12b
    extra_hosts:
      - "host.docker.internal:host-gateway"

4.3 Verification Commands

# Test Ollama from Docker container
docker exec jorvis-local-webui curl http://host.docker.internal:11434/api/tags

# Test model availability
curl http://localhost:11434/api/tags | jq '.models[].name'

# Test inference
curl http://localhost:11434/api/generate -d '{
  "model": "gemma3:12b",
  "prompt": "What is 2+2?",
  "stream": false
}'
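The jq check can also be run programmatically from jorvis-api, for example as a startup assertion. A sketch (function names are illustrative; only the `/api/tags` endpoint and its `models[].name` shape come from the commands above):

```typescript
// Shape of Ollama's /api/tags response, reduced to what we need.
interface TagsResponse { models: { name: string }[]; }

// Pure projection, same as jq '.models[].name'; testable without a daemon.
function extractModelNames(tags: TagsResponse): string[] {
  return tags.models.map((m) => m.name);
}

async function assertModelAvailable(model: string): Promise<void> {
  const res = await fetch('http://localhost:11434/api/tags');
  const names = extractModelNames(await res.json());
  if (!names.includes(model)) {
    throw new Error(`Model ${model} not found; run: ollama pull ${model}`);
  }
}
```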

5. Consequences

Positive

  • Full Metal GPU acceleration — Native performance on Apple Silicon
  • 10-20x faster inference compared to Dockerized Ollama
  • Lower memory overhead — No Docker VM layer for inference
  • Flexibility — Easy to swap models, add GGUF files
  • Offline capability — Works without internet (after model download)

Negative

  • Requires host Ollama installation — Additional setup step
  • Port conflict potential — If host port 11434 is used by another service
  • Manual model management — Models not managed by Docker Compose

Mitigations

  • Document setup in docs/operations/LOCAL_RUNTIME.md
  • Add health check for Ollama in startup scripts
  • Implement automatic fallback to Gemini Cloud
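The startup health check mentioned above could poll the native daemon a few times before declaring it unreachable, so the API decides between local inference and the Gemini fallback once at boot. A sketch; the attempt count and delays are illustrative defaults, not documented values:

```typescript
const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Linear backoff: 1x, 2x, 3x... the base delay.
function backoffDelays(attempts: number, baseMs: number): number[] {
  return Array.from({ length: attempts }, (_, i) => baseMs * (i + 1));
}

async function waitForOllama(baseUrl: string, attempts = 3, baseMs = 1000): Promise<boolean> {
  for (const delay of backoffDelays(attempts, baseMs)) {
    try {
      const res = await fetch(`${baseUrl}/api/tags`, { signal: AbortSignal.timeout(2000) });
      if (res.ok) return true;  // daemon is up and answering
    } catch {
      // connection refused or timed out; wait and retry
    }
    await sleep(delay);
  }
  return false;  // caller should switch to the Gemini Cloud fallback
}
```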

6. Alternatives Considered

A. Dockerized Ollama

  • Rejected: CPU-only on Mac, 10-20x slower

B. llama.cpp directly

  • Rejected: More complex setup, Ollama provides better UX

C. Cloud-only (Gemini/OpenAI)

  • Rejected: Violates local-first principle, requires internet

D. LM Studio

  • Rejected: Less scriptable than Ollama, GUI-focused

7. References

None.
8. Approval

Status: Accepted — Approved by George (User) 2026-01-23


This ADR was created by Rovo (Architect) for Phase P local-first deployment.