ADR-0019: Local LLM Strategy (Hybrid Architecture)

Reviewed on 2026-03-06.

This accepted ADR remains useful as historical design lineage, but the current local-model and fallback behavior is documented in:

  • docs/configuration/MODEL_POLICY_CURRENT.md
  • docs/architecture/CURRENT_ARCHITECTURE.md
  • docs/FEATURE_MATRIX.md

Date: 2026-01-23
Status: Accepted
Author: Rovo (Architect Agent)
Context: Task-Voice completion, Phase P (Production Deployment - Local-First)
Implementation Status Reviewed: Historical implementation note retained; verify current state via canonical model-policy docs.

1. Context

Jorvis aims to be a local-first analytics platform. For local LLM inference, we need a strategy that:

  • Maximizes performance on Apple Silicon (M4 Pro MAX)
  • Maintains compatibility with Docker-based Open WebUI
  • Provides fallback to cloud when local inference is unavailable

Hardware Context

  • Host Machine: Apple M4 Pro MAX
  • GPU Acceleration: Metal (Apple Silicon native)
  • Ollama Version: v0.14.3 (native installation)

Problem with Dockerized Ollama on Mac

Running Ollama inside Docker on macOS often results in:

  • CPU-only inference (no Metal acceleration)
  • Significantly slower performance (10-20x slower than native)
  • Higher memory overhead (Docker VM layer)

2. Decision

Selected: Hybrid Local Architecture

Architecture Overview

┌─────────────────────────────────────────────────────────────┐
│                     Host (macOS M4 Pro MAX)                 │
│                                                             │
│  ┌─────────────────────┐    ┌─────────────────────────────┐ │
│  │   Native Ollama     │    │       Docker Engine         │ │
│  │   (Port 11434)      │    │                             │ │
│  │                     │    │  ┌───────────────────────┐  │ │
│  │  • Metal GPU ✅     │◄───┼──│   jorvis-local-webui  │  │ │
│  │  • Full Performance │    │  │   (Open WebUI)        │  │ │
│  │                     │    │  │                       │  │ │
│  │  Models:            │    │  │  Connects via:        │  │ │
│  │  • gemma3:12b       │    │  │  host.docker.internal │  │ │
│  │  • (fallback GGUF)  │    │  └───────────────────────┘  │ │
│  └─────────────────────┘    │                             │ │
│                             │  ┌───────────────────────┐  │ │
│                             │  │   jorvis-api          │  │ │
│                             │  │   (NestJS Backend)    │  │ │
│                             │  └───────────────────────┘  │ │
│                             └─────────────────────────────┘ │
└─────────────────────────────────────────────────────────────┘
                              │ Fallback (if Ollama unreachable)
                    ┌─────────────────────┐
                    │   Gemini Cloud API  │
                    │   (gemini-2.5-flash)│
                    └─────────────────────┘

Components

Component    Location                       Connection
Open WebUI   Docker (jorvis-local-webui)    http://host.docker.internal:11434
Ollama       Host (native)                  localhost:11434
jorvis-api   Docker                         Can use both Ollama and Gemini
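Because the same service may run on the host or inside a container, the Ollama base URL has to be resolved per runtime. A minimal sketch of that resolution; `JORVIS_RUNTIME` is an illustrative variable name, not part of the documented configuration, while `OLLAMA_BASE_URL` matches the Open WebUI setting below:

```typescript
// Sketch: resolve the Ollama base URL depending on where the caller runs.
// JORVIS_RUNTIME is a hypothetical env var; OLLAMA_BASE_URL mirrors the
// docker-compose setting used for Open WebUI.
function resolveOllamaUrl(env: Record<string, string | undefined>): string {
  // An explicit override always wins.
  if (env.OLLAMA_BASE_URL) return env.OLLAMA_BASE_URL;
  // Inside a container, the host's native Ollama is reachable via the
  // special DNS name host.docker.internal (mapped by extra_hosts).
  if (env.JORVIS_RUNTIME === 'docker') return 'http://host.docker.internal:11434';
  // On the host itself, talk to the native daemon directly.
  return 'http://localhost:11434';
}
```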

3. Configuration

3.1 Open WebUI Settings

# docker-compose.local.yml
services:
  jorvis-local-webui:
    environment:
      - OLLAMA_BASE_URL=http://host.docker.internal:11434
      - ENABLE_OLLAMA_API=true

3.2 Primary Model

# Run on HOST (not in Docker)
ollama pull gemma3:12b

Model Details:

  • Name: gemma3:12b
  • Size: ~8GB VRAM
  • Performance: Excellent on M4 Pro MAX with Metal
  • Use Case: General chat, code assistance, analytics queries

3.3 Fallback Models (GGUF)

Available local GGUF models for specialized tasks:

Model                                                 Size     Use Case
qwen3-4b-text2sql-4bit.gguf                           ~2.5GB   Text-to-SQL generation
Qwen3-4B-Instruct-2507-Q5_K_M.gguf                    ~3GB     General instruction following
gemma-3-4b-it-null-space-abliterated.i1-Q5_K_M.gguf   ~3GB     Uncensored responses
prem-1b-sql.Q6_K.gguf                                 ~1GB     Lightweight SQL generation

Location: $PROJECT_ROOT/artifacts/ (project artifacts directory)

Loading GGUF in Ollama:

# Create Modelfile (replace $PROJECT_ROOT with actual path)
echo "FROM $PROJECT_ROOT/artifacts/qwen3-4b-text2sql-4bit.gguf" > Modelfile
ollama create qwen3-text2sql -f Modelfile

3.4 Cloud Fallback

If native Ollama is unreachable, fall back to Gemini Cloud:

// Connection check (runnable sketch, Node 18+)
async function getInferenceEndpoint(): Promise<string> {
  try {
    // fetch has no `timeout` option; use an AbortSignal instead
    await fetch('http://localhost:11434/api/tags', { signal: AbortSignal.timeout(2000) });
    return 'http://localhost:11434';  // Use local Ollama
  } catch {
    return 'gemini';  // Fall back to Gemini Cloud
  }
}
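A sketch of how the probe above could feed an actual request. The `/api/generate` body matches the curl example in section 4.3; `callGemini` is a hypothetical placeholder for the real Gemini client, which is not shown here:

```typescript
// Mirrors the JSON body used with Ollama's /api/generate endpoint.
interface GenerateRequest { model: string; prompt: string; stream: boolean; }

function buildGenerateRequest(model: string, prompt: string): GenerateRequest {
  return { model, prompt, stream: false };  // non-streaming: one JSON response
}

async function infer(prompt: string): Promise<string> {
  const local = 'http://localhost:11434';
  let useLocal = true;
  try {
    // Probe the local daemon with a short timeout (same check as above).
    await fetch(`${local}/api/tags`, { signal: AbortSignal.timeout(2000) });
  } catch {
    useLocal = false;
  }
  if (!useLocal) {
    return callGemini(prompt);  // hypothetical cloud fallback path
  }
  const res = await fetch(`${local}/api/generate`, {
    method: 'POST',
    body: JSON.stringify(buildGenerateRequest('gemma3:12b', prompt)),
  });
  const data = await res.json();
  return data.response;  // Ollama's non-streaming reply carries the text in `response`
}

async function callGemini(prompt: string): Promise<string> {
  // Placeholder only; the real Gemini client is wired up elsewhere.
  throw new Error('Gemini fallback not implemented in this sketch');
}
```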

4. Implementation

4.1 Host Setup (One-time)

# 1. Verify Ollama is running
ollama --version  # Should show v0.14.3+

# 2. Pull primary model
ollama pull gemma3:12b

# 3. Test inference
ollama run gemma3:12b "Hello, what model are you?"

# 4. Verify Metal acceleration
# Check Activity Monitor → GPU History during inference

4.2 Docker Compose Update

# deploy/docker-compose.local.yml
services:
  jorvis-local-webui:
    image: ghcr.io/open-webui/open-webui:main
    environment:
      - OLLAMA_BASE_URL=http://host.docker.internal:11434
      - ENABLE_OLLAMA_API=true
      - DEFAULT_MODELS=gemma3:12b
    extra_hosts:
      - "host.docker.internal:host-gateway"

4.3 Verification Commands

# Test Ollama from Docker container
docker exec jorvis-local-webui curl http://host.docker.internal:11434/api/tags

# Test model availability
curl http://localhost:11434/api/tags | jq '.models[].name'

# Test inference
curl http://localhost:11434/api/generate -d '{
  "model": "gemma3:12b",
  "prompt": "What is 2+2?",
  "stream": false
}'
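The jq check can also be run programmatically from jorvis-api, for example as a startup assertion. A sketch (function names are illustrative; only the `/api/tags` endpoint and its `models[].name` shape come from the commands above):

```typescript
// Shape of Ollama's /api/tags response, reduced to what we need.
interface TagsResponse { models: { name: string }[]; }

// Pure projection, same as jq '.models[].name'; testable without a daemon.
function extractModelNames(tags: TagsResponse): string[] {
  return tags.models.map((m) => m.name);
}

async function assertModelAvailable(model: string): Promise<void> {
  const res = await fetch('http://localhost:11434/api/tags');
  const names = extractModelNames(await res.json());
  if (!names.includes(model)) {
    throw new Error(`Model ${model} not found; run: ollama pull ${model}`);
  }
}
```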

5. Consequences

Positive

  • Full Metal GPU acceleration — Native performance on Apple Silicon
  • 10-20x faster inference compared to Dockerized Ollama
  • Lower memory overhead — No Docker VM layer for inference
  • Flexibility — Easy to swap models, add GGUF files
  • Offline capability — Works without internet (after model download)

Negative

  • Requires host Ollama installation — Additional setup step
  • Port conflict potential — If host port 11434 is used by another service
  • Manual model management — Models not managed by Docker Compose

Mitigations

  • Document setup in docs/operations/LOCAL_RUNTIME.md
  • Add health check for Ollama in startup scripts
  • Implement automatic fallback to Gemini Cloud
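The startup health check mentioned above could poll the native daemon a few times before declaring it unreachable, so the API decides between local inference and the Gemini fallback once at boot. A sketch; the attempt count and delays are illustrative defaults, not documented values:

```typescript
const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Linear backoff: 1x, 2x, 3x... the base delay.
function backoffDelays(attempts: number, baseMs: number): number[] {
  return Array.from({ length: attempts }, (_, i) => baseMs * (i + 1));
}

async function waitForOllama(baseUrl: string, attempts = 3, baseMs = 1000): Promise<boolean> {
  for (const delay of backoffDelays(attempts, baseMs)) {
    try {
      const res = await fetch(`${baseUrl}/api/tags`, { signal: AbortSignal.timeout(2000) });
      if (res.ok) return true;  // daemon is up and answering
    } catch {
      // connection refused or timed out; wait and retry
    }
    await sleep(delay);
  }
  return false;  // caller should switch to the Gemini Cloud fallback
}
```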

6. Alternatives Considered

A. Dockerized Ollama

  • Rejected: CPU-only on Mac, 10-20x slower

B. llama.cpp directly

  • Rejected: More complex setup, Ollama provides better UX

C. Cloud-only (Gemini/OpenAI)

  • Rejected: Violates local-first principle, requires internet

D. LM Studio

  • Rejected: Less scriptable than Ollama, GUI-focused

7. References

None.
8. Approval

Status: Accepted — Approved by George (User) 2026-01-23


This ADR was created by Rovo (Architect) for Phase P local-first deployment.