ADR-0018: Voice Pipeline Local Architecture

Reviewed on 2026-03-06.

This accepted ADR is retained for local-debug and voice-architecture lineage. Do not treat it as the sole source of truth for the current runtime. Verify current operational behavior against:

docs/operations/DEPLOYMENT_RUNBOOK.md

docs/architecture/CURRENT_ARCHITECTURE.md

docs/FEATURE_MATRIX.md

Date: 2026-01-23
Status: Accepted
Author: Rovo (Architect Agent)
Context: Task-Voice debugging, Phase P (Production Deployment)

1. Context

The Voice Pipeline (STT/TTS) works correctly in Production (Cloud Run) but fails locally:

STT Issue: Returns 200 OK but with an empty transcription result
TTS Issue: Returns 500 Internal Server Error

Current Architecture Discrepancy

Component	Production	Local
Voice Backend	Dedicated `voice-gateway` container (Port 8787)	Embedded in `jorvis-api`
STT Model	`gemini-3-flash-preview`	`gemini-2.5-flash`
TTS Model	Gemini via gateway	`gemini-2.5-flash-preview-tts`
Protocol	WebSocket (bi-directional)	REST + WebSocket

Root Cause Hypothesis

Model Mismatch: Local uses an older STT model (gemini-2.5-flash, versus gemini-3-flash-preview in production), which may return a different API response format
Missing Gateway: Production's voice-gateway handles audio format conversion and error recovery that jorvis-api does not implement
Transcoding Issues: Local ffmpeg transcoding may produce an audio format that is incompatible with the Gemini API
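
One way to test the transcoding hypothesis is to compare the mime_type sent to Gemini with what the transcoded bytes actually contain. The sketch below guesses a MIME type from well-known container signatures. The function names are hypothetical and do not exist in jorvis-api; this is a debugging aid under those assumptions, not the adapter's real logic.

```typescript
// Hypothetical debugging helper (not part of jorvis-api): guess an audio MIME
// type from a buffer's leading bytes using well-known container signatures.
function hasSignature(bytes: Uint8Array, signature: number[], offset = 0): boolean {
  if (bytes.length < offset + signature.length) return false;
  return signature.every((value, i) => bytes[offset + i] === value);
}

function sniffAudioMime(bytes: Uint8Array): string | null {
  const RIFF = [0x52, 0x49, 0x46, 0x46]; // "RIFF"
  const WAVE = [0x57, 0x41, 0x56, 0x45]; // "WAVE" at byte offset 8
  if (hasSignature(bytes, RIFF) && hasSignature(bytes, WAVE, 8)) return "audio/wav";
  if (hasSignature(bytes, [0x4f, 0x67, 0x67, 0x53])) return "audio/ogg"; // "OggS"
  if (hasSignature(bytes, [0x66, 0x4c, 0x61, 0x43])) return "audio/flac"; // "fLaC"
  if (hasSignature(bytes, [0x1a, 0x45, 0xdf, 0xa3])) return "audio/webm"; // EBML (WebM/Matroska)
  if (hasSignature(bytes, [0x49, 0x44, 0x33])) return "audio/mpeg"; // "ID3" tag
  // Bare MPEG audio frame: 11 sync bits set and a non-zero layer field.
  // The layer check keeps AAC ADTS streams (layer bits 00) from matching.
  const second = bytes[1] ?? 0;
  if (bytes.length >= 2 && bytes[0] === 0xff && (second & 0xe0) === 0xe0 && (second & 0x06) !== 0) {
    return "audio/mpeg";
  }
  return null; // Unknown: log it rather than guessing a mime_type.
}
```

Logging the sniffed type next to the declared mime_type for each request would show quickly whether the local ffmpeg step and the adapter disagree.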

2. Decision

Selected Option: C — Hybrid Architecture

We will adopt a hybrid approach:

A. REST Endpoints (STT/TTS) — Fix in `jorvis-api`

For the OpenAI-compatible REST API (/v1/audio/transcriptions, /v1/audio/speech):

Action: Debug and fix existing Gemini adapters in analytics-platform/src/voice/adapters/
Rationale: These endpoints are simpler, easier to debug, and sufficient for Open WebUI integration
Models: Align with production: gemini-3-flash-preview (STT), gemini-2.5-flash-preview-tts (TTS)

B. WebSocket/Live Voice — Use `voice-gateway` (Optional)

For real-time bidirectional voice (Gemini Live):

Action: Add voice-gateway to docker-compose.local.yml as an optional service
Rationale: Complex audio streaming benefits from a dedicated microservice (per ADR-0017)
Activation: docker compose --profile voice up

C. Configuration Alignment

Update docker-compose.local.yml environment variables:

# Align with production models
- GEMINI_STT_MODEL=gemini-3-flash-preview
- GEMINI_TTS_MODEL=gemini-2.5-flash-preview-tts
- GEMINI_LIVE_MODEL=gemini-2.5-flash-native-audio-preview-12-2025
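
To make a missing or misspelled variable fail loudly at startup instead of surfacing later as an empty transcript or a 500, the service could validate the three variables above when it boots. The following is a minimal sketch: the variable names come from this ADR, while the function, type names, and error message are hypothetical.

```typescript
// Sketch only: variable names are taken from the ADR; everything else here
// (types, function names, error text) is hypothetical.
type VoiceEnv = Record<string, string | undefined>;

interface VoiceModelConfig {
  sttModel: string;
  ttsModel: string;
  liveModel: string;
}

// Reads one variable, trims it, and records the key if it is unset or blank.
function requireVar(env: VoiceEnv, key: string, missing: string[]): string {
  const value = (env[key] ?? "").trim();
  if (value === "") missing.push(key);
  return value;
}

function loadVoiceModelConfig(env: VoiceEnv): VoiceModelConfig {
  const missing: string[] = [];
  const config: VoiceModelConfig = {
    sttModel: requireVar(env, "GEMINI_STT_MODEL", missing),
    ttsModel: requireVar(env, "GEMINI_TTS_MODEL", missing),
    liveModel: requireVar(env, "GEMINI_LIVE_MODEL", missing),
  };
  // Report every missing variable at once rather than failing on the first.
  if (missing.length > 0) {
    throw new Error(`Missing voice model variables: ${missing.join(", ")}`);
  }
  return config;
}
```

In a Node.js service this would typically be called once with `process.env`. Taking the environment as a parameter keeps the function easy to test.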

3. Implementation Plan

Phase 1: Debug REST Endpoints (Task-Voice)

Add detailed logging to GeminiSttAdapter and GeminiTtsAdapter
Verify API request/response format against Gemini documentation
Test with curl to isolate Open WebUI vs backend issues
Fix identified issues (likely mime_type or response parsing)
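
The "200 OK but empty transcription" symptom suggests the response-parsing step is worth instrumenting first. The sketch below extracts text from a Gemini generateContent-style response and returns a reason when no text is present, so an empty result is logged instead of silently passed through. The interfaces cover only the fields used here, and the helper name and result shape are hypothetical, not the actual GeminiSttAdapter code.

```typescript
// Minimal subset of a Gemini generateContent response; fields not needed
// for transcript extraction are omitted.
interface GeminiPart { text?: string }
interface GeminiCandidate { content?: { parts?: GeminiPart[] }; finishReason?: string }
interface GeminiResponse {
  candidates?: GeminiCandidate[];
  promptFeedback?: { blockReason?: string };
}

type TranscriptResult =
  | { ok: true; text: string }
  | { ok: false; reason: string };

// Hypothetical helper: returns the joined text parts, or a reason explaining
// why the transcript is empty.
function extractTranscript(response: GeminiResponse): TranscriptResult {
  const blockReason = response.promptFeedback?.blockReason;
  if (blockReason) return { ok: false, reason: `prompt blocked: ${blockReason}` };

  const candidate = (response.candidates ?? [])[0];
  if (!candidate) return { ok: false, reason: "no candidates in response" };

  const text = (candidate.content?.parts ?? [])
    .map((part) => part.text ?? "")
    .join("")
    .trim();
  if (text === "") {
    const finish = candidate.finishReason ?? "unknown";
    return { ok: false, reason: `empty text (finishReason: ${finish})` };
  }
  return { ok: true, text };
}
```

Logging the `reason` string alongside the request's mime_type separates "the model returned nothing" from "the adapter read the wrong field".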

Phase 2: Configuration Alignment

Update docker-compose.local.yml with production model versions
Test STT/TTS with aligned configuration
Document working configuration in docs/operations/LOCAL_RUNTIME.md

Phase 3: Optional Gateway (Future)

Add voice-gateway service definition to docker-compose.local.yml
Use Docker Compose profiles for optional activation
Test WebSocket voice flow end-to-end

4. Consequences

Positive

Faster debugging cycle (REST is easier to test than WebSocket)
Reduced local resource usage (no extra container by default)
Local model configuration matches production (same STT, TTS, and Live model names)
Maintains ADR-0017 dual-mode strategy

Negative

Local lacks the full voice-gateway feature set unless the voice profile is enabled
Local and production configuration can still drift apart, because each environment is configured in its own file

Mitigations

Document required environment variables clearly
Add health check endpoint for voice services
Consider future unification of voice backends
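
Because voice-gateway is optional locally, a health check for voice services probably should not report failure just because the gateway is not running. The sketch below shows one possible aggregation rule: a failing required component marks the service "down", while a failing optional component only marks it "degraded". The component names, status values, and payload shape are illustrative assumptions, not an existing jorvis-api contract.

```typescript
// Hypothetical health summary for voice services; names and shape are
// illustrative only.
type OverallStatus = "ok" | "degraded" | "down";

interface VoiceComponentCheck {
  component: string;
  healthy: boolean;
  optional: boolean; // e.g. voice-gateway when the voice profile is off
}

interface VoiceHealthSummary {
  status: OverallStatus;
  failing: string[];
}

function summarizeVoiceHealth(checks: VoiceComponentCheck[]): VoiceHealthSummary {
  const failing = checks.filter((check) => !check.healthy);
  let status: OverallStatus = "ok";
  if (failing.some((check) => !check.optional)) {
    status = "down"; // at least one required component is unhealthy
  } else if (failing.length > 0) {
    status = "degraded"; // only optional components are unhealthy
  }
  return { status, failing: failing.map((check) => check.component) };
}
```

With this rule, a default local setup without the gateway would report "degraded" with voice-gateway listed, which distinguishes it from a broken STT or TTS adapter.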

5. Technical Details

Files to Modify

analytics-platform/src/voice/adapters/gemini-stt.adapter.ts — Debug/fix transcription
analytics-platform/src/voice/adapters/gemini-tts.adapter.ts — Debug/fix synthesis
deploy/docker-compose.local.yml — Align model configuration
docs/operations/LOCAL_RUNTIME.md — Document voice setup

Debug Commands

# Test STT directly
curl -X POST http://localhost:3000/api/v1/audio/transcriptions \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -F "file=@test.wav" \
  -F "model=whisper-1"

# Test TTS directly
curl -X POST http://localhost:3000/api/v1/audio/speech \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model":"tts-1","input":"Hello world","voice":"nova"}' \
  --output test.mp3

6. Implementation Status

Phase	Description	Status	Completed By	Date
Phase 1	Debug REST Endpoints (STT/TTS)	✅ DONE	ANT (Executor)	2026-01-23
Phase 2	Configuration Alignment	✅ DONE	ANT (Executor)	2026-01-23
Phase 3	Optional Gateway Integration	✅ DONE	ANT (Executor)	2026-01-23

Summary

All phases of ADR-0018 have been implemented. The Voice Pipeline now works correctly in the local OrbStack environment using the hybrid architecture.

7. References

ADR-0017: CheckEye Adoption (Voice Gateway dual-mode strategy)
docs/architecture/VOICE_PLATFORM.md — Voice architecture overview
docs/agent_ops/OUTBOX/task_voice_debug_evidence.md — Debug evidence
Production config: deploy/cloud-run-combined.yaml

8. Approval

Status: Accepted — Approved by George (User) 2026-01-23

This ADR was created by Rovo (Architect) as part of Task-Voice debugging.