ADR-0018: Voice Pipeline Local Architecture

Reviewed on 2026-03-06.

This accepted ADR is retained for local-debug and voice-architecture lineage. Do not treat it as the sole source of current runtime truth. Current operational truth should be verified against:

  • docs/operations/DEPLOYMENT_RUNBOOK.md
  • docs/architecture/CURRENT_ARCHITECTURE.md
  • docs/FEATURE_MATRIX.md

Date: 2026-01-23
Status: Accepted
Author: Rovo (Architect Agent)
Context: Task-Voice debugging, Phase P (Production Deployment)

1. Context

The Voice Pipeline (STT/TTS) works correctly in Production (Cloud Run) but fails locally:

  • STT Issue: Returns 200 OK but empty transcription result
  • TTS Issue: Returns 500 Internal Server Error

Current Architecture Discrepancy

| Component | Production | Local |
|---|---|---|
| Voice Backend | Dedicated voice-gateway container (Port 8787) | Embedded in jorvis-api |
| STT Model | gemini-3-flash-preview | gemini-2.5-flash |
| TTS Model | Gemini via gateway | gemini-2.5-flash-preview-tts |
| Protocol | WebSocket (bi-directional) | REST + WebSocket |

Root Cause Hypothesis

  1. Model Mismatch: Local uses an older STT model (gemini-2.5-flash) than production (gemini-3-flash-preview), and the two may differ in API response format
  2. Missing Gateway: Production's voice-gateway handles audio format conversion and error recovery that jorvis-api doesn't implement
  3. Transcoding Issues: Local ffmpeg transcoding may produce an audio format that is incompatible with the Gemini API
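
To probe the transcoding hypothesis, it can help to inspect what ffmpeg actually wrote before the audio reaches the Gemini API. The sketch below assumes the transcoded output is a RIFF/WAVE file, which this ADR does not confirm. The helper name readWavFormat is hypothetical and is not part of jorvis-api. It only reports the container's declared format so it can be compared against what the API accepts.

```typescript
// Hypothetical debugging helper; assumes the transcoded audio is RIFF/WAVE.
interface WavFormat {
  audioFormat: number; // 1 means uncompressed PCM
  channels: number;
  sampleRate: number;
  bitsPerSample: number;
}

function readWavFormat(bytes: Uint8Array): WavFormat | null {
  if (bytes.length < 12) return null;
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);

  // Read a four-character chunk tag at the given byte offset.
  const tag = (offset: number): string =>
    String.fromCharCode(
      view.getUint8(offset),
      view.getUint8(offset + 1),
      view.getUint8(offset + 2),
      view.getUint8(offset + 3),
    );

  if (tag(0) !== "RIFF" || tag(8) !== "WAVE") return null;

  // Walk the chunk list until the "fmt " chunk is found.
  let offset = 12;
  while (offset + 8 <= bytes.length) {
    const chunkId = tag(offset);
    const chunkSize = view.getUint32(offset + 4, true);
    if (chunkId === "fmt ") {
      if (chunkSize < 16 || offset + 8 + 16 > bytes.length) return null;
      return {
        audioFormat: view.getUint16(offset + 8, true),
        channels: view.getUint16(offset + 10, true),
        sampleRate: view.getUint32(offset + 12, true),
        bitsPerSample: view.getUint16(offset + 22, true),
      };
    }
    // Chunk bodies are padded to an even number of bytes.
    offset += 8 + chunkSize + (chunkSize % 2);
  }
  return null;
}
```

A null result means the buffer is not a WAV file at all, which is itself a useful signal when the adapter declares a WAV mime type.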

2. Decision

Selected Option: C — Hybrid Architecture

We will adopt a hybrid approach:

A. REST Endpoints (STT/TTS) — Fix in jorvis-api

For the OpenAI-compatible REST API (/v1/audio/transcriptions, /v1/audio/speech):

  • Action: Debug and fix existing Gemini adapters in analytics-platform/src/voice/adapters/
  • Rationale: These endpoints are simpler, easier to debug, and sufficient for Open WebUI integration
  • Models: Align with production: gemini-3-flash-preview (STT), gemini-2.5-flash-preview-tts (TTS)

B. WebSocket/Live Voice — Use voice-gateway (Optional)

For real-time bidirectional voice (Gemini Live):

  • Action: Add voice-gateway to docker-compose.local.yml as an optional service
  • Rationale: Complex audio streaming benefits from a dedicated microservice (per ADR-0017)
  • Activation: docker compose --profile voice up

C. Configuration Alignment

Update docker-compose.local.yml environment variables:

# Align with production models
- GEMINI_STT_MODEL=gemini-3-flash-preview
- GEMINI_TTS_MODEL=gemini-2.5-flash-preview-tts
- GEMINI_LIVE_MODEL=gemini-2.5-flash-native-audio-preview-12-2025
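
One way to guard against the local configuration drifting from these values is a small startup check. The following is a minimal sketch, not the project's actual code: the variable names and model IDs come from the list above, while findModelMismatches and its types are hypothetical names introduced for illustration.

```typescript
// Expected values copied from the configuration above.
const EXPECTED_MODELS: Record<string, string> = {
  GEMINI_STT_MODEL: "gemini-3-flash-preview",
  GEMINI_TTS_MODEL: "gemini-2.5-flash-preview-tts",
  GEMINI_LIVE_MODEL: "gemini-2.5-flash-native-audio-preview-12-2025",
};

interface ModelMismatch {
  variable: string;
  expected: string;
  actual: string | undefined; // undefined when the variable is not set
}

// Hypothetical helper: lists every variable that is unset or differs
// from the expected production-aligned value.
function findModelMismatches(
  env: Record<string, string | undefined>,
): ModelMismatch[] {
  const mismatches: ModelMismatch[] = [];
  for (const [variable, expected] of Object.entries(EXPECTED_MODELS)) {
    const actual = env[variable];
    if (actual !== expected) {
      mismatches.push({ variable, expected, actual });
    }
  }
  return mismatches;
}
```

In a Node.js service this could be called as findModelMismatches(process.env) at startup, logging a warning for each entry returned.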

3. Implementation Plan

Phase 1: Debug REST Endpoints (Task-Voice)

  1. Add detailed logging to GeminiSttAdapter and GeminiTtsAdapter
  2. Verify API request/response format against Gemini documentation
  3. Test with curl to isolate Open WebUI vs backend issues
  4. Fix identified issues (likely mime_type or response parsing)
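
Because the STT symptom is a 200 OK with an empty transcription, the response-parsing step is worth logging in a way that says why the text is empty. The sketch below assumes the adapter receives the standard Gemini generateContent response shape (candidates, then content.parts, then text). The function extractTranscript and the diagnosis labels are hypothetical and do not reflect the existing GeminiSttAdapter code, which this ADR does not show.

```typescript
// Minimal subset of the Gemini generateContent response used here.
interface GeminiPart {
  text?: string;
}
interface GeminiCandidate {
  content?: { parts?: GeminiPart[] };
  finishReason?: string;
}
interface GeminiResponse {
  candidates?: GeminiCandidate[];
}

interface TranscriptResult {
  text: string;
  diagnosis: "ok" | "no_candidates" | "no_text_parts" | "empty_text";
}

// Hypothetical helper: returns the transcript plus a label explaining
// why it is empty, so a log line can distinguish the failure modes.
function extractTranscript(response: GeminiResponse): TranscriptResult {
  const candidates = response.candidates ?? [];
  if (candidates.length === 0) {
    return { text: "", diagnosis: "no_candidates" };
  }
  const parts = candidates[0]?.content?.parts ?? [];
  const textParts = parts.filter((part) => typeof part.text === "string");
  if (textParts.length === 0) {
    return { text: "", diagnosis: "no_text_parts" };
  }
  const text = textParts.map((part) => part.text ?? "").join("").trim();
  return { text, diagnosis: text.length > 0 ? "ok" : "empty_text" };
}
```

Logging the diagnosis alongside the request's mime type should help separate a parsing bug (text present but not read) from an input problem (the model returned nothing for the audio it was given).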

Phase 2: Configuration Alignment

  1. Update docker-compose.local.yml with production model versions
  2. Test STT/TTS with aligned configuration
  3. Document working configuration in docs/operations/LOCAL_RUNTIME.md

Phase 3: Optional Gateway (Future)

  1. Add voice-gateway service definition to docker-compose.local.yml
  2. Use Docker Compose profiles for optional activation
  3. Test WebSocket voice flow end-to-end

4. Consequences

Positive

  • Faster debugging cycle (REST is easier to test than WebSocket)
  • Reduced local resource usage (no extra container by default)
  • Parity with production model configuration
  • Maintains ADR-0017 dual-mode strategy

Negative

  • Local won't have full voice-gateway features by default
  • Potential for configuration drift between local/prod

Mitigations

  • Document required environment variables clearly
  • Add health check endpoint for voice services
  • Consider future unification of voice backends

5. Technical Details

Files to Modify

  • analytics-platform/src/voice/adapters/gemini-stt.adapter.ts — Debug/fix transcription
  • analytics-platform/src/voice/adapters/gemini-tts.adapter.ts — Debug/fix synthesis
  • deploy/docker-compose.local.yml — Align model configuration
  • docs/operations/LOCAL_RUNTIME.md — Document voice setup

Debug Commands

# Test STT directly
curl -X POST http://localhost:3000/api/v1/audio/transcriptions \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -F "file=@test.wav" \
  -F "model=whisper-1"

# Test TTS directly
curl -X POST http://localhost:3000/api/v1/audio/speech \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model":"tts-1","input":"Hello world","voice":"nova"}' \
  --output test.mp3

6. Implementation Status

| Phase | Description | Status | Completed By | Date |
|---|---|---|---|---|
| Phase 1 | Debug REST Endpoints (STT/TTS) | ✅ DONE | ANT (Executor) | 2026-01-23 |
| Phase 2 | Configuration Alignment | ✅ DONE | ANT (Executor) | 2026-01-23 |
| Phase 3 | Optional Gateway Integration | ✅ DONE | ANT (Executor) | 2026-01-23 |

Summary

All phases of ADR-0018 have been successfully implemented. The Voice Pipeline now works correctly in the local OrbStack environment with the hybrid architecture approach.

7. References

  • ADR-0017: CheckEye Adoption (Voice Gateway dual-mode strategy)
  • docs/architecture/VOICE_PLATFORM.md — Voice architecture overview
  • docs/agent_ops/OUTBOX/task_voice_debug_evidence.md — Debug evidence
  • Production config: deploy/cloud-run-combined.yaml

8. Approval

Status: Accepted — Approved by George (User) 2026-01-23


This ADR was created by Rovo (Architect) as part of Task-Voice debugging.