ADR-0018: Voice Pipeline Local Architecture
Reviewed on 2026-03-06.
This accepted ADR is retained for local-debug and voice-architecture lineage. Do not treat it as the sole source of current runtime truth. Current operational truth should be verified against:
- docs/operations/DEPLOYMENT_RUNBOOK.md
- docs/architecture/CURRENT_ARCHITECTURE.md
- docs/FEATURE_MATRIX.md
Date: 2026-01-23
Status: Accepted
Author: Rovo (Architect Agent)
Context: Task-Voice debugging, Phase P (Production Deployment)
1. Context
The Voice Pipeline (STT/TTS) works correctly in Production (Cloud Run) but fails locally:
- STT Issue: Returns 200 OK but empty transcription result
- TTS Issue: Returns 500 Internal Server Error
Current Architecture Discrepancy
| Component | Production | Local |
|---|---|---|
| Voice Backend | Dedicated voice-gateway container (Port 8787) | Embedded in jorvis-api |
| STT Model | gemini-3-flash-preview | gemini-2.5-flash |
| TTS Model | Gemini via gateway | gemini-2.5-flash-preview-tts |
| Protocol | WebSocket (bi-directional) | REST + WebSocket |
Root Cause Hypothesis
- Model Mismatch: Local uses older model versions that may have different API response formats
- Missing Gateway: Production's voice-gateway handles audio format conversion and error recovery that jorvis-api doesn't implement
- Transcoding Issues: Local ffmpeg transcoding may produce incompatible audio format for Gemini API
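The transcoding hypothesis can be checked before any request reaches Gemini by asking whether the uploaded audio is already in a format the API accepts. The sketch below is illustrative only: the helper names (`inferAudioMimeType`, `needsTranscoding`) and the extension table are hypothetical and not taken from jorvis-api, and the list of accepted MIME types is an assumption based on the public Gemini audio documentation that should be re-verified.

```typescript
// Audio MIME types assumed to be accepted as Gemini input.
// Assumption: taken from the public Gemini audio docs; verify before relying on it.
const GEMINI_AUDIO_MIME_TYPES: ReadonlySet<string> = new Set([
  "audio/wav",
  "audio/mp3",
  "audio/aiff",
  "audio/aac",
  "audio/ogg",
  "audio/flac",
]);

// Hypothetical extension table for uploads that arrive without a usable Content-Type.
const EXTENSION_TO_MIME: Readonly<Record<string, string>> = {
  wav: "audio/wav",
  mp3: "audio/mp3",
  aiff: "audio/aiff",
  aac: "audio/aac",
  ogg: "audio/ogg",
  flac: "audio/flac",
  webm: "audio/webm",
  m4a: "audio/mp4",
};

// Returns the MIME type implied by the file extension, or undefined if unknown.
function inferAudioMimeType(filename: string): string | undefined {
  const dot = filename.lastIndexOf(".");
  if (dot < 0) return undefined;
  const ext = filename.slice(dot + 1).toLowerCase();
  return EXTENSION_TO_MIME[ext];
}

// True when the audio should be transcoded (for example with ffmpeg) before upload.
function needsTranscoding(mimeType: string): boolean {
  // Drop parameters such as "; codecs=opus" and normalise case before comparing.
  const base = (mimeType.split(";")[0] ?? "").trim().toLowerCase();
  return !GEMINI_AUDIO_MIME_TYPES.has(base);
}
```

Logging the inferred type and the transcoding decision next to each STT request makes it easier to tell a format problem apart from a model or parsing problem.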
2. Decision
Selected Option: C — Hybrid Architecture
We will adopt a hybrid approach:
A. REST Endpoints (STT/TTS) — Fix in jorvis-api
For the OpenAI-compatible REST API (/v1/audio/transcriptions, /v1/audio/speech):
- Action: Debug and fix existing Gemini adapters in analytics-platform/src/voice/adapters/
- Rationale: These endpoints are simpler, easier to debug, and sufficient for Open WebUI integration
- Models: Align with production: gemini-3-flash-preview (STT), gemini-2.5-flash-preview-tts (TTS)
B. WebSocket/Live Voice — Use voice-gateway (Optional)
For real-time bidirectional voice (Gemini Live):
- Action: Add voice-gateway to docker-compose.local.yml as optional service
- Rationale: Complex audio streaming benefits from dedicated microservice (per ADR-0017)
- Activation: docker compose --profile voice up
C. Configuration Alignment
Update docker-compose.local.yml environment variables:
# Align with production models
- GEMINI_STT_MODEL=gemini-3-flash-preview
- GEMINI_TTS_MODEL=gemini-2.5-flash-preview-tts
- GEMINI_LIVE_MODEL=gemini-2.5-flash-native-audio-preview-12-2025
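Because local and production configuration can drift apart again after this alignment, a small startup check can compare the environment against the expected model names. This is a minimal sketch, not existing jorvis-api code: `findModelDrift` and `ModelDrift` are hypothetical names, and the expected values are copied from the environment block above.

```typescript
type VoiceModelKey = "GEMINI_STT_MODEL" | "GEMINI_TTS_MODEL" | "GEMINI_LIVE_MODEL";

// Expected values, copied from the docker-compose.local.yml settings in this ADR.
const EXPECTED_MODELS: Readonly<Record<VoiceModelKey, string>> = {
  GEMINI_STT_MODEL: "gemini-3-flash-preview",
  GEMINI_TTS_MODEL: "gemini-2.5-flash-preview-tts",
  GEMINI_LIVE_MODEL: "gemini-2.5-flash-native-audio-preview-12-2025",
};

interface ModelDrift {
  key: VoiceModelKey;
  expected: string;
  actual: string | undefined; // undefined means the variable is not set at all
}

// Returns one entry per variable whose value is missing or differs from the expected one.
function findModelDrift(env: Readonly<Record<string, string | undefined>>): ModelDrift[] {
  const drift: ModelDrift[] = [];
  for (const key of Object.keys(EXPECTED_MODELS) as VoiceModelKey[]) {
    const expected = EXPECTED_MODELS[key];
    const actual = env[key];
    if (actual !== expected) {
      drift.push({ key, expected, actual });
    }
  }
  return drift;
}
```

In a Node.js service the function could be called with `process.env` at boot and the result logged as a warning, so that a stale model name shows up in the logs instead of as an empty transcription.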
3. Implementation Plan
Phase 1: Debug REST Endpoints (Task-Voice)
- Add detailed logging to GeminiSttAdapter and GeminiTtsAdapter
- Verify API request/response format against Gemini documentation
- Test with curl to isolate Open WebUI vs backend issues
- Fix identified issues (likely mime_type or response parsing)
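One way a 200 OK with an empty transcription can arise is response parsing that silently falls back to an empty string. The sketch below shows a stricter parser that fails loudly instead. It assumes the adapter receives the JSON body of a Gemini `generateContent` call, where text is found under `candidates[].content.parts[].text`. The interface and function names are hypothetical and are not the actual `GeminiSttAdapter` code.

```typescript
// Minimal subset of the generateContent response body needed for transcription.
interface GeminiPart {
  text?: string;
}

interface GeminiCandidate {
  content?: { parts?: GeminiPart[] };
  finishReason?: string;
}

interface GeminiGenerateContentResponse {
  candidates?: GeminiCandidate[];
}

// Concatenates all text parts of the first candidate.
// Throws instead of returning "" so an empty result becomes visible upstream.
function extractTranscript(response: GeminiGenerateContentResponse): string {
  const candidate = response.candidates?.[0];
  if (!candidate) {
    throw new Error("Gemini response contained no candidates");
  }
  const text = (candidate.content?.parts ?? [])
    .map((part) => part.text ?? "")
    .join("")
    .trim();
  if (text.length === 0) {
    const reason = candidate.finishReason ?? "unknown";
    throw new Error(`Gemini returned an empty transcript (finishReason: ${reason})`);
  }
  return text;
}
```

Including the `finishReason` in the error message helps distinguish a blocked or truncated generation from a request that carried unusable audio.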
Phase 2: Configuration Alignment
- Update docker-compose.local.yml with production model versions
- Test STT/TTS with aligned configuration
- Document working configuration in docs/operations/LOCAL_RUNTIME.md
Phase 3: Optional Gateway (Future)
- Add voice-gateway service definition to docker-compose.local.yml
- Use Docker Compose profiles for optional activation
- Test WebSocket voice flow end-to-end
4. Consequences
Positive
- Faster debugging cycle (REST is easier to test than WebSocket)
- Reduced local resource usage (no extra container by default)
- Closer parity with production configuration (same STT/TTS model versions)
- Maintains ADR-0017 dual-mode strategy
Negative
- Local won't have full voice-gateway features by default
- Potential for configuration drift between local/prod
Mitigations
- Document required environment variables clearly
- Add health check endpoint for voice services
- Consider future unification of voice backends
5. Technical Details
Files to Modify
- analytics-platform/src/voice/adapters/gemini-stt.adapter.ts: Debug/fix transcription
- analytics-platform/src/voice/adapters/gemini-tts.adapter.ts: Debug/fix synthesis
- deploy/docker-compose.local.yml: Align model configuration
- docs/operations/LOCAL_RUNTIME.md: Document voice setup
Debug Commands
# Test STT directly
curl -X POST http://localhost:3000/api/v1/audio/transcriptions \
-H "Authorization: Bearer $OPENAI_API_KEY" \
-F "file=@test.wav" \
-F "model=whisper-1"
# Test TTS directly
curl -X POST http://localhost:3000/api/v1/audio/speech \
-H "Authorization: Bearer $OPENAI_API_KEY" \
-H "Content-Type: application/json" \
-d '{"model":"tts-1","input":"Hello world","voice":"nova"}' \
--output test.mp3
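The TTS command writes whatever body the server returns into test.mp3, so a failed request can leave a JSON error payload in a file that looks like audio. A quick check of the leading bytes tells the two apart. The helper below is a heuristic sketch with hypothetical names (`sniffAudioContainer`, `startsWithAscii`), not part of the existing adapters, and it only recognises a few common containers.

```typescript
type AudioContainer = "wav" | "mp3" | "ogg" | "flac" | "unknown";

// True when the bytes at `offset` spell out the ASCII tag.
function startsWithAscii(bytes: Uint8Array, offset: number, tag: string): boolean {
  if (bytes.length < offset + tag.length) return false;
  for (let i = 0; i < tag.length; i++) {
    if (bytes[offset + i] !== tag.charCodeAt(i)) return false;
  }
  return true;
}

// Guesses the audio container from well-known magic bytes at the start of the data.
function sniffAudioContainer(bytes: Uint8Array): AudioContainer {
  // WAV: "RIFF" <4-byte size> "WAVE"
  if (startsWithAscii(bytes, 0, "RIFF") && startsWithAscii(bytes, 8, "WAVE")) return "wav";
  if (startsWithAscii(bytes, 0, "OggS")) return "ogg";
  if (startsWithAscii(bytes, 0, "fLaC")) return "flac";
  // MP3 files often start with an ID3v2 tag.
  if (startsWithAscii(bytes, 0, "ID3")) return "mp3";
  // Otherwise look for the 11-bit MPEG audio frame sync.
  // Heuristic: this pattern can also match other MPEG audio framings such as ADTS AAC.
  if (bytes.length >= 2 && bytes[0] === 0xff && ((bytes[1] ?? 0) & 0xe0) === 0xe0) return "mp3";
  return "unknown";
}
```

A result of "unknown" for the saved file suggests opening it as text, since it most likely contains the error response from the 500.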
6. Implementation Status
| Phase | Description | Status | Completed By | Date |
|---|---|---|---|---|
| Phase 1 | Debug REST Endpoints (STT/TTS) | ✅ DONE | ANT (Executor) | 2026-01-23 |
| Phase 2 | Configuration Alignment | ✅ DONE | ANT (Executor) | 2026-01-23 |
| Phase 3 | Optional Gateway Integration | ✅ DONE | ANT (Executor) | 2026-01-23 |
Summary
All phases of ADR-0018 have been successfully implemented. The Voice Pipeline now works correctly in the local OrbStack environment with the hybrid architecture approach.
7. References
- ADR-0017: CheckEye Adoption (Voice Gateway dual-mode strategy)
- docs/architecture/VOICE_PLATFORM.md: Voice architecture overview
- docs/agent_ops/OUTBOX/task_voice_debug_evidence.md: Debug evidence
- Production config: deploy/cloud-run-combined.yaml
8. Approval
Status: Accepted — Approved by George (User) 2026-01-23
This ADR was created by Rovo (Architect) as part of Task-Voice debugging.