Voice Platform Architecture

Version: 1.1
Status: Implemented (Core)
Component: Voice Gateway
Source: src/voice/

1. Overview

The Voice Platform enables real-time, bidirectional voice interaction with Jorvis. It supports both WebSocket (server-side) and WebRTC (client-side) flows: the RealtimeGateway manages the audio streams, and the AudioTranscoderService handles format conversion.

2. Core Components

2.1 Realtime Gateway (`src/voice/realtime.gateway.ts`)

Protocol: WebSocket (/v1/realtime)
Responsibilities:
- Manages persistent connection.
- Handles audio_chunk events (binary).
- Emits transcript and audio_response events.

2.2 Gemini Live Service (`src/voice/gemini-live.service.ts`)

Role: Direct integration with Google Gemini Live API (Multimodal).
Flow:
- Streams user audio chunks directly to Gemini.
- Receives streaming text/audio response chunks.
- Keeps response latency low (<500 ms).

2.3 Audio Transcoder (`src/voice/audio-transcoder.service.ts`)

Input: WebM / Ogg Opus (Browser Default).
Output: Linear16 PCM 24kHz (Required by Gemini).
Library: ffmpeg / prism-media.
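
As an illustration of the conversion above, the following sketch builds an ffmpeg argument list that decodes a WebM or Ogg Opus stream from stdin and writes raw 16-bit PCM at 24 kHz to stdout. The names `PcmTarget`, `buildFfmpegArgs`, and `pcmBytesPerSecond`, the mono channel count, and the stdin/stdout piping are assumptions made for illustration; the actual AudioTranscoderService may be wired differently.

```typescript
// Hypothetical target description; mono is an assumption, not stated in this document.
interface PcmTarget {
  sampleRateHz: number; // 24000 per the Output line above
  channels: number;     // assumed 1 (mono)
}

// Builds ffmpeg arguments for WebM/Ogg Opus -> Linear16 PCM over pipes.
function buildFfmpegArgs(target: PcmTarget): string[] {
  return [
    "-hide_banner",
    "-loglevel", "error",
    "-i", "pipe:0",          // read the container stream from stdin
    "-vn",                   // drop any video track
    "-acodec", "pcm_s16le",  // 16-bit signed little-endian samples (Linear16)
    "-ar", String(target.sampleRateHz),
    "-ac", String(target.channels),
    "-f", "s16le",           // raw PCM output, no WAV header
    "pipe:1",                // write PCM to stdout
  ];
}

// Bytes of PCM produced per second of audio, useful for sizing buffers.
function pcmBytesPerSecond(target: PcmTarget): number {
  const bytesPerSample = 2; // 16 bits per sample
  return target.sampleRateHz * target.channels * bytesPerSample;
}
```

The argument list could be passed to Node's `child_process.spawn("ffmpeg", args)`, with browser chunks written to the child's stdin and PCM read from its stdout.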

2.4 Voice Intent Router (`src/voice/voice-intent-router.service.ts`)

Role: Classifies incoming transcripts and routes each request to either the Conversational or the OpenClaw execution path.
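
The routing decision can be pictured with a deliberately simple stand-in. This document does not describe how the router classifies transcripts, so the keyword rule below, the `ACTION_VERBS` list, and the names `VoiceRoute` and `routeTranscript` are all hypothetical; a real classifier is likely more sophisticated.

```typescript
type VoiceRoute = "conversational" | "openclaw";

// Hypothetical verb list; the real classification criteria are not documented here.
const ACTION_VERBS = new Set(["run", "open", "create", "delete", "send", "schedule"]);

// Routes to the OpenClaw path when the utterance starts with an action verb
// (optionally preceded by "please"); everything else stays conversational.
function routeTranscript(transcript: string): VoiceRoute {
  const words = transcript.toLowerCase().match(/[a-z']+/g) ?? [];
  const first = words[0] === "please" ? words[1] : words[0];
  return first !== undefined && ACTION_VERBS.has(first) ? "openclaw" : "conversational";
}
```

Under this rule, "Please open the quarterly report" goes to the OpenClaw path, while "How are you today?" stays conversational.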

3. Data Flow

Connection: Frontend connects to ws://api.jorvis.io/v1/realtime.
Streaming:
- Frontend records microphone → Sends Blob.
- Gateway → AudioTranscoder → GeminiLiveService.
Response:
- Gemini stream → Gateway emits audio events.
- Frontend plays audio buffer queue.
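
The last step of the flow above, playing an audio buffer queue, can be sketched as a FIFO that starts the next chunk only when the current one has finished. The class name `AudioPlaybackQueue` and the injected `play` callback are assumptions for illustration; in a browser the callback would typically wrap the Web Audio API.

```typescript
// Callback that starts playback of one chunk and invokes onEnded when it finishes.
type PlayFn = (chunk: ArrayBuffer, onEnded: () => void) => void;

// Hypothetical playback queue: plays chunks strictly in arrival order.
class AudioPlaybackQueue {
  private readonly pending: ArrayBuffer[] = [];
  private playing = false;
  private readonly play: PlayFn;

  constructor(play: PlayFn) {
    this.play = play;
  }

  // Number of chunks waiting behind the one currently playing.
  get size(): number {
    return this.pending.length;
  }

  // Adds a chunk; starts playback immediately if nothing is playing.
  enqueue(chunk: ArrayBuffer): void {
    this.pending.push(chunk);
    if (!this.playing) this.playNext();
  }

  private playNext(): void {
    const next = this.pending.shift();
    if (next === undefined) {
      this.playing = false; // queue drained
      return;
    }
    this.playing = true;
    this.play(next, () => this.playNext());
  }
}
```

Queuing on the client absorbs jitter in chunk arrival: chunks that arrive early wait their turn instead of overlapping the one that is playing.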

4. Protocols

4.1 Client Events (Inbound)

| Event | Payload | Description |
| --- | --- | --- |
| `start_session` | `{ config: VoiceConfig }` | Init session params |
| `audio_chunk` | `ArrayBuffer` | Raw audio data |
| `stop_session` | `{}` | End stream |
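
The inbound events above can be modelled as a discriminated union with a small frame decoder. The wire framing is not specified in this document, so the sketch assumes that binary frames carry audio and that text frames carry a JSON envelope of the form `{ event, payload }`; `VoiceConfig` is left as an opaque placeholder, and `decodeClientFrame` is a hypothetical name.

```typescript
// The shape of VoiceConfig is not given in this document; treated as opaque here.
type VoiceConfig = Record<string, unknown>;

type ClientEvent =
  | { event: "start_session"; payload: { config: VoiceConfig } }
  | { event: "audio_chunk"; payload: ArrayBuffer }
  | { event: "stop_session"; payload: Record<string, never> };

// Assumed framing: binary frame = audio chunk, text frame = JSON envelope.
function decodeClientFrame(frame: string | ArrayBuffer): ClientEvent {
  if (typeof frame !== "string") {
    return { event: "audio_chunk", payload: frame };
  }
  const msg = JSON.parse(frame) as { event?: unknown; payload?: unknown };
  if (msg.event === "start_session") {
    const payload = msg.payload as { config?: unknown } | null | undefined;
    if (!payload || typeof payload.config !== "object" || payload.config === null) {
      throw new Error("start_session requires a config object");
    }
    return { event: "start_session", payload: { config: payload.config as VoiceConfig } };
  }
  if (msg.event === "stop_session") {
    return { event: "stop_session", payload: {} };
  }
  throw new Error(`Unknown client event: ${String(msg.event)}`);
}
```

Rejecting unknown events at the boundary keeps the rest of the gateway free of defensive checks, since every value past this point is one of the three documented events.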

4.2 Server Events (Outbound)

| Event | Payload | Description |
| --- | --- | --- |
| `transcript` | `{ text: string, type: 'user'\|'agent' }` | Real-time text |
| `audio_chunk` | `ArrayBuffer` | Response audio to play |
| `state` | `{ state: 'listening'\|'thinking'\|'speaking' }` | UI feedback state |
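
On the client, the outbound events above can be folded into a single view model. The sketch below assumes each event reaches the handler as an `{ event, payload }` object; the names `VoiceView`, `initialView`, and `applyServerEvent`, and the choice of `listening` as the initial state, are hypothetical.

```typescript
type VoiceState = "listening" | "thinking" | "speaking";

type ServerEvent =
  | { event: "transcript"; payload: { text: string; type: "user" | "agent" } }
  | { event: "audio_chunk"; payload: ArrayBuffer }
  | { event: "state"; payload: { state: VoiceState } };

// Hypothetical client-side view model driven by server events.
interface VoiceView {
  state: VoiceState;
  transcript: Array<{ speaker: "user" | "agent"; text: string }>;
  audioQueue: ArrayBuffer[];
}

// Assumed starting point: the client is listening with nothing received yet.
const initialView: VoiceView = { state: "listening", transcript: [], audioQueue: [] };

// Pure reducer: returns a new view and leaves the previous one untouched.
function applyServerEvent(view: VoiceView, ev: ServerEvent): VoiceView {
  switch (ev.event) {
    case "transcript":
      return {
        ...view,
        transcript: [...view.transcript, { speaker: ev.payload.type, text: ev.payload.text }],
      };
    case "audio_chunk":
      return { ...view, audioQueue: [...view.audioQueue, ev.payload] };
    case "state":
      return { ...view, state: ev.payload.state };
  }
}
```

Because the switch covers every member of `ServerEvent`, adding a new event type to the union produces a compile-time error until the reducer handles it.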

5. Security

Auth: Standard bearer token, passed as the ?token=... query parameter during the WebSocket handshake.
Rate Limit: Enforced by capping connection duration (max 5 min per session).
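
Both rules above can be sketched as small pure helpers: one reads the token from the handshake request URL, the other decides when a session has reached the 5-minute cap. The function names and the placeholder base URL are assumptions for illustration, and token verification itself is out of scope here.

```typescript
const MAX_SESSION_MS = 5 * 60 * 1000; // max 5 min per session, as stated above

// Extracts the token from a handshake URL such as "/v1/realtime?token=abc".
// Returns null when the parameter is missing or empty.
function extractToken(requestUrl: string): string | null {
  // Handshake URLs are path-relative, so a placeholder base is needed to parse them.
  const url = new URL(requestUrl, "ws://placeholder.invalid");
  const token = url.searchParams.get("token");
  return token !== null && token.length > 0 ? token : null;
}

// Timestamp (ms since epoch) at which the session must be closed.
function sessionDeadline(connectedAtMs: number): number {
  return connectedAtMs + MAX_SESSION_MS;
}

// True once the session has reached its duration cap.
function isSessionExpired(connectedAtMs: number, nowMs: number): boolean {
  return nowMs >= sessionDeadline(connectedAtMs);
}
```

A gateway could call `extractToken` while handling the upgrade request and schedule a close at `sessionDeadline(Date.now())`. Tokens in query strings tend to appear in access logs, so short-lived tokens are a sensible precaution with this scheme.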

6. Implementation Status (v0.7.0)

| Feature | Status | Notes |
| --- | --- | --- |
| WebSocket Gateway | ✅ Active | Supports basic audio streaming |
| Gemini Live | ✅ Active | Primary voice engine |
| Transcoding | ✅ Active | Robust ffmpeg integration |
| VAD (Voice Activity Detection) | ⚠️ Partial | Relies on Gemini's internal VAD |