Voice Platform Architecture

Version: 1.1
Status: Implemented (Core)
Component: Voice Gateway
Source: src/voice/


1. Overview

The Voice Platform enables real-time, bidirectional voice interaction with Jorvis. It supports both WebSocket (Server-Side) and WebRTC (Client-Side) flows: the RealtimeGateway manages the audio streams, and the AudioTranscoderService handles format conversion.

2. Core Components

2.1 Realtime Gateway (src/voice/realtime.gateway.ts)

  • Protocol: WebSocket (/v1/realtime)
  • Responsibilities:
    • Manages the persistent connection.
    • Handles audio_chunk events (binary).
    • Emits transcript and audio_response events.

2.2 Gemini Live Service (src/voice/gemini-live.service.ts)

  • Role: Direct integration with Google Gemini Live API (Multimodal).
  • Flow:
    • Streams user audio chunks directly to Gemini.
    • Receives streaming text/audio response chunks.
    • Low latency (<500ms).

2.3 Audio Transcoder (src/voice/audio-transcoder.service.ts)

  • Input: WebM / Ogg Opus (Browser Default).
  • Output: Linear16 PCM 24kHz (Required by Gemini).
  • Library: ffmpeg / prism-media.
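
The conversion above can be sketched as a set of ffmpeg arguments plus a helper for sizing output buffers. This is a minimal sketch, not the service's actual code: the function names are hypothetical, and mono output is an assumption, since the document does not state a channel count.

```typescript
// Output format taken from the section above: Linear16 PCM at 24 kHz.
const SAMPLE_RATE_HZ = 24_000;
const BYTES_PER_SAMPLE = 2; // 16-bit samples
const CHANNELS = 1; // Assumption: mono; the document does not state a channel count.

/** ffmpeg arguments that read a container stream on stdin and write raw PCM to stdout. */
function buildFfmpegArgs(): string[] {
  return [
    "-i", "pipe:0",                // read WebM / Ogg Opus from stdin
    "-f", "s16le",                 // raw signed 16-bit little-endian output
    "-acodec", "pcm_s16le",
    "-ar", String(SAMPLE_RATE_HZ), // resample to 24 kHz
    "-ac", String(CHANNELS),       // downmix to the assumed channel count
    "pipe:1",                      // write PCM to stdout
  ];
}

/** Size in bytes of `durationMs` of audio in the output format. */
function pcmBytesForDuration(durationMs: number): number {
  const samples = Math.round((SAMPLE_RATE_HZ * durationMs) / 1000);
  return samples * BYTES_PER_SAMPLE * CHANNELS;
}
```

The argument list could be handed to a spawned ffmpeg process, with browser chunks written to its stdin and PCM read from its stdout. Under these assumptions, one second of output audio occupies 48,000 bytes.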

2.4 Voice Intent Router (src/voice/voice-intent-router.service.ts)

  • Role: Classifies each incoming transcript and routes the request to either the Conversational or the OpenClaw execution path.
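
The routing decision can be illustrated with a deliberately simple stand-in. The document does not describe how the router classifies transcripts, so the keyword list, the type name, and the function below are all hypothetical; a real classifier is likely more sophisticated.

```typescript
type VoiceRoute = "conversational" | "openclaw";

// Hypothetical trigger verbs, chosen only for illustration.
const ACTION_VERBS = ["run", "deploy", "create", "delete", "open", "schedule"];

/** Picks an execution path for one transcript using whole-word keyword matching. */
function routeTranscript(transcript: string): VoiceRoute {
  const words = transcript.toLowerCase().match(/[a-z']+/g) ?? [];
  // Only a whole-word match counts, so "openings" does not trigger "open".
  const isAction = words.some((word) => ACTION_VERBS.indexOf(word) !== -1);
  return isAction ? "openclaw" : "conversational";
}
```

Keeping the router a pure function of the transcript makes it easy to swap the keyword check for a model call later without touching the gateway.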

3. Data Flow

  1. Connection: Frontend connects to ws://api.jorvis.io/v1/realtime.
  2. Streaming:
    • Frontend records microphone → Sends Blob.
    • Gateway → AudioTranscoder → GeminiLiveService.
  3. Response:
    • Gemini Stream → Gateway extracts the audio events and forwards them to the Frontend.
    • Frontend plays audio buffer queue.
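
The streaming and response legs above can be sketched as a small wiring function. The interfaces here are hypothetical stand-ins for AudioTranscoderService and GeminiLiveService, and the transcoder is kept synchronous for brevity; a real one would likely be stream-based.

```typescript
// Hypothetical interfaces standing in for AudioTranscoderService and GeminiLiveService.
interface Transcoder {
  toPcm(chunk: Uint8Array): Uint8Array;
}
interface LiveSession {
  sendAudio(pcm: Uint8Array): void;
  onAudio(listener: (audio: Uint8Array) => void): void;
}
type Emit = (event: "audio_chunk", payload: Uint8Array) => void;

/** Wires both legs and returns the handler for inbound client chunks. */
function wirePipeline(
  transcoder: Transcoder,
  session: LiveSession,
  emit: Emit,
): (chunk: Uint8Array) => void {
  // Response leg: Gemini stream → Gateway → Frontend.
  session.onAudio((audio) => emit("audio_chunk", audio));
  // Streaming leg: Frontend → Gateway → transcoder → Gemini session.
  return (chunk) => session.sendAudio(transcoder.toPcm(chunk));
}
```

Passing the collaborators in as interfaces means the flow can be exercised with in-memory fakes, without ffmpeg or a network connection.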

4. Protocols

4.1 Client Events (Inbound)

Event          Payload                    Description
start_session  { config: VoiceConfig }    Init session params
audio_chunk    ArrayBuffer                Raw audio data
stop_session   {}                         End stream
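
The client events above can be modeled as a discriminated union with a small frame decoder. This is a sketch under assumptions the document does not confirm: control events are taken to be JSON text frames carrying an `event` field, audio is taken to be a bare binary frame, and `VoiceConfig` is a placeholder because its fields are not listed here.

```typescript
// Placeholder: the fields of VoiceConfig are not specified in this document.
type VoiceConfig = Record<string, unknown>;

type ClientEvent =
  | { event: "start_session"; config: VoiceConfig }
  | { event: "audio_chunk"; data: ArrayBuffer }
  | { event: "stop_session" };

/** Decodes one WebSocket frame, or returns null for anything unrecognized. */
function parseClientFrame(frame: string | ArrayBuffer): ClientEvent | null {
  if (typeof frame !== "string") {
    return { event: "audio_chunk", data: frame }; // assumed: binary frame = raw audio
  }
  let msg: unknown;
  try {
    msg = JSON.parse(frame);
  } catch {
    return null; // malformed JSON
  }
  if (typeof msg !== "object" || msg === null) return null;
  const { event, config } = msg as { event?: unknown; config?: unknown };
  if (event === "stop_session") return { event: "stop_session" };
  if (event === "start_session" && typeof config === "object" && config !== null) {
    return { event: "start_session", config: config as VoiceConfig };
  }
  return null;
}
```

Returning null instead of throwing lets the gateway decide whether an unknown frame should be ignored or should close the connection.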

4.2 Server Events (Outbound)

Event        Payload                                         Description
transcript   { text: string, type: 'user'|'agent' }          Real-time text
audio_chunk  ArrayBuffer                                     Response audio to play
state        { state: 'listening'|'thinking'|'speaking' }    UI feedback state
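
On the frontend, the server events above could be folded into a small UI model. The sketch below is illustrative only: the `event` discriminant, the `data` field name, and the `VoiceUi` shape are assumptions, not part of the documented protocol.

```typescript
type AgentState = "listening" | "thinking" | "speaking";

type ServerEvent =
  | { event: "transcript"; text: string; type: "user" | "agent" }
  | { event: "audio_chunk"; data: ArrayBuffer }
  | { event: "state"; state: AgentState };

// Hypothetical client-side model of what the UI needs to render.
interface VoiceUi {
  state: AgentState;
  transcript: { speaker: "user" | "agent"; text: string }[];
  playbackQueue: ArrayBuffer[];
}

/** Returns a new UI model with one server event applied; the input is not mutated. */
function applyServerEvent(ui: VoiceUi, ev: ServerEvent): VoiceUi {
  switch (ev.event) {
    case "transcript":
      return { ...ui, transcript: [...ui.transcript, { speaker: ev.type, text: ev.text }] };
    case "audio_chunk":
      return { ...ui, playbackQueue: [...ui.playbackQueue, ev.data] };
    case "state":
      return { ...ui, state: ev.state };
  }
}
```

Because the switch covers every member of the union, the compiler flags any new server event that the UI does not yet handle.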

5. Security

  • Auth: Standard Bearer token, passed in the ?token=... query parameter during the WebSocket handshake.
  • Rate Limit: Enforced by capping connection duration (max 5 min/session).
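
Both rules above reduce to two small checks, sketched below. The function names are hypothetical, and verifying the token itself (signature, expiry, user lookup) is out of scope here because the document does not describe it.

```typescript
const MAX_SESSION_MS = 5 * 60 * 1000; // 5 min/session, from the rate limit above

/** Extracts the `token` query parameter from a handshake URL, or null if absent or empty. */
function extractToken(requestUrl: string): string | null {
  // The base is only used when the server sees a path-only URL such as "/v1/realtime?token=x".
  const token = new URL(requestUrl, "ws://localhost").searchParams.get("token");
  return token ? token : null;
}

/** True once a session that started at `startedAtMs` has used up its time budget. */
function isSessionExpired(startedAtMs: number, nowMs: number): boolean {
  return nowMs - startedAtMs >= MAX_SESSION_MS;
}
```

A gateway could reject the handshake when `extractToken` returns null and close the socket the first time `isSessionExpired` returns true.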

6. Implementation Status (v0.7.0)

Feature                Status       Notes
WebSocket Gateway      ✅ Active    Support for basic audio streaming
Gemini Live            ✅ Active    Primary voice engine
Transcoding            ✅ Active    Robust ffmpeg integration
VAD (Voice Detection)  ⚠️ Partial   Relying on Gemini's internal VAD