Voice Gateway Architecture
At Jorvis, we believe that natural, conversational interaction is the future of enterprise data analysis. To achieve this, we developed a proprietary Voice Gateway built on WebRTC and WebSocket technologies, designed specifically to deliver ultra-low latency (<50ms) voice experiences.
Core Components
- WebRTC Ingestion: Audio is streamed directly from the client's browser to our edge nodes using WebRTC. This avoids the overhead of traditional HTTP request-response cycles for audio chunks, ensuring immediate audio availability at the server.
- WebSocket Signaling: Control messages, intent routing signals, and system state are managed over a persistent WebSocket connection, enabling full-duplex communication without the per-request connection overhead of repeated HTTP calls.
- Voice Intent Routing: Before passing audio to heavyweight Large Language Models (LLMs), a lightweight routing layer analyzes acoustic features and initial transcription snippets to determine the user's intent. If the query is a simple navigation command or a known cached query, the router bypasses the LLM entirely, yielding near-instantaneous responses.
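The signaling channel described above carries small structured control messages. A minimal sketch of what such a message envelope might look like, assuming a JSON wire format; the field names and message types here are illustrative, not the actual Jorvis protocol:

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class SignalingMessage:
    # Hypothetical schema: "type" might be "offer", "answer",
    # "ice-candidate", or an intent routing signal.
    type: str
    session_id: str
    payload: dict

def encode(msg: SignalingMessage) -> str:
    """Serialize a control message for the WebSocket channel."""
    return json.dumps(asdict(msg))

def decode(raw: str) -> SignalingMessage:
    """Parse an incoming control message from the wire."""
    return SignalingMessage(**json.loads(raw))
```

Keeping these messages on a single persistent socket means both peers can push state changes at any time, which is what makes the full-duplex control path possible.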
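The fast-path behavior of the Voice Intent Router can be sketched as a lookup performed ahead of any LLM call. The command set, cache contents, and route labels below are hypothetical stand-ins:

```python
# Known single-shot navigation commands (illustrative examples).
NAV_COMMANDS = {"go to dashboard", "open settings", "show alerts"}

# Hypothetical cache of previously answered queries.
QUERY_CACHE = {"revenue this quarter": {"value": "$1.2M"}}

def route_intent(transcript_snippet: str):
    """Route a partial transcript: fast-path known commands and
    cached queries, fall back to the LLM for everything else."""
    text = transcript_snippet.strip().lower()
    if text in NAV_COMMANDS:
        return ("navigate", text)            # bypass the LLM entirely
    if text in QUERY_CACHE:
        return ("cached", QUERY_CACHE[text])  # serve from cache
    return ("llm", None)                      # heavyweight path
```

Only the last branch pays LLM latency; the first two return in microseconds, which is what makes the "near-instantaneous" fast path possible.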
Achieving <50ms Latency
Traditional voice assistants suffer from latency caused by strictly sequential processing: Speech-to-Text (STT) -> Natural Language Understanding (NLU) -> Text-to-Speech (TTS). Each stage must finish before the next begins, so their latencies add up.
Jorvis achieves <50ms latency through four techniques:
- Streaming pipelines: Processing begins as soon as the first few milliseconds of audio arrive, rather than waiting for the full utterance.
- Edge computing: WebRTC connections are terminated as close to the user as possible, minimizing round-trip time.
- Optimized backends: The core routing logic is kept deliberately small, so per-request processing adds negligible delay.
- Predictive pre-fetching: If the Voice Intent Router predicts a data fetch will be required, it initiates the database query before the user has even finished speaking.
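Predictive pre-fetching can be illustrated with a small sketch: a hypothetical `fetch_report` query is launched from a partial transcript on a worker thread, so the database round-trip overlaps with the remainder of the user's utterance:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fetch_report(query: str) -> str:
    """Stand-in for a database query; the sleep simulates I/O latency."""
    time.sleep(0.05)
    return f"results for {query!r}"

# Worker pool for speculative fetches.
executor = ThreadPoolExecutor(max_workers=4)

def on_partial_transcript(predicted_query: str):
    # Fire the query as soon as the router predicts it, while the
    # user is still speaking.
    return executor.submit(fetch_report, predicted_query)

future = on_partial_transcript("weekly sales by region")
# ... user keeps speaking; STT finalizes the transcript in parallel ...
result = future.result()  # by now the query has largely already run
```

The key point is that the fetch latency is hidden behind speech time the system would have spent waiting anyway; a wrong prediction simply discards the speculative result.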
Security and Privacy
All WebRTC media streams are encrypted in transit with DTLS-SRTP, as mandated by the WebRTC specification. We employ strict ephemeral secret rotation for WebRTC token generation, ensuring that no unauthorized client can establish a voice session.