Resilience Architecture

Version: 1.1
Status: Implemented (Core)
Layer: Infrastructure
Source: src/resilience/


1. Overview

Jorvis implements a robust resilience layer to handle the inherent instability of external LLM APIs (Gemini, OpenAI) and database connections. The system uses the Circuit Breaker, Retry, and Rate Limiting patterns to prevent cascading failures and ensure graceful degradation; the Bulkhead pattern is planned (see section 7).

2. Circuit Breaker Pattern

Service: CircuitBreakerService (src/resilience/circuit-breaker.service.ts)

Protects the system from repeatedly calling failing services.

2.1 States

  • CLOSED: Normal operation. Requests pass through.
  • OPEN: Failure threshold exceeded. All requests rejected immediately.
  • HALF_OPEN: Probe state. Limited requests allowed to test recovery.

2.2 Configuration (CircuitBreakerConfig)

These values are configurable via environment variables:

| Feature | Env Var | Default | Description |
|---|---|---|---|
| Failure Threshold | JORVIS_CB_FAILURE_THRESHOLD | 5 | Consecutive failures to open the circuit |
| Success Threshold | JORVIS_CB_SUCCESS_THRESHOLD | 2 | Consecutive successes to close the circuit |
| Open Duration | JORVIS_CB_OPEN_DURATION_MS | 30000 | Milliseconds to wait before half-open |
| Half-Open Limit | JORVIS_CB_HALF_OPEN_MAX_CALLS | 3 | Max probe requests |

2.3 Usage Example

if (this.circuitBreaker.canCall('gemini-api')) {
  try {
    const result = await this.gemini.generate();
    this.circuitBreaker.recordSuccess('gemini-api');
    return result;
  } catch (error) {
    this.circuitBreaker.recordFailure('gemini-api');
    throw error;
  }
} else {
  throw new ServiceUnavailableException('Gemini API is unavailable (Circuit Open)');
}

3. Retry Strategy

Utility: backoff.util.ts

Implements Exponential Backoff with Jitter to safely retry transient failures.

3.1 Algorithm

delay = min(base × 2^attempt, max_delay) + jitter

  • Base: Initial delay (e.g., 1000ms)
  • Max Delay: Cap to prevent excessive waits (e.g., 10s)
  • Jitter: Random +/- 10% to prevent thundering herd
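The formula above translates directly into code. This is an illustrative sketch, not the actual backoff.util.ts API; the function name backoffDelay and its parameters are assumptions, with defaults taken from the bullet points.

```typescript
// Exponential backoff with jitter, per the formula above (sketch; the
// backoffDelay name and signature are hypothetical, not backoff.util.ts).
function backoffDelay(
  attempt: number,
  baseMs = 1000,
  maxDelayMs = 10_000,
  jitterFraction = 0.1,
): number {
  // Cap the exponential growth at maxDelayMs.
  const exponential = Math.min(baseMs * 2 ** attempt, maxDelayMs);
  // Random jitter in [-jitterFraction, +jitterFraction] of the delay,
  // spreading out retries to avoid a thundering herd.
  const jitter = exponential * jitterFraction * (Math.random() * 2 - 1);
  return exponential + jitter;
}
```

With the defaults, attempts 0..4 yield roughly 1s, 2s, 4s, 8s, 10s (±10%).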

3.2 Scenarios

  1. Database Connection: Retries on startup (5 attempts).
  2. LLM Rate Limits (429): Retries, honoring the Retry-After header.
  3. Network Timeouts: Retries idempotent GET requests only.
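A retry wrapper covering scenarios 2 and 3 might look like the sketch below. It is hypothetical (not the project's actual utility): errors are assumed to carry a retryAfterSeconds field parsed from the Retry-After header; when absent, the wrapper falls back to capped exponential backoff.

```typescript
// Hypothetical retry wrapper: honors Retry-After when present (scenario 2),
// otherwise uses capped exponential backoff. The caller is responsible for
// only passing idempotent operations (scenario 3).
async function retryWithBackoff<T>(
  fn: () => Promise<T>,
  maxAttempts = 5,
  baseMs = 1000,
  maxDelayMs = 10_000,
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err: any) {
      if (attempt + 1 >= maxAttempts) throw err; // budget exhausted
      // retryAfterSeconds is an assumed field set by the HTTP layer from
      // the Retry-After response header on 429s.
      const delayMs = err?.retryAfterSeconds != null
        ? err.retryAfterSeconds * 1000
        : Math.min(baseMs * 2 ** attempt, maxDelayMs);
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
}
```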

4. Graceful Degradation

Service: DegradationService (src/resilience/degradation.service.ts)

Ensures functionality (even if limited) when primary services fail.

| Failure Mode | Fallback Action | User Experience |
|---|---|---|
| Primary LLM (Gemini) down | Switch to secondary (OpenAI/Azure) | Slower, but functional |
| RAG vector store down | Switch to keyword search (Postgres) | Lower relevance, but valid data |
| SQL generation fails | Fall back to pre-defined/cached queries | Limited scope |
| Voice TTS down | Return text-only response | No audio |
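The first two rows share a common shape: try providers in priority order until one succeeds. A minimal sketch of such a fallback chain (illustrative only; withFallback is not the DegradationService API):

```typescript
// Hypothetical fallback chain: attempt each provider in priority order,
// degrading to the next on failure; rethrow the last error if all fail.
async function withFallback<T>(providers: Array<() => Promise<T>>): Promise<T> {
  let lastError: unknown;
  for (const provider of providers) {
    try {
      return await provider();
    } catch (err) {
      lastError = err; // degrade to the next provider in the chain
    }
  }
  throw lastError;
}
```

For the LLM row this would be called with the Gemini client first and the OpenAI/Azure client second; the real service additionally surfaces the degraded mode to the caller.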

5. Rate Limiting

Service: RateLimiterService (src/resilience/rate-limiter.service.ts)

Protects internal resources and external APIs from abuse.

  • Scope: Per-Tenant or Per-User
  • Storage: Redis (distributed) or Memory (local)
  • Limits:
    • Free Tier: 50 requests/hour
    • Pro Tier: 1000 requests/hour
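The in-memory variant can be illustrated with a fixed-window counter per key. This is a sketch under assumptions (the class name InMemoryRateLimiter and allow method are hypothetical, not the RateLimiterService API); the Redis variant would replace the Map with INCR/EXPIRE on shared keys.

```typescript
// Hypothetical fixed-window limiter: at most `limit` calls per `windowMs`
// per key (tenant or user). The optional `now` parameter aids testing.
class InMemoryRateLimiter {
  private windows = new Map<string, { start: number; count: number }>();

  constructor(private limit: number, private windowMs: number) {}

  allow(key: string, now = Date.now()): boolean {
    const w = this.windows.get(key);
    if (!w || now - w.start >= this.windowMs) {
      // First call in a fresh window for this key.
      this.windows.set(key, { start: now, count: 1 });
      return true;
    }
    if (w.count >= this.limit) return false; // window budget exhausted
    w.count++;
    return true;
  }
}
```

The Free tier above would be `new InMemoryRateLimiter(50, 3_600_000)`, the Pro tier `new InMemoryRateLimiter(1000, 3_600_000)`.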

6. Implementation Status (v0.7.0)

| Component | Status | Notes |
|---|---|---|
| Circuit Breaker | ✅ Active | Protecting Gemini/OpenAI calls |
| Exponential Backoff | ✅ Active | DB startup, HTTP retries |
| Rate Limiting | ✅ Active | Basic in-memory implementation |
| Fallback Chains | ⚠️ Partial | LLM fallback works, RAG fallback planned |

7. Future Roadmap (Phase R)

  • Redis-backed Rate Limiting: Move from memory to Redis for cluster support.
  • Bulkhead Pattern: Isolate execution pools for different tenants.
  • Adaptive Concurrency: Dynamically adjust concurrency based on latency.