Resilience Architecture

Version: 1.1
Status: Implemented (Core)
Layer: Infrastructure
Source: src/resilience/


1. Overview

Jorvis implements a robust resilience layer to handle the inherent instability of external LLM APIs (Gemini, OpenAI) and database connections. The system uses the Circuit Breaker, Retry, and Rate Limiting patterns to prevent cascading failures and ensure graceful degradation; the Bulkhead pattern is planned (see section 7).

2. Circuit Breaker Pattern

Service: CircuitBreakerService (src/resilience/circuit-breaker.service.ts)

Protects the system from repeatedly calling failing services.

2.1 States

  • CLOSED: Normal operation. Requests pass through.
  • OPEN: Failure threshold exceeded. All requests rejected immediately.
  • HALF_OPEN: Probe state. Limited requests allowed to test recovery.

2.2 Configuration (CircuitBreakerConfig)

These values are configurable via environment variables:

| Feature | Env Var | Default | Description |
|---|---|---|---|
| Failure Threshold | JORVIS_CB_FAILURE_THRESHOLD | 5 | Consecutive failures to open the circuit |
| Success Threshold | JORVIS_CB_SUCCESS_THRESHOLD | 2 | Consecutive successes to close the circuit |
| Open Duration | JORVIS_CB_OPEN_DURATION_MS | 30000 | Milliseconds to wait before half-open |
| Half-Open Limit | JORVIS_CB_HALF_OPEN_MAX_CALLS | 3 | Max probe requests |

2.3 Usage Example

if (this.circuitBreaker.canCall('gemini-api')) {
  try {
    const result = await this.gemini.generate();
    this.circuitBreaker.recordSuccess('gemini-api');
    return result;
  } catch (error) {
    this.circuitBreaker.recordFailure('gemini-api');
    throw error;
  }
} else {
  throw new ServiceUnavailableException('Gemini API is unavailable (Circuit Open)');
}

3. Retry Strategy

Utility: backoff.util.ts

Implements Exponential Backoff with Jitter to safely retry transient failures.

3.1 Algorithm

delay = min(base × 2^attempt, max_delay) + jitter

  • Base: Initial delay (e.g., 1000ms)
  • Max Delay: Cap to prevent excessive waits (e.g., 10s)
  • Jitter: Random +/- 10% to prevent thundering herd
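The formula above translates directly into code. This is an illustrative sketch, not the actual backoff.util.ts API; the function name backoffDelay and its parameters are assumptions, with defaults taken from the bullet points.

```typescript
// Exponential backoff with jitter, per the formula above (sketch; the
// backoffDelay name and signature are hypothetical, not backoff.util.ts).
function backoffDelay(
  attempt: number,
  baseMs = 1000,
  maxDelayMs = 10_000,
  jitterFraction = 0.1,
): number {
  // Cap the exponential growth at maxDelayMs.
  const exponential = Math.min(baseMs * 2 ** attempt, maxDelayMs);
  // Random jitter in [-jitterFraction, +jitterFraction] of the delay,
  // spreading out retries to avoid a thundering herd.
  const jitter = exponential * jitterFraction * (Math.random() * 2 - 1);
  return exponential + jitter;
}
```

With the defaults, attempts 0..4 yield roughly 1s, 2s, 4s, 8s, 10s (±10%).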

3.2 Scenarios

  1. Database Connection: Retries on startup (5 attempts).
  2. LLM Rate Limits (429): Retries, honoring the Retry-After header.
  3. Network Timeouts: Retries idempotent GET requests only.
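A retry wrapper covering scenarios 2 and 3 might look like the sketch below. It is hypothetical (not the project's actual utility): errors are assumed to carry a retryAfterSeconds field parsed from the Retry-After header; when absent, the wrapper falls back to capped exponential backoff.

```typescript
// Hypothetical retry wrapper: honors Retry-After when present (scenario 2),
// otherwise uses capped exponential backoff. The caller is responsible for
// only passing idempotent operations (scenario 3).
async function retryWithBackoff<T>(
  fn: () => Promise<T>,
  maxAttempts = 5,
  baseMs = 1000,
  maxDelayMs = 10_000,
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err: any) {
      if (attempt + 1 >= maxAttempts) throw err; // budget exhausted
      // retryAfterSeconds is an assumed field set by the HTTP layer from
      // the Retry-After response header on 429s.
      const delayMs = err?.retryAfterSeconds != null
        ? err.retryAfterSeconds * 1000
        : Math.min(baseMs * 2 ** attempt, maxDelayMs);
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
}
```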

4. Graceful Degradation

Service: DegradationService (src/resilience/degradation.service.ts)

Ensures functionality (even if limited) when primary services fail.

| Failure Mode | Fallback Action | User Experience |
|---|---|---|
| Primary LLM (Gemini) down | Switch to secondary (OpenAI/Azure) | Slower, but functional |
| RAG vector store down | Switch to keyword search (Postgres) | Lower relevance, but valid data |
| SQL generation fails | Fall back to pre-defined/cached queries | Limited scope |
| Voice TTS down | Return text-only response | No audio |
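The first two rows share a common shape: try providers in priority order until one succeeds. A minimal sketch of such a fallback chain (illustrative only; withFallback is not the DegradationService API):

```typescript
// Hypothetical fallback chain: attempt each provider in priority order,
// degrading to the next on failure; rethrow the last error if all fail.
async function withFallback<T>(providers: Array<() => Promise<T>>): Promise<T> {
  let lastError: unknown;
  for (const provider of providers) {
    try {
      return await provider();
    } catch (err) {
      lastError = err; // degrade to the next provider in the chain
    }
  }
  throw lastError;
}
```

For the LLM row this would be called with the Gemini client first and the OpenAI/Azure client second; the real service additionally surfaces the degraded mode to the caller.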

5. Rate Limiting

Service: RateLimiterService (src/resilience/rate-limiter.service.ts)

Protects internal resources and external APIs from abuse.

  • Scope: Per-Tenant or Per-User
  • Storage: Redis (distributed) or Memory (local)
  • Limits:
    • Free Tier: 50 requests/hour
    • Pro Tier: 1000 requests/hour
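The in-memory variant can be illustrated with a fixed-window counter per key. This is a sketch under assumptions (the class name InMemoryRateLimiter and allow method are hypothetical, not the RateLimiterService API); the Redis variant would replace the Map with INCR/EXPIRE on shared keys.

```typescript
// Hypothetical fixed-window limiter: at most `limit` calls per `windowMs`
// per key (tenant or user). The optional `now` parameter aids testing.
class InMemoryRateLimiter {
  private windows = new Map<string, { start: number; count: number }>();

  constructor(private limit: number, private windowMs: number) {}

  allow(key: string, now = Date.now()): boolean {
    const w = this.windows.get(key);
    if (!w || now - w.start >= this.windowMs) {
      // First call in a fresh window for this key.
      this.windows.set(key, { start: now, count: 1 });
      return true;
    }
    if (w.count >= this.limit) return false; // window budget exhausted
    w.count++;
    return true;
  }
}
```

The Free tier above would be `new InMemoryRateLimiter(50, 3_600_000)`, the Pro tier `new InMemoryRateLimiter(1000, 3_600_000)`.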

6. Implementation Status (v0.7.0)

| Component | Status | Notes |
|---|---|---|
| Circuit Breaker | ✅ Active | Protecting Gemini/OpenAI calls |
| Exponential Backoff | ✅ Active | DB startup, HTTP retries |
| Rate Limiting | ✅ Active | Basic in-memory implementation |
| Fallback Chains | ⚠️ Partial | LLM fallback works, RAG fallback planned |

7. Future Roadmap (Phase R)

  • Redis-backed Rate Limiting: Move from memory to Redis for cluster support.
  • Bulkhead Pattern: Isolate execution pools for different tenants.
  • Adaptive Concurrency: Dynamically adjust concurrency based on latency.