Resilience Architecture
Version: 1.1
Status: Implemented (Core)
Layer: Infrastructure
Source: src/resilience/
1. Overview
Jorvis implements a robust resilience layer to handle the inherent instability of external LLM APIs (Gemini, OpenAI) and database connections. The system uses the Circuit Breaker, Retry, and Bulkhead patterns to prevent cascading failures and ensure graceful degradation.
2. Circuit Breaker Pattern
Service: CircuitBreakerService (src/resilience/circuit-breaker.service.ts)
Protects the system from repeatedly calling failing services.
2.1 States
- CLOSED: Normal operation. Requests pass through.
- OPEN: Failure threshold exceeded. All requests rejected immediately.
- HALF_OPEN: Probe state. Limited requests allowed to test recovery.
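The transitions above can be sketched as a minimal single-circuit state machine. This is a simplified illustration, not the actual `CircuitBreakerService` (which tracks circuits per service key and also enforces the half-open call cap, omitted here for brevity):

```typescript
type CircuitState = 'CLOSED' | 'OPEN' | 'HALF_OPEN';

// Minimal state machine for the three states above. Thresholds are
// constructor parameters here; the real service reads them from
// configuration (see 2.2).
class Circuit {
  state: CircuitState = 'CLOSED';
  private failures = 0;
  private successes = 0;
  private openedAt = 0;

  constructor(
    private failureThreshold = 5,
    private successThreshold = 2,
    private openDurationMs = 30_000,
    private now: () => number = Date.now, // injectable clock for testing
  ) {}

  canCall(): boolean {
    if (this.state === 'OPEN' && this.now() - this.openedAt >= this.openDurationMs) {
      this.state = 'HALF_OPEN'; // open window elapsed: allow probes
      this.successes = 0;
    }
    return this.state !== 'OPEN';
  }

  recordSuccess(): void {
    if (this.state === 'HALF_OPEN' && ++this.successes >= this.successThreshold) {
      this.state = 'CLOSED'; // recovered: resume normal operation
    }
    this.failures = 0;
  }

  recordFailure(): void {
    if (this.state === 'HALF_OPEN' || ++this.failures >= this.failureThreshold) {
      this.state = 'OPEN'; // trip: reject requests immediately
      this.openedAt = this.now();
    }
  }
}
```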
2.2 Configuration (CircuitBreakerConfig)
These values are configurable via environment variables:
| Feature | Env Var | Default | Description |
|---|---|---|---|
| Failure Threshold | JORVIS_CB_FAILURE_THRESHOLD | 5 | Consecutive failures required to open the circuit |
| Success Threshold | JORVIS_CB_SUCCESS_THRESHOLD | 2 | Consecutive successes required to close the circuit |
| Open Duration | JORVIS_CB_OPEN_DURATION_MS | 30000 | Milliseconds to wait before transitioning to half-open |
| Half-Open Limit | JORVIS_CB_HALF_OPEN_MAX_CALLS | 3 | Maximum probe requests while half-open |
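Loading these variables might look like the sketch below. The interface mirrors the table; the actual shape of `CircuitBreakerConfig` in `circuit-breaker.service.ts` may differ:

```typescript
interface CircuitBreakerConfig {
  failureThreshold: number;
  successThreshold: number;
  openDurationMs: number;
  halfOpenMaxCalls: number;
}

// Read each env var, falling back to the documented default when the
// variable is unset or not a valid number.
function loadCircuitBreakerConfig(
  env: Record<string, string | undefined>,
): CircuitBreakerConfig {
  const num = (key: string, fallback: number): number => {
    const raw = env[key];
    const parsed = raw === undefined ? NaN : Number(raw);
    return Number.isFinite(parsed) ? parsed : fallback;
  };
  return {
    failureThreshold: num('JORVIS_CB_FAILURE_THRESHOLD', 5),
    successThreshold: num('JORVIS_CB_SUCCESS_THRESHOLD', 2),
    openDurationMs: num('JORVIS_CB_OPEN_DURATION_MS', 30000),
    halfOpenMaxCalls: num('JORVIS_CB_HALF_OPEN_MAX_CALLS', 3),
  };
}
```

In a Node process this would typically be called as `loadCircuitBreakerConfig(process.env)` at startup.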
2.3 Usage Example
```typescript
if (this.circuitBreaker.canCall('gemini-api')) {
  try {
    const result = await this.gemini.generate();
    this.circuitBreaker.recordSuccess('gemini-api');
    return result;
  } catch (error) {
    this.circuitBreaker.recordFailure('gemini-api');
    throw error;
  }
} else {
  throw new ServiceUnavailableException('Gemini API is unavailable (Circuit Open)');
}
```
3. Retry Strategy
Utility: backoff.util.ts
Implements Exponential Backoff with Jitter to safely retry transient failures.
3.1 Algorithm
- Base: Initial delay (e.g., 1000ms)
- Max Delay: Cap to prevent excessive waits (e.g., 10s)
- Jitter: Random +/- 10% to prevent thundering herd
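Putting the three parameters together, the delay computation can be sketched as follows (the actual signature in `backoff.util.ts` may differ):

```typescript
// Exponential backoff with ±10% jitter: delay doubles per attempt,
// capped at maxMs, then scaled by a random factor in [0.9, 1.1).
function backoffDelay(
  attempt: number, // 0-based retry attempt
  baseMs = 1000,   // initial delay
  maxMs = 10_000,  // cap to prevent excessive waits
  rng: () => number = Math.random, // injectable for testing
): number {
  const exp = Math.min(baseMs * 2 ** attempt, maxMs);
  const jitter = 1 + (rng() * 2 - 1) * 0.1;
  return Math.round(exp * jitter);
}
```

With jitter disabled (`rng` returning 0.5), attempts 0, 1, 2, 3 yield 1s, 2s, 4s, 8s, after which the 10s cap applies.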
3.2 Scenarios
- Database Connection: Retries on startup (5 attempts).
- LLM Rate Limits (429): Retries honoring the Retry-After header.
- Network Timeouts: Retries for idempotent GET requests.
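A retry wrapper covering these scenarios might look like the sketch below. The error shape (`status` / `retryAfterMs` fields) is illustrative, not the actual exception type Jorvis throws:

```typescript
interface TransientError {
  status?: number;      // HTTP status, e.g. 429
  retryAfterMs?: number; // parsed Retry-After header, if present
}

// Retry up to maxAttempts, preferring the server-supplied Retry-After
// delay on 429s over the computed backoff delay.
async function withRetry<T>(
  fn: () => Promise<T>,
  maxAttempts = 5,
  delayFn: (attempt: number) => number = (a) => Math.min(1000 * 2 ** a, 10_000),
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(() => resolve(), ms)),
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt + 1 >= maxAttempts) throw err; // budget exhausted
      const e = err as TransientError;
      const wait =
        e.status === 429 && e.retryAfterMs !== undefined
          ? e.retryAfterMs // honor Retry-After on rate limits
          : delayFn(attempt);
      await sleep(wait);
    }
  }
}
```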
4. Graceful Degradation
Service: DegradationService (src/resilience/degradation.service.ts)
Ensures functionality (even if limited) when primary services fail.
| Failure Mode | Fallback Action | Experience |
|---|---|---|
| Primary LLM (Gemini) Down | Switch to Secondary (OpenAI/Azure) | Slower, but functional |
| RAG Vector Store Down | Switch to Keyword Search (Postgres) | Lower relevance, but valid data |
| SQL Generation Fails | Fallback to Pre-defined/Cached Queries | Limited scope |
| Voice TTS Down | Return Text-only response | No audio |
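The common shape behind these rows is an ordered fallback chain: try each provider in turn and return the first success. A minimal sketch (provider names are illustrative; `DegradationService` wires in the real clients):

```typescript
type Provider<T> = { name: string; call: () => Promise<T> };

// Try providers in priority order; degrade to the next one on failure.
// Only throws if every provider in the chain fails.
async function firstAvailable<T>(
  providers: Provider<T>[],
): Promise<{ provider: string; result: T }> {
  let lastError: unknown;
  for (const p of providers) {
    try {
      return { provider: p.name, result: await p.call() };
    } catch (err) {
      lastError = err; // degrade: fall through to the next provider
    }
  }
  throw lastError ?? new Error('no providers configured');
}
```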
5. Rate Limiting
Service: RateLimiterService (src/resilience/rate-limiter.service.ts)
Protects internal resources and external APIs from abuse.
- Scope: Per-Tenant or Per-User
- Storage: Redis (distributed) or Memory (local)
- Limits:
- Free Tier: 50 requests/hour
- Pro Tier: 1000 requests/hour
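The in-memory variant can be sketched as a fixed-window counter keyed per tenant or user. This is a simplified illustration; the actual `RateLimiterService` may use a sliding window, and the Redis-backed version is on the roadmap (see Section 7):

```typescript
// Fixed-window limiter: each key gets `limit` requests per window.
class FixedWindowLimiter {
  private windows = new Map<string, { start: number; count: number }>();

  constructor(
    private limit: number,             // e.g. 50 for Free, 1000 for Pro
    private windowMs = 60 * 60 * 1000, // one hour
    private now: () => number = Date.now, // injectable clock for testing
  ) {}

  tryAcquire(key: string): boolean {
    const t = this.now();
    const w = this.windows.get(key);
    if (!w || t - w.start >= this.windowMs) {
      this.windows.set(key, { start: t, count: 1 }); // start a fresh window
      return true;
    }
    if (w.count >= this.limit) return false; // over limit: reject
    w.count++;
    return true;
  }
}
```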
6. Implementation Status (v0.7.0)
| Component | Status | Notes |
|---|---|---|
| Circuit Breaker | ✅ Active | Protecting Gemini/OpenAI calls |
| Exponential Backoff | ✅ Active | DB startup, HTTP retries |
| Rate Limiting | ✅ Active | Basic in-memory implementation |
| Fallback Chains | ⚠️ Partial | LLM fallback works, RAG fallback planned |
7. Future Roadmap (Phase R)
- Redis-backed Rate Limiting: Move from memory to Redis for cluster support.
- Bulkhead Pattern: Isolate execution pools for different tenants.
- Adaptive Concurrency: Dynamically adjust concurrency based on latency.