ADR 0009: Schema Context RAG Strategy
ADR 0009: Schema Context RAG Strategy
Status
Proposed
Date
2026-01-02
Context
Jorvis must work with any database of any type, format, structure, and size. For large databases (100+ tables), passing the entire schema to the LLM context is:
- Expensive — 10,000+ tokens per query
- Noisy — irrelevant tables confuse the model
- Slow — increased latency
The current SchemaExplorerTool works for small schemas but doesn't scale.
Decision
Implement Schema Context RAG — a semantic retrieval system that finds only the relevant tables/columns for each user query.
Implementation defaults (v1)
- Embedding provider: OpenAI
text-embedding-3-smallviaOPENAI_API_KEY(can be made configurable later). - Schema annotations:
analytics-platform/config/schema-annotations.json, per-database map. - Trigger: Admin API
POST /internal/schema/sync(no auto-scan on boot).
Architecture
User Query: "Show top 10 customers by revenue"
│
▼
┌─────────────────────────────────────┐
│ Schema Context Retriever │
│ 1. Embed query │
│ 2. Search schema embeddings │
│ 3. Return top-k tables/columns │
└─────────────────────────────────────┘
│
▼
┌─────────────────────────────────────┐
│ Relevant Schema (5 tables, ~500 tokens) │
│ • customers (id, name, email) │
│ • orders (id, customer_id, amount) │
│ • ... │
└─────────────────────────────────────┘
│
▼
┌─────────────────────────────────────┐
│ SQL Generator (focused) │
└─────────────────────────────────────┘
Components
| Component | Responsibility |
|---|---|
SchemaContextPort | Port interface (ADR-0008 compliant) |
SchemaEmbeddingsStore | Vector DB for schema metadata |
SchemaContextRetriever | Semantic search implementation |
SchemaDiscoveryService | Auto-scan database metadata |
Embedding Strategy
Each table/column is embedded with enriched context:
Table: customers
Description: Stores customer information
Columns:
- id (integer, PK): Unique customer identifier
- name (varchar): Customer full name
- email (varchar): Customer email address
- created_at (timestamp): Registration date
Relationships:
- orders.customer_id → customers.id
Sample values: [REDACTED]
Retrieval Flow
- Query embedding — embed user question
- Top-k retrieval — find 5-10 most relevant tables
- Relationship expansion — include FK-connected tables
- Context assembly — format for LLM prompt
- Caching — cache recent retrievals
Database Support
| Database | Priority | Status |
|---|---|---|
| PostgreSQL | P0 | First implementation |
| MySQL | P1 | After Postgres works |
| SQL Server | P2 | Enterprise use case |
| Oracle | P2 | Enterprise use case |
| ClickHouse | P3 | Analytics use case |
| Snowflake | P3 | Cloud DW use case |
| BigQuery | P3 | Cloud DW use case |
Alternatives Considered
1. GraphRAG (Neo4j)
Pros: Better relationship traversal, multi-hop queries Cons: Additional infrastructure, overkill for schema-only
Decision: Defer to Phase 2; vector RAG sufficient for v1
2. Full Schema in Context
Pros: Simple, no additional components Cons: Doesn't scale beyond ~30 tables
Decision: Rejected for universal DB support
3. LLM-based Schema Selection
Pros: More intelligent selection Cons: Additional LLM call, latency, cost
Decision: Consider as enhancement after v1
Safety Constraints
| Constraint | Implementation |
|---|---|
| No PII in embeddings | Redact sample values |
| No raw SQL in logs | Use parameterized logging |
| Feature flag | JORVIS_SCHEMA_RAG_ENABLED=false default |
| Access control | Inherit DB user permissions |
Success Metrics
| Metric | Target |
|---|---|
| Schema retrieval precision | >80% |
| Token reduction | 10x vs full schema |
| Latency overhead | <200ms |
| Databases supported | 3+ dialects |
Implementation Plan
| Phase | Task | Week |
|---|---|---|
| 1 | SchemaContextPort interface | 1 |
| 1 | SchemaEmbeddingsStore (pgvector) | 1 |
| 2 | SchemaContextRetriever | 2 |
| 2 | SchemaDiscoveryService | 2 |
| 3 | GraphOrchestrator integration | 2 |
| 4 | Multi-dialect adapters | 3 |
References
- ADR-0007: Audit Storage and Retention Policy
- ADR-0008: Minimal Ports and Adapters
- Research: docs/roadmap/RAG/*.txt (8 documents)