ADR 0009: Schema Context RAG Strategy

Status

Proposed

Date

2026-01-02

Context

Jorvis must work with any database of any type, format, structure, and size. For large databases (100+ tables), passing the entire schema to the LLM context is:

  • Expensive — 10,000+ tokens per query
  • Noisy — irrelevant tables confuse the model
  • Slow — increased latency

The current SchemaExplorerTool works for small schemas but doesn't scale.

Decision

Implement Schema Context RAG — a semantic retrieval system that finds only the relevant tables/columns for each user query.

Implementation defaults (v1)

  • Embedding provider: OpenAI text-embedding-3-small via OPENAI_API_KEY (can be made configurable later).
  • Schema annotations: analytics-platform/config/schema-annotations.json, per-database map.
  • Trigger: Admin API POST /internal/schema/sync (no auto-scan on boot).

Architecture

User Query: "Show top 10 customers by revenue"
┌─────────────────────────────────────┐
│       Schema Context Retriever      │
│  1. Embed query                     │
│  2. Search schema embeddings        │
│  3. Return top-k tables/columns     │
└─────────────────────────────────────┘
┌─────────────────────────────────────┐
│ Relevant Schema (5 tables, ~500 tokens) │
│  • customers (id, name, email)      │
│  • orders (id, customer_id, amount) │
│  • ...                              │
└─────────────────────────────────────┘
┌─────────────────────────────────────┐
│       SQL Generator (focused)       │
└─────────────────────────────────────┘

Components

ComponentResponsibility
SchemaContextPortPort interface (ADR-0008 compliant)
SchemaEmbeddingsStoreVector DB for schema metadata
SchemaContextRetrieverSemantic search implementation
SchemaDiscoveryServiceAuto-scan database metadata

Embedding Strategy

Each table/column is embedded with enriched context:

Table: customers
Description: Stores customer information
Columns:
  - id (integer, PK): Unique customer identifier
  - name (varchar): Customer full name
  - email (varchar): Customer email address
  - created_at (timestamp): Registration date
Relationships:
  - orders.customer_id → customers.id
Sample values: [REDACTED]

Retrieval Flow

  1. Query embedding — embed user question
  2. Top-k retrieval — find 5-10 most relevant tables
  3. Relationship expansion — include FK-connected tables
  4. Context assembly — format for LLM prompt
  5. Caching — cache recent retrievals

Database Support

DatabasePriorityStatus
PostgreSQLP0First implementation
MySQLP1After Postgres works
SQL ServerP2Enterprise use case
OracleP2Enterprise use case
ClickHouseP3Analytics use case
SnowflakeP3Cloud DW use case
BigQueryP3Cloud DW use case

Alternatives Considered

1. GraphRAG (Neo4j)

Pros: Better relationship traversal, multi-hop queries Cons: Additional infrastructure, overkill for schema-only

Decision: Defer to Phase 2; vector RAG sufficient for v1

2. Full Schema in Context

Pros: Simple, no additional components Cons: Doesn't scale beyond ~30 tables

Decision: Rejected for universal DB support

3. LLM-based Schema Selection

Pros: More intelligent selection Cons: Additional LLM call, latency, cost

Decision: Consider as enhancement after v1

Safety Constraints

ConstraintImplementation
No PII in embeddingsRedact sample values
No raw SQL in logsUse parameterized logging
Feature flagJORVIS_SCHEMA_RAG_ENABLED=false default
Access controlInherit DB user permissions

Success Metrics

MetricTarget
Schema retrieval precision>80%
Token reduction10x vs full schema
Latency overhead<200ms
Databases supported3+ dialects

Implementation Plan

PhaseTaskWeek
1SchemaContextPort interface1
1SchemaEmbeddingsStore (pgvector)1
2SchemaContextRetriever2
2SchemaDiscoveryService2
3GraphOrchestrator integration2
4Multi-dialect adapters3

References

  • ADR-0007: Audit Storage and Retention Policy
  • ADR-0008: Minimal Ports and Adapters
  • Research: docs/roadmap/RAG/*.txt (8 documents)