Schema Discovery Engine — Architecture

Version: v1.0.0 Date: 2026-01-22 Location: analytics-platform/src/knowledge/schema-discovery.service.ts

Important

This module contains security-sensitive logic including PII blocklists and SQL injection prevention. Review security considerations before modifying.


Overview

The Schema Discovery Engine crawls database schemas and indexes table metadata for vector search. It enables the RAG system to understand database structure without manual schema documentation.

graph LR
    A[PostgreSQL] --> B[SchemaDiscoveryService]
    B --> C[Table Introspection]
    B --> D[Column Analysis]
    B --> E[Value Probing]
    C --> F[SchemaContextPort]
    D --> F
    E --> F
    F --> G[Vector Store]

Core Capabilities

1. Schema Introspection

Queries information_schema to discover:

  • Tables: All user tables (excluding system schemas)
  • Columns: Column names, data types, ordinal positions
  • Metadata: Builds embeddings for semantic search

Excluded Schemas:

['pg_catalog', 'information_schema', 'pg_toast']

Excluded Tables:

['migrations', 'schema_embeddings']

2. Value Probing (Optional)

When enabled (JORVIS_VALUE_PROBING_ENABLED=true), the engine samples distinct values from categorical columns to enhance query understanding.

Eligibility Criteria (Whitelist):

TypeEligible
boolean✅ Always
user-defined (ENUM)✅ Always
Columns ending with: _status, _type, _code, _category, _kind, _state, _mode, _level, _priority, _flag
Columns starting with: is_, has_

Blocklist (Never Probe):

['name', 'email', 'phone', 'address', 'ssn', 'password', 
 'token', 'secret', 'note', 'comment', 'description', 
 'text', 'message', 'content', 'body']

3. Security Sanitization

All probed values are sanitized before storage:

CheckAction
SQL Keywords (SELECT, DROP, etc.)Reject
XSS Patterns (<script>, javascript:)Reject
SQL Injection ('--, /*, etc.)Reject
Special CharactersStrip (keep alphanumeric, spaces, basic punctuation)
Value LengthTruncate to 50 chars
CardinalitySkip if > 50 distinct values

API Contract

Primary Method

async discoverAndIndex(): Promise<DiscoveryResult>

Output

interface DiscoveryResult {
  status: 'success' | 'error';
  indexedTables: number;
  durationMs: number;
  message?: string;
}

SchemaItem Structure

interface SchemaItem {
  tableName: string;        // "public.users"
  content: string;          // "Table: public.users\nColumns: id (integer), ..."
  metadata: {
    schema: string;
    columns: string[];
    columnTypes: { name: string; type: string }[];
    sampleValues?: Record<string, string[]>;  // If probing enabled
  };
}

Integration Points

SystemPortPurpose
SchemaContextPortindex(items)Vector storage
DataSourceTypeORMDatabase queries
ConfigServiceN/AFeature flags

Configuration

# Value probing (advanced)
JORVIS_VALUE_PROBING_ENABLED=false  # Default: disabled for security

# Probing limits (hardcoded, change in source)
# MAX_CARDINALITY=50
# MAX_VALUE_LENGTH=50
# PROBE_TIMEOUT_MS=2000

Usage Example

// Trigger discovery (typically on startup or schedule)
const result = await schemaDiscoveryService.discoverAndIndex();

console.log(`Indexed ${result.indexedTables} tables in ${result.durationMs}ms`);

Security Considerations

Caution

Value Probing Risk: Even with blocklists, probing can expose sensitive data if column naming conventions are not followed. Enable only in controlled environments.

Recommendations:

  1. Keep probing disabled by default in production
  2. Audit blocklist regularly for new PII patterns
  3. Use TABLESAMPLE (automatic) to limit data exposure
  4. Set statement timeout (2s default) to prevent long-running probes

Performance

OperationTypical Duration
Schema introspection (100 tables)~500ms
Value probing (100 tables)~10-30s
Embedding generationDepends on vector store