Schema Discovery Engine — Architecture

Version: v1.0.0 Date: 2026-01-22 Location: analytics-platform/src/knowledge/schema-discovery.service.ts

Important

This module contains security-sensitive logic including PII blocklists and SQL injection prevention. Review security considerations before modifying.

Overview

The Schema Discovery Engine crawls database schemas and indexes table metadata for vector search. It enables the RAG system to understand database structure without manual schema documentation.

graph LR
    A[PostgreSQL] --> B[SchemaDiscoveryService]
    B --> C[Table Introspection]
    B --> D[Column Analysis]
    B --> E[Value Probing]
    C --> F[SchemaContextPort]
    D --> F
    E --> F
    F --> G[Vector Store]

Core Capabilities

1. Schema Introspection

Queries information_schema to discover:

Tables: All user tables (excluding system schemas)
Columns: Column names, data types, ordinal positions
Metadata: Builds embeddings for semantic search

Excluded Schemas:

['pg_catalog', 'information_schema', 'pg_toast']

Excluded Tables:

['migrations', 'schema_embeddings']

2. Value Probing (Optional)

When enabled (JORVIS_VALUE_PROBING_ENABLED=true), the engine samples distinct values from categorical columns to enhance query understanding.

Eligibility Criteria (Whitelist):

Type	Eligible
`boolean`	✅ Always
`user-defined` (ENUM)	✅ Always
Columns ending with: `_status`, `_type`, `_code`, `_category`, `_kind`, `_state`, `_mode`, `_level`, `_priority`, `_flag`	✅
Columns starting with: `is_`, `has_`	✅

Blocklist (Never Probe):

['name', 'email', 'phone', 'address', 'ssn', 'password', 
 'token', 'secret', 'note', 'comment', 'description', 
 'text', 'message', 'content', 'body']

3. Security Sanitization

All probed values are sanitized before storage:

Check	Action
SQL Keywords (`SELECT`, `DROP`, etc.)	Reject
XSS Patterns (`<script>`, `javascript:`)	Reject
SQL Injection (`'--`, `/*`, etc.)	Reject
Special Characters	Strip (keep alphanumeric, spaces, basic punctuation)
Value Length	Truncate to 50 chars
Cardinality	Skip if > 50 distinct values

API Contract

Primary Method

async discoverAndIndex(): Promise<DiscoveryResult>

Output

interface DiscoveryResult {
  status: 'success' | 'error';
  indexedTables: number;
  durationMs: number;
  message?: string;
}

SchemaItem Structure

interface SchemaItem {
  tableName: string;        // "public.users"
  content: string;          // "Table: public.users\nColumns: id (integer), ..."
  metadata: {
    schema: string;
    columns: string[];
    columnTypes: { name: string; type: string }[];
    sampleValues?: Record<string, string[]>;  // If probing enabled
  };
}

Integration Points

System	Port	Purpose
`SchemaContextPort`	`index(items)`	Vector storage
`DataSource`	TypeORM	Database queries
`ConfigService`	N/A	Feature flags

Configuration

# Value probing (advanced)
JORVIS_VALUE_PROBING_ENABLED=false  # Default: disabled for security

# Probing limits (hardcoded, change in source)
# MAX_CARDINALITY=50
# MAX_VALUE_LENGTH=50
# PROBE_TIMEOUT_MS=2000

Usage Example

// Trigger discovery (typically on startup or schedule)
const result = await schemaDiscoveryService.discoverAndIndex();

console.log(`Indexed ${result.indexedTables} tables in ${result.durationMs}ms`);

Security Considerations

Caution

Value Probing Risk: Even with blocklists, probing can expose sensitive data if column naming conventions are not followed. Enable only in controlled environments.

Recommendations:

Keep probing disabled by default in production
Audit blocklist regularly for new PII patterns
Use TABLESAMPLE (automatic) to limit data exposure
Set statement timeout (2s default) to prevent long-running probes

Performance

Operation	Typical Duration
Schema introspection (100 tables)	~500ms
Value probing (100 tables)	~10-30s
Embedding generation	Depends on vector store