Schema Discovery Engine — Architecture
Schema Discovery Engine — Architecture
Version: v1.0.0
Date: 2026-01-22
Location: analytics-platform/src/knowledge/schema-discovery.service.ts
This module contains security-sensitive logic including PII blocklists and SQL injection prevention. Review security considerations before modifying.
Overview
The Schema Discovery Engine crawls database schemas and indexes table metadata for vector search. It enables the RAG system to understand database structure without manual schema documentation.
graph LR
A[PostgreSQL] --> B[SchemaDiscoveryService]
B --> C[Table Introspection]
B --> D[Column Analysis]
B --> E[Value Probing]
C --> F[SchemaContextPort]
D --> F
E --> F
F --> G[Vector Store]
Core Capabilities
1. Schema Introspection
Queries information_schema to discover:
- Tables: All user tables (excluding system schemas)
- Columns: Column names, data types, ordinal positions
- Metadata: Builds embeddings for semantic search
Excluded Schemas:
['pg_catalog', 'information_schema', 'pg_toast']
Excluded Tables:
['migrations', 'schema_embeddings']
2. Value Probing (Optional)
When enabled (JORVIS_VALUE_PROBING_ENABLED=true), the engine samples distinct values from categorical columns to enhance query understanding.
Eligibility Criteria (Whitelist):
| Type | Eligible |
|---|---|
boolean | ✅ Always |
user-defined (ENUM) | ✅ Always |
Columns ending with: _status, _type, _code, _category, _kind, _state, _mode, _level, _priority, _flag | ✅ |
Columns starting with: is_, has_ | ✅ |
Blocklist (Never Probe):
['name', 'email', 'phone', 'address', 'ssn', 'password',
'token', 'secret', 'note', 'comment', 'description',
'text', 'message', 'content', 'body']
3. Security Sanitization
All probed values are sanitized before storage:
| Check | Action |
|---|---|
SQL Keywords (SELECT, DROP, etc.) | Reject |
XSS Patterns (<script>, javascript:) | Reject |
SQL Injection ('--, /*, etc.) | Reject |
| Special Characters | Strip (keep alphanumeric, spaces, basic punctuation) |
| Value Length | Truncate to 50 chars |
| Cardinality | Skip if > 50 distinct values |
API Contract
Primary Method
async discoverAndIndex(): Promise<DiscoveryResult>
Output
interface DiscoveryResult {
status: 'success' | 'error';
indexedTables: number;
durationMs: number;
message?: string;
}
SchemaItem Structure
interface SchemaItem {
tableName: string; // "public.users"
content: string; // "Table: public.users\nColumns: id (integer), ..."
metadata: {
schema: string;
columns: string[];
columnTypes: { name: string; type: string }[];
sampleValues?: Record<string, string[]>; // If probing enabled
};
}
Integration Points
| System | Port | Purpose |
|---|---|---|
SchemaContextPort | index(items) | Vector storage |
DataSource | TypeORM | Database queries |
ConfigService | N/A | Feature flags |
Configuration
# Value probing (advanced)
JORVIS_VALUE_PROBING_ENABLED=false # Default: disabled for security
# Probing limits (hardcoded, change in source)
# MAX_CARDINALITY=50
# MAX_VALUE_LENGTH=50
# PROBE_TIMEOUT_MS=2000
Usage Example
// Trigger discovery (typically on startup or schedule)
const result = await schemaDiscoveryService.discoverAndIndex();
console.log(`Indexed ${result.indexedTables} tables in ${result.durationMs}ms`);
Security Considerations
Value Probing Risk: Even with blocklists, probing can expose sensitive data if column naming conventions are not followed. Enable only in controlled environments.
Recommendations:
- Keep probing disabled by default in production
- Audit blocklist regularly for new PII patterns
- Use TABLESAMPLE (automatic) to limit data exposure
- Set statement timeout (2s default) to prevent long-running probes
Performance
| Operation | Typical Duration |
|---|---|
| Schema introspection (100 tables) | ~500ms |
| Value probing (100 tables) | ~10-30s |
| Embedding generation | Depends on vector store |