ADR-0011: Graph Architecture (GraphRAG)

Status: DEPRECATED (Superseded by ADR-0026) Date: 2026-01-06 Author: Antigravity Task: Task-020

Context

Jorvis currently relies on a Hybrid Retriever (Semantic + Keyword) to find relevant tables for user queries. While effective for simple questions, this approach struggles with:

Multi-hop Reasoning: Queries like "Revenue per Customer in North Region" require traversing Sales -> Customer -> Region. Vector search often misses the intermediate Customer table if it doesn't semantic match the query terms heavily.
Ambiguity: Distinguishing between "Product" (Entity) and "Product Category" (Dimension) is difficult without structural context.
Hallucination Risks: LLMs generate invalid JOINs if they guess relationships not present in the schema.

Decision

We will implement a Knowledge Graph using Apache AGE (A Graph Extension for PostgreSQL).

1. Technology Choice: Apache AGE

Why: It allows us to keep the graph data inside Postgres (or strictly adjacent), leveraging existing backups, tooling, and SQL skills. It supports OpenCypher, the industry standard for graph queries.
Alternatives Considered:
- Neo4j: Powerful but requires a heavy JVM sidecar and separate licensing/infra.
- NetworkX (In-Memory): Too slow for large schemas and lacks persistence.
- Recursive SQL: Too complex to maintain for flexible pathfinding.

2. Graph Schema Design

The Knowledge Graph will consist of the following Nodes and Edges:

Nodes:

(:Table {name: 'schema.table', description: '...'})
(:Column {name: 'col_name', type: 'varchar'})
(:Metric {name: 'Revenue', definition: 'SUM(amount)'}) (Future)
(:BusinessConcept {name: 'Churn'}) (Future)

Edges:

(:Table)-[:HAS_COLUMN]->(:Column)
(:Column)-[:FOREIGN_KEY_TO]->(:Column) (Structural Link)
(:Table)-[:SEMANTICS]->(:BusinessConcept) (Glossary Link)

3. Integration Strategy (Post-Retrieval)

GraphRAG will be applied Post-Retrieval (Reranking/Expansion phase):

Retrieve: Hybrid Retriever gets Top-K tables (e.g., 5).
Expand: Query the Graph to find immediate neighbors (tables joined by FK) of the Top-K.
Prune: Use an LLM or heuristic to keep only relevant neighbors (e.g., Customer connects Sales and Region).
Context: Feed the subgraph schema to the SQL generation prompt.

4. Data Sync

Hydration: A script (schema_to_graph.py) will parse INFORMATION_SCHEMA and Glossary to rebuild the graph periodically (e.g., on deployment or schema change).

Consequences

Positive:
- Explicit modeling of JOIN paths reduces hallucinated SQL.
- Can answer "How are X and Y related?" deterministically.
Negative:
- Introduces a dependency on the age extension (requires compatible Postgres version, currently PG16/17).
- Requires a new sync pipeline (schema_to_graph.py).

Implementation Plan

PoC (Done): Verified AGE (PG17) and metadata ingestion in scripts/research/graphrag-poc/.
Infra (Task-021): Add AGE container to global docker-compose.yml (or enable extension in main DB if supported).
Service (Task-021): Create GraphService in analytics-platform to wrap Cypher queries.
FK Extraction (Task-021): Implement FOREIGN_KEY edge creation from information_schema.key_column_usage.

Approval

Approved by: Claude-code (Architect)
Date: 2026-01-06
Notes: Research phase complete. Implementation deferred to Task-021.