ADR-0011: Graph Architecture (GraphRAG)

Status: DEPRECATED (Superseded by ADR-0026) Date: 2026-01-06 Author: Antigravity Task: Task-020

Context

Jorvis currently relies on a Hybrid Retriever (Semantic + Keyword) to find relevant tables for user queries. While effective for simple questions, this approach struggles with:

  1. Multi-hop Reasoning: Queries like "Revenue per Customer in North Region" require traversing Sales -> Customer -> Region. Vector search often misses the intermediate Customer table if it doesn't semantic match the query terms heavily.
  2. Ambiguity: Distinguishing between "Product" (Entity) and "Product Category" (Dimension) is difficult without structural context.
  3. Hallucination Risks: LLMs generate invalid JOINs if they guess relationships not present in the schema.

Decision

We will implement a Knowledge Graph using Apache AGE (A Graph Extension for PostgreSQL).

1. Technology Choice: Apache AGE

  • Why: It allows us to keep the graph data inside Postgres (or strictly adjacent), leveraging existing backups, tooling, and SQL skills. It supports OpenCypher, the industry standard for graph queries.
  • Alternatives Considered:
    • Neo4j: Powerful but requires a heavy JVM sidecar and separate licensing/infra.
    • NetworkX (In-Memory): Too slow for large schemas and lacks persistence.
    • Recursive SQL: Too complex to maintain for flexible pathfinding.

2. Graph Schema Design

The Knowledge Graph will consist of the following Nodes and Edges:

Nodes:

  • (:Table {name: 'schema.table', description: '...'})
  • (:Column {name: 'col_name', type: 'varchar'})
  • (:Metric {name: 'Revenue', definition: 'SUM(amount)'}) (Future)
  • (:BusinessConcept {name: 'Churn'}) (Future)

Edges:

  • (:Table)-[:HAS_COLUMN]->(:Column)
  • (:Column)-[:FOREIGN_KEY_TO]->(:Column) (Structural Link)
  • (:Table)-[:SEMANTICS]->(:BusinessConcept) (Glossary Link)

3. Integration Strategy (Post-Retrieval)

GraphRAG will be applied Post-Retrieval (Reranking/Expansion phase):

  1. Retrieve: Hybrid Retriever gets Top-K tables (e.g., 5).
  2. Expand: Query the Graph to find immediate neighbors (tables joined by FK) of the Top-K.
  3. Prune: Use an LLM or heuristic to keep only relevant neighbors (e.g., Customer connects Sales and Region).
  4. Context: Feed the subgraph schema to the SQL generation prompt.

4. Data Sync

  • Hydration: A script (schema_to_graph.py) will parse INFORMATION_SCHEMA and Glossary to rebuild the graph periodically (e.g., on deployment or schema change).

Consequences

  • Positive:
    • Explicit modeling of JOIN paths reduces hallucinated SQL.
    • Can answer "How are X and Y related?" deterministically.
  • Negative:
    • Introduces a dependency on the age extension (requires compatible Postgres version, currently PG16/17).
    • Requires a new sync pipeline (schema_to_graph.py).

Implementation Plan

  1. PoC (Done): Verified AGE (PG17) and metadata ingestion in scripts/research/graphrag-poc/.
  2. Infra (Task-021): Add AGE container to global docker-compose.yml (or enable extension in main DB if supported).
  3. Service (Task-021): Create GraphService in analytics-platform to wrap Cypher queries.
  4. FK Extraction (Task-021): Implement FOREIGN_KEY edge creation from information_schema.key_column_usage.

Approval

  • Approved by: Claude-code (Architect)
  • Date: 2026-01-06
  • Notes: Research phase complete. Implementation deferred to Task-021.