Data Sources Implementation Status

Version: 3.0
Updated: 2026-03-22
Status: Canonical current-state inventory after the shipped data-surface and embedding-v2 adoption waves
Primary audience: New agents, architects, gatekeepers, executors


Purpose

This document is the canonical runtime-grounded inventory of what Jorvis supports today and what remains absent.

Use this document before making roadmap claims, Stage 0 specs, or investor/demo statements.

Related docs:

  • docs/architecture/DATA_SOURCES_STRATEGY.md — post-wave planning strategy after the shipped expansion wave
  • docs/handoff/HANDOFF_TO_NEXT_AGENT.md — current runtime snapshot and onboarding order
  • docs/agent_ops/TASK_BOARD.md — active and proposed task lanes

Current Truth Summary

Verified current surfaces

  • Redshift
  • BigQuery
  • Snowflake
  • PostgreSQL first-class provider
  • MySQL first-class provider
  • Google Sheets read-only connector
  • Google Docs read-only connector
  • Google Drive read-only connector
  • Direct text/markdown ingest
  • Upload-based file processing
  • OCR image extraction
  • Voice STT/TTS
  • Internal pgvector-based retrieval
  • Gemini Embedding 2 adoption lane completed; full rerun remains the next validation pass

Still unsupported as first-class product paths

  • Google Slides read-only ingestion
  • Legacy .ppt support
  • Structured CSV parsing as a first-class processor
  • SQL Server / Oracle / SQLite
  • MongoDB / Elasticsearch / OpenSearch / external vector / graph / time-series connectors

File And Document Inputs

| Type | Actual Status | Current Module(s) | Main Blocker / Constraint | Recommended Next Step |
|---|---|---|---|---|
| Google Sheets | Verified current read-only connector | analytics-platform/src/modules/external/google-sheets/google-sheets.service.ts, analytics-platform/src/modules/mcp/tools/google-sheets.tool.ts | Read-only only; not a full spreadsheet platform | Current surface |
| Google Docs | Verified current read-only connector | analytics-platform/src/modules/external/google-docs-drive/google-docs-drive.service.ts, related tools registry wiring | Service-account/shared-access only; no end-user OAuth; no write path | Current surface |
| Google Slides | Missing | No active connector module | No service, no tool | Fresh Stage 0 only if product need appears |
| Google Drive | Verified current read-only connector | analytics-platform/src/modules/external/google-docs-drive/google-docs-drive.service.ts, related tools registry wiring | Read-only only; no write/update/delete; no broad auth redesign | Current surface |
| Text / Markdown | Verified current direct-ingest surface | analytics-platform/src/document/document.controller.ts, analytics-platform/src/document/document.service.ts, analytics-platform/src/document/processors/text.processor.ts | Internal service-token path, not equivalent to general end-user upload UX | Current surface |
| CSV | Verified current raw-text upload handling | analytics-platform/src/document/processors/text.processor.ts | Upload path treats CSV as raw text, not structured parsing | Current surface; structured CSV remains future work |
| JSON | Verified current file processor | analytics-platform/src/document/processors/json.processor.ts, analytics-platform/src/document/processors/processor.factory.ts | Current upload path semantics apply | Current surface |
| HTML | Verified current file processor | analytics-platform/src/document/processors/html.processor.ts, analytics-platform/src/document/processors/processor.factory.ts | Current upload path semantics apply | Current surface |
| DOCX | Verified current file processor | analytics-platform/src/document/processors/office.processor.ts | Current upload path semantics apply | Current surface |
| XLSX / XLS | Verified current file processor | analytics-platform/src/document/processors/office.processor.ts, analytics-platform/src/modules/external/excel/excel.service.ts | Current upload path semantics apply; helper path exists | Current surface |
| PPTX | Verified current file processor | analytics-platform/src/document/processors/pptx.processor.ts, analytics-platform/src/document/processors/processor.factory.ts | .ppt still unsupported | Current surface |
| PPT (legacy) | Missing | No processor | No parser; explicitly out of Lane 4 scope | Not now |
| PDF | Verified current file processor | analytics-platform/src/document/processors/pdf.processor.ts | Scanned PDFs remain limited by OCR quality, not by product enablement | Current surface |
| Images | Verified current OCR surface | analytics-platform/src/document/processors/ocr.processor.ts | OCR-only for image inputs | Current surface |
| Audio | Strong as voice I/O, not as knowledge source | analytics-platform/src/voice/voice.module.ts, analytics-platform/src/voice/audio.service.ts | No audio-as-document retrieval lane | No new lane now |

Database And Connector Surface

| Type | Actual Status | Current Module(s) | Main Blocker / Constraint | Recommended Next Step |
|---|---|---|---|---|
| PostgreSQL | Verified production surface | analytics-platform/src/data/db-factory.service.ts, analytics-platform/src/data/data.service.ts, analytics-platform/src/data/secrets.service.ts | None beyond normal hardening | Current surface |
| Redshift | Verified production surface | analytics-platform/src/data/db-factory.service.ts, analytics-platform/src/data/redshift.service.ts | None | Current surface |
| BigQuery | Verified production surface | analytics-platform/src/data/db-factory.service.ts, analytics-platform/src/data/secrets.service.ts | None | Current surface |
| Snowflake | Verified production surface | analytics-platform/src/data/db-factory.service.ts, analytics-platform/src/data/secrets.service.ts | None | Current surface |
| MySQL | Verified production surface | analytics-platform/src/data/db-factory.service.ts, analytics-platform/src/data/data.service.ts, analytics-platform/src/data/secrets.service.ts | Runtime path itself is first-class | Current surface |
| SQL Server | Missing | No runtime provider | No driver/provider path | Fresh Stage 0 only if product need appears |
| Oracle | Missing | No runtime provider | No driver/provider path | Not now |
| SQLite | Missing | No runtime provider | No provider path | Not now |
| MongoDB | Missing | No user-facing connector | Absent connector/query layer | Not now |
| Redis as data source | Missing / infra only | Infra mentions only | Not a user-queryable source path | Not now |
| Neo4j / external graph DBs | Missing as user-facing source | Internal graph facilities only | No user-queryable connector | Not now |
| Elasticsearch / OpenSearch / Splunk | Missing | No connector | Absent source path | Not now |
| Time-series DBs | Missing | No connector | Absent source path | Not now |
| External vector DBs | Missing | Internal pgvector only | No vector-store abstraction / connector | Not now |

Important Nuances New Agents Must Not Miss

1. Direct text ingest and file upload are both live, but they are distinct paths

  • POST /v1/documents/ingest remains available through the internal direct-ingest path.
  • POST /v1/documents/upload is a separate upload/file-processing path and is now part of the verified current product surface.

Do not collapse these into one statement like “the whole document lane is enabled” or “the whole document lane is unavailable.”
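The distinction can be sketched as a small routing helper. Only the two endpoint paths come from this document; the `DocumentInput` type and the `documentEndpointFor` function are hypothetical names used for illustration and are not taken from the Jorvis codebase.

```typescript
// Illustrative only: DocumentInput and documentEndpointFor are hypothetical.
// The two paths are the ones named in this document.
type DocumentInput =
  | { kind: "text"; content: string } // raw text / markdown for direct ingest
  | { kind: "file"; filename: string }; // a file destined for upload processing

const INGEST_PATH = "/v1/documents/ingest";
const UPLOAD_PATH = "/v1/documents/upload";

// Returns the path a caller should target for a given input kind.
function documentEndpointFor(input: DocumentInput): string {
  return input.kind === "text" ? INGEST_PATH : UPLOAD_PATH;
}
```

A text payload resolves to the ingest path and a file resolves to the upload path. The sketch says nothing about authentication, which differs between the two paths.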

2. Upload-based file processing is a verified product surface

The upload-based path now includes JSON, HTML, PPTX, PDF, DOCX, XLS/XLSX, CSV-as-raw-text, and image/OCR handling.

That means these processors exist in code, are part of the shipped source, and should be described as current product surfaces rather than future work.
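As a rough illustration of that surface, the sketch below maps file extensions to the processor files listed in the inventory table. It is a hypothetical lookup, not the shipped `processor.factory.ts`, which may dispatch differently (for example by MIME type). The function name `processorForUpload` and the image extensions (`png`, `jpg`, `jpeg`) are assumptions made for the example.

```typescript
// Hypothetical lookup table; the shipped factory may dispatch differently.
// Processor file names are taken from the inventory table in this document.
// The image extensions (png, jpg, jpeg) are an assumption for illustration.
const PROCESSOR_BY_EXTENSION = new Map<string, string>([
  ["json", "json.processor.ts"],
  ["html", "html.processor.ts"],
  ["pptx", "pptx.processor.ts"],
  ["pdf", "pdf.processor.ts"],
  ["docx", "office.processor.ts"],
  ["xlsx", "office.processor.ts"],
  ["xls", "office.processor.ts"],
  ["csv", "text.processor.ts"], // CSV is handled as raw text, not parsed
  ["png", "ocr.processor.ts"],
  ["jpg", "ocr.processor.ts"],
  ["jpeg", "ocr.processor.ts"],
]);

// Returns the processor file name for a filename, or null when the
// extension is missing or not part of the current upload surface.
function processorForUpload(filename: string): string | null {
  const dot = filename.lastIndexOf(".");
  if (dot < 0) return null;
  const ext = filename.slice(dot + 1).toLowerCase();
  return PROCESSOR_BY_EXTENSION.get(ext) ?? null;
}
```

Under these assumptions, `deck.pptx` resolves to the PPTX processor, `deck.ppt` resolves to `null` because legacy PowerPoint is unsupported, and `data.csv` resolves to the text processor.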

3. Google Docs / Drive are real, but narrowly scoped

Google Docs and Drive support are genuine, merged read-only connectors. They are also explicitly bounded:

  • service-account / shared-access only
  • no end-user OAuth
  • no write/update/delete path

Do not overclaim broader auth or edit support.

4. PPTX means .pptx, not legacy .ppt

The shipped processor is for Office Open XML .pptx files only. Do not claim support for legacy binary PowerPoint .ppt.

5. Voice is strong; audio-as-knowledge is not

The voice pipeline is one of the most mature subsystems in the repo. That does not mean uploaded audio files are already a first-class knowledge-retrieval source.

6. Redis is aspirational infra here

ioredis is installed as a dependency, but core services in the current runtime still keep their state in in-memory maps. Do not describe Redis as an active Jorvis data source or active cache backend; either would require additional implementation work.
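To make "in-memory maps" concrete, here is a minimal fixed-window rate limiter backed by a `Map`. It is a hypothetical sketch and not the code in `rate-limiter.service.ts`; the class name, method name, and fixed-window policy are all assumptions chosen for illustration.

```typescript
// Hypothetical fixed-window limiter backed by a Map. This is NOT the code in
// rate-limiter.service.ts; it only illustrates per-process, in-memory state.
class InMemoryRateLimiter {
  private readonly windows = new Map<string, { start: number; count: number }>();
  private readonly limit: number;
  private readonly windowMs: number;

  constructor(limit: number, windowMs: number) {
    this.limit = limit;
    this.windowMs = windowMs;
  }

  // Returns true if the call is allowed; `now` is injectable for testing.
  tryAcquire(key: string, now: number = Date.now()): boolean {
    const entry = this.windows.get(key);
    // Start a fresh window when none exists or the previous one has expired.
    if (entry === undefined || now - entry.start >= this.windowMs) {
      this.windows.set(key, { start: now, count: 1 });
      return true;
    }
    if (entry.count < this.limit) {
      entry.count += 1;
      return true;
    }
    return false;
  }
}
```

State held this way lives inside a single process: it is lost on restart, and two instances do not share counters. That is the practical difference from a Redis-backed limiter.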

7. The current embedding baseline remains text-only Gemini 001 at 768 dims

Gemini Embedding 2 adoption work is complete, but the live baseline remains gemini-embedding-001 until the corpus-refresh rerun is executed:

  • current adapter: gemini-embedding-001
  • current normalized vector dimension: 768
  • Gemini Embedding 2: adoption lane complete; rerun remains the next verification pass

Treat the embedding-v2 rerun as validation of the refreshed corpus, not as a new adoption decision.
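One way to keep the 768-dimension baseline honest is a shape check before a vector is stored or compared. The sketch below is an assumption about how such a guard could look; the helper name `embeddingShapeError` is hypothetical, and only the dimension value comes from this document.

```typescript
// The dimension comes from this document; the helper name is hypothetical.
const EXPECTED_EMBEDDING_DIM = 768;

// Returns null when the vector is acceptable, otherwise a short description
// of the first problem found.
function embeddingShapeError(vector: readonly number[]): string | null {
  if (vector.length !== EXPECTED_EMBEDDING_DIM) {
    return `expected ${EXPECTED_EMBEDDING_DIM} dimensions, got ${vector.length}`;
  }
  const bad = vector.findIndex((v) => !Number.isFinite(v));
  return bad >= 0 ? `non-finite value at index ${bad}` : null;
}
```

A caller could reject any vector for which this returns a non-null message, which would surface a dimension mismatch early instead of at query time.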


Current Internal / Gated / Infra Distinctions

| Topic | Real State | Primary Reference |
|---|---|---|
| Direct text ingest | Live | analytics-platform/src/document/document.controller.ts, analytics-platform/src/document/document.service.ts |
| Upload-based file processing | Live and verified | analytics-platform/src/document/document.service.ts, docs/architecture/FEATURE_FLAGS.md |
| OCR | Live and verified | analytics-platform/src/document/processors/ocr.processor.ts, docs/architecture/FEATURE_FLAGS.md |
| Google Docs / Drive auth boundary | Read-only, service-account/shared-access only | analytics-platform/src/modules/external/google-docs-drive/google-docs-drive.service.ts, docs/operations/GOOGLE_DOCS_DRIVE_READONLY.md |
| Embedding baseline | Gemini text-only, 768-dim | analytics-platform/src/ai/embedding/gemini-embedding.adapter.ts, analytics-platform/src/migrations/1772391946000-AlignVectorDimensionsTo768.ts |
| Gemini Embedding 2 | Adoption lane complete; full rerun pending as verification | docs/architecture/adr/ADR-0029-openclaw-operator-trust-and-deferred-integration-boundaries.md, docs/agent_ops/specs/task_gemini_embedding2_live_eval_stage0_spec.md |
| Redis | Installed dependency, not active runtime backend in core services | analytics-platform/package.json, analytics-platform/src/resilience/rate-limiter.service.ts |

Post-Wave Position

The original six-lane data-surface wave is now largely consumed:

| Lane | Outcome | Notes |
|---|---|---|
| Lane 1: PostgreSQL | COMPLETE + deployed | First-class provider is live in shipped source |
| Lane 2: Google Docs / Drive RO | COMPLETE + deployed | Read-only connector with explicit read-only auth boundaries |
| Lane 3: JSON + HTML | COMPLETE + deployed | First-class processors in the current upload path |
| Lane 4: PPTX | COMPLETE + deployed | .pptx only; shipped via the verified upload path |
| Lane 5: MySQL | COMPLETE + deployed | First-class provider is live in shipped source |
| Lane 6: Gemini Embedding 2 | COMPLETE + verified (migration lane) | Current verified 001 baseline remains live until the corpus-refresh rerun completes |

There is no automatic next-wave connector plan carried forward from the original March 12 kickoff packets. Any new connector/document-expansion work now requires a fresh Stage 0 and explicit GO.


Remaining Opportunities (Not Automatically Active)

  • Google Slides read-only ingestion, if a real product/use-case need appears
  • Structured CSV parsing, if raw-text CSV stops being sufficient
  • SQL Server / Oracle / SQLite connector expansion
  • External search/vector/graph/time-series connector work
  • Any new embedding-family research beyond the current verified baseline

These are possible future slices, not active roadmap commitments.


Bottom Line

Jorvis now has:

  • a strong warehouse SQL path,
  • first-class PostgreSQL and MySQL support,
  • real Google Sheets plus Google Docs / Drive read-only ingestion,
  • first-class JSON / HTML / PPTX / PDF / DOCX / XLSX processors within the verified upload path,
  • a mature voice stack,
  • and a current verified embedding baseline that remains gemini-embedding-001 until the rerun completes.

The most important truths new agents should carry forward are:

  1. the original six-lane expansion wave is already consumed,
  2. Lane 6 migration work is complete, but the corpus-refresh rerun is still the next validation step,
  3. future connector or ingestion expansion needs a fresh Stage 0,
  4. upload-based file support is now part of the verified current product surface.