Data Sources Implementation Status
Version: 3.0
Updated: 2026-03-22
Status: Canonical current-state inventory after the shipped data-surface and embedding-v2 adoption waves
Primary audience: New agents, architects, gatekeepers, executors
Purpose
This document is the canonical runtime-grounded inventory of what Jorvis supports today and what remains absent.
Use this document before making roadmap claims, Stage 0 specs, or investor/demo statements.
Related docs:
- docs/architecture/DATA_SOURCES_STRATEGY.md: post-wave planning strategy after the shipped expansion wave
- docs/handoff/HANDOFF_TO_NEXT_AGENT.md: current runtime snapshot and onboarding order
- docs/agent_ops/TASK_BOARD.md: active and proposed task lanes
Current Truth Summary
Verified current surfaces
- Redshift
- BigQuery
- Snowflake
- PostgreSQL first-class provider
- MySQL first-class provider
- Google Sheets read-only connector
- Google Docs read-only connector
- Google Drive read-only connector
- Direct text/markdown ingest
- Upload-based file processing
- OCR image extraction
- Voice STT/TTS
- Internal pgvector-based retrieval
- Gemini Embedding 2 adoption lane completed; full rerun remains the next validation pass
Still unsupported as first-class product paths
- Google Slides read-only ingestion
- Legacy .ppt support
- Structured CSV parsing as a first-class processor
- SQL Server / Oracle / SQLite
- MongoDB / Elasticsearch / OpenSearch / external vector / graph / time-series connectors
File And Document Inputs
| Type | Actual Status | Current Module(s) | Main Blocker / Constraint | Recommended Next Step |
|---|---|---|---|---|
| Google Sheets | Verified current read-only connector | analytics-platform/src/modules/external/google-sheets/google-sheets.service.ts, analytics-platform/src/modules/mcp/tools/google-sheets.tool.ts | Read-only; not a full spreadsheet platform | Current surface |
| Google Docs | Verified current read-only connector | analytics-platform/src/modules/external/google-docs-drive/google-docs-drive.service.ts, related tools registry wiring | Service-account/shared-access only; no end-user OAuth; no write path | Current surface |
| Google Slides | Missing | No active connector module | No service, no tool | Fresh Stage 0 only if product need appears |
| Google Drive | Verified current read-only connector | analytics-platform/src/modules/external/google-docs-drive/google-docs-drive.service.ts, related tools registry wiring | Read-only; no write/update/delete; no broad auth redesign | Current surface |
| Text / Markdown | Verified current direct-ingest surface | analytics-platform/src/document/document.controller.ts, analytics-platform/src/document/document.service.ts, analytics-platform/src/document/processors/text.processor.ts | Internal service-token path, not equivalent to general end-user upload UX | Current surface |
| CSV | Verified current raw-text upload handling | analytics-platform/src/document/processors/text.processor.ts | Upload path treats CSV as raw text, not structured parsing | Current surface; structured CSV remains future work |
| JSON | Verified current file processor | analytics-platform/src/document/processors/json.processor.ts, analytics-platform/src/document/processors/processor.factory.ts | Current upload path semantics apply | Current surface |
| HTML | Verified current file processor | analytics-platform/src/document/processors/html.processor.ts, analytics-platform/src/document/processors/processor.factory.ts | Current upload path semantics apply | Current surface |
| DOCX | Verified current file processor | analytics-platform/src/document/processors/office.processor.ts | Current upload path semantics apply | Current surface |
| XLSX / XLS | Verified current file processor | analytics-platform/src/document/processors/office.processor.ts, analytics-platform/src/modules/external/excel/excel.service.ts | Current upload path semantics apply; helper path exists | Current surface |
| PPTX | Verified current file processor | analytics-platform/src/document/processors/pptx.processor.ts, analytics-platform/src/document/processors/processor.factory.ts | .ppt still unsupported | Current surface |
| PPT (legacy) | Missing | No processor | No parser; explicitly out of Lane 4 scope | Not now |
| PDF | Verified current file processor | analytics-platform/src/document/processors/pdf.processor.ts | Scanned PDFs remain limited by OCR quality, not by product enablement | Current surface |
| Images | Verified current OCR surface | analytics-platform/src/document/processors/ocr.processor.ts | OCR-only for image inputs | Current surface |
| Audio | Strong as voice I/O, not as knowledge source | analytics-platform/src/voice/voice.module.ts, analytics-platform/src/voice/audio.service.ts | No audio-as-document retrieval lane | No new lane now |
Database And Connector Surface
| Type | Actual Status | Current Module(s) | Main Blocker / Constraint | Recommended Next Step |
|---|---|---|---|---|
| PostgreSQL | Verified production surface | analytics-platform/src/data/db-factory.service.ts, analytics-platform/src/data/data.service.ts, analytics-platform/src/data/secrets.service.ts | None beyond normal hardening | Current surface |
| Redshift | Verified production surface | analytics-platform/src/data/db-factory.service.ts, analytics-platform/src/data/redshift.service.ts | None | Current surface |
| BigQuery | Verified production surface | analytics-platform/src/data/db-factory.service.ts, analytics-platform/src/data/secrets.service.ts | None | Current surface |
| Snowflake | Verified production surface | analytics-platform/src/data/db-factory.service.ts, analytics-platform/src/data/secrets.service.ts | None | Current surface |
| MySQL | Verified production surface | analytics-platform/src/data/db-factory.service.ts, analytics-platform/src/data/data.service.ts, analytics-platform/src/data/secrets.service.ts | Runtime path itself is first-class | Current surface |
| SQL Server | Missing | No runtime provider | No driver/provider path | Fresh Stage 0 only if product need appears |
| Oracle | Missing | No runtime provider | No driver/provider path | Not now |
| SQLite | Missing | No runtime provider | No provider path | Not now |
| MongoDB | Missing | No user-facing connector | Absent connector/query layer | Not now |
| Redis as data source | Missing / infra only | Infra mentions only | Not a user-queryable source path | Not now |
| Neo4j / external graph DBs | Missing as user-facing source | Internal graph facilities only | No user-queryable connector | Not now |
| Elasticsearch / OpenSearch / Splunk | Missing | No connector | Absent source path | Not now |
| Time-series DBs | Missing | No connector | Absent source path | Not now |
| External vector DBs | Missing | Internal pgvector only | No vector-store abstraction / connector | Not now |
Important Nuances New Agents Must Not Miss
1. Direct text ingest and file upload are both live, but they are distinct paths
- POST /v1/documents/ingest remains available through the internal direct-ingest path.
- POST /v1/documents/upload is a separate upload/file-processing path and is now part of the verified current product surface.
Do not collapse these into one statement like “the whole document lane is enabled” or “the whole document lane is unavailable.”
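One way to keep the two paths from being conflated in client code is to model them as separate route entries. The following is a minimal sketch, not the Jorvis client: only the two route paths come from this document, while the content types, the `planRequest` helper, and the `https://jorvis.example` base URL are hypothetical.

```typescript
// Only the two route paths come from this document. Content types, helper
// names, and the base URL are hypothetical illustrations.
type DocumentRoute = "ingest" | "upload";

interface RequestPlan {
  method: "POST";
  url: string;
  contentType: string;
}

const ROUTES: Record<DocumentRoute, string> = {
  ingest: "/v1/documents/ingest", // internal direct text/markdown ingest
  upload: "/v1/documents/upload", // upload-based file processing
};

// Assumed content types: JSON for direct text ingest, multipart for file upload.
function planRequest(route: DocumentRoute, baseUrl: string): RequestPlan {
  const trimmedBase = baseUrl.replace(/\/+$/, "");
  return {
    method: "POST",
    url: trimmedBase + ROUTES[route],
    contentType: route === "ingest" ? "application/json" : "multipart/form-data",
  };
}

const ingestPlan = planRequest("ingest", "https://jorvis.example/");
const uploadPlan = planRequest("upload", "https://jorvis.example/");
```

Keeping the routes in a typed map means a statement about "the document lane" has to name which of the two paths it refers to.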
2. Upload-based file processing is a verified product surface
The upload-based path now includes JSON, HTML, PPTX, PDF, DOCX, XLS/XLSX, CSV-as-raw-text, and image/OCR handling.
That means these processors exist in code, are part of the shipped source, and should be described as current product surfaces rather than future work.
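The routing implied by this list can be pictured as a lookup from file extension to processor family. This is an illustrative sketch only: the real selection logic lives in processor.factory.ts and may differ, and the `resolveProcessor` helper, the `ProcessorKind` labels, and the image extensions are assumptions. The format-to-processor pairing follows the File And Document Inputs table.

```typescript
// Hypothetical routing table; not the actual processor.factory.ts logic.
type ProcessorKind = "text" | "json" | "html" | "office" | "pptx" | "pdf" | "ocr";

const PROCESSOR_BY_EXTENSION = new Map<string, ProcessorKind>([
  ["csv", "text"], // CSV is handled as raw text, not parsed into rows and columns
  ["json", "json"],
  ["html", "html"],
  ["docx", "office"],
  ["xlsx", "office"],
  ["xls", "office"],
  ["pptx", "pptx"], // legacy .ppt is deliberately absent
  ["pdf", "pdf"],
  ["png", "ocr"], // image extensions are illustrative assumptions
  ["jpg", "ocr"],
]);

// Returns the processor family for a filename, or undefined when unsupported.
function resolveProcessor(filename: string): ProcessorKind | undefined {
  const dot = filename.lastIndexOf(".");
  if (dot < 0) return undefined;
  const extension = filename.slice(dot + 1).toLowerCase();
  return PROCESSOR_BY_EXTENSION.get(extension);
}
```

In this sketch an unsupported input such as `deck.ppt` resolves to `undefined` rather than falling back to another processor, which matches the way this document treats missing formats as explicitly absent.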
3. Google Docs / Drive are real, but narrowly scoped
Google Docs and Drive support are genuine, merged read-only connectors. They are also explicitly bounded:
- service-account / shared-access only
- no end-user OAuth
- no write/update/delete path
Do not overclaim broader auth or edit support.
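The read-only boundary can be expressed as a simple guard over requested OAuth scopes. This is a sketch under assumptions: this document does not state which scopes the connector actually requests, so `ASSUMED_READONLY_SCOPES` and `isReadOnlyScopeSet` are hypothetical. The two scope strings themselves are standard Google API read-only scopes.

```typescript
// Assumption: the connector requests only read-only scopes such as these.
// The scope strings are real Google API scopes; their use by Jorvis is not
// confirmed by this document.
const ASSUMED_READONLY_SCOPES: readonly string[] = [
  "https://www.googleapis.com/auth/drive.readonly",
  "https://www.googleapis.com/auth/documents.readonly",
];

// True only when at least one scope is present and every scope is read-only.
function isReadOnlyScopeSet(scopes: readonly string[]): boolean {
  return scopes.length > 0 && scopes.every((scope) => scope.endsWith(".readonly"));
}

const readOnlyOk = isReadOnlyScopeSet(ASSUMED_READONLY_SCOPES);
const withWriteScope = isReadOnlyScopeSet([
  ...ASSUMED_READONLY_SCOPES,
  "https://www.googleapis.com/auth/drive", // full Drive access, includes writes
]);
```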
4. PPTX means .pptx, not legacy .ppt
The shipped processor is for Office Open XML .pptx files only.
Do not claim support for legacy binary PowerPoint .ppt.
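The two formats differ at the container level: .pptx is a ZIP archive, while legacy .ppt is an OLE2 Compound File, so they can be told apart by their leading bytes even when a file is misnamed. The sketch below illustrates that check. Whether the shipped processor performs any such sniffing is not stated in this document, and `sniffContainer` is a hypothetical helper.

```typescript
// ZIP local-file header, shared by all Office Open XML containers (.pptx, .docx, .xlsx).
const ZIP_MAGIC = [0x50, 0x4b, 0x03, 0x04];
// OLE2 Compound File header, used by legacy binary Office formats such as .ppt.
const OLE2_MAGIC = [0xd0, 0xcf, 0x11, 0xe0, 0xa1, 0xb1, 0x1a, 0xe1];

type ContainerKind = "zip-ooxml-candidate" | "ole2-legacy" | "unknown";

function hasPrefix(bytes: Uint8Array, magic: number[]): boolean {
  return bytes.length >= magic.length && magic.every((value, i) => bytes[i] === value);
}

// A ZIP header only makes the file a candidate for .pptx; it does not prove
// the archive contains a presentation.
function sniffContainer(bytes: Uint8Array): ContainerKind {
  if (hasPrefix(bytes, ZIP_MAGIC)) return "zip-ooxml-candidate";
  if (hasPrefix(bytes, OLE2_MAGIC)) return "ole2-legacy";
  return "unknown";
}
```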
5. Voice is strong; audio-as-knowledge is not
The voice pipeline is one of the most mature subsystems in the repo. That does not mean uploaded audio files are already a first-class knowledge-retrieval source.
6. Redis is aspirational infra here
ioredis is installed, but the live services still use in-memory maps in the current runtime.
Do not describe Redis as an active Jorvis data source or active cache backend without additional implementation work.
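To make "in-memory maps" concrete, here is a minimal fixed-window counter backed by a `Map`. It is a hypothetical illustration of the pattern and not the code in rate-limiter.service.ts; the class name, method names, and the windowing strategy are all assumptions.

```typescript
// Hypothetical fixed-window counter held entirely in process memory.
interface WindowState {
  start: number; // window start, in milliseconds
  count: number; // requests accepted in this window
}

class InMemoryWindowCounter {
  private readonly windows = new Map<string, WindowState>();
  private readonly limit: number;
  private readonly windowMs: number;

  constructor(limit: number, windowMs: number) {
    this.limit = limit;
    this.windowMs = windowMs;
  }

  // Returns true if the request is allowed; `now` is injectable for testing.
  tryAcquire(key: string, now: number = Date.now()): boolean {
    let state = this.windows.get(key);
    if (!state || now - state.start >= this.windowMs) {
      state = { start: now, count: 0 }; // open a fresh window
      this.windows.set(key, state);
    }
    if (state.count >= this.limit) return false;
    state.count += 1;
    return true;
  }
}
```

State held this way is private to one process and is lost on restart, which is the practical difference from a shared Redis backend.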
7. The current embedding baseline remains text-only Gemini 001 at 768 dims
Gemini Embedding 2 adoption work is complete, but the live baseline remains the current verified 001 lane until the corpus-refresh rerun is executed:
- current adapter: gemini-embedding-001
- current normalized vector dimension: 768
- Gemini Embedding 2: adoption lane complete; rerun remains the next verification pass
Treat the embedding-v2 rerun as validation of the refreshed corpus, not as a new adoption decision.
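A dimension guard is one way to keep the 768-dim baseline from drifting silently, for example if a differently sized vector reaches the storage layer. The sketch below is an assumption-labeled illustration: `BASELINE_MODEL`, `BASELINE_DIMENSION`, and `assertBaselineDimension` are hypothetical names, and only the model identifier and the 768 figure come from this document.

```typescript
// Values taken from this document's current baseline.
const BASELINE_MODEL = "gemini-embedding-001";
const BASELINE_DIMENSION = 768;

// Hypothetical guard: throws if a vector does not match the baseline dimension
// or contains non-finite values; returns the vector unchanged otherwise.
function assertBaselineDimension(vector: number[]): number[] {
  if (vector.length !== BASELINE_DIMENSION) {
    throw new Error(
      `${BASELINE_MODEL} baseline expects ${BASELINE_DIMENSION} dims, got ${vector.length}`,
    );
  }
  if (!vector.every((value) => Number.isFinite(value))) {
    throw new Error("embedding contains non-finite values");
  }
  return vector;
}
```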
Current Internal / Gated / Infra Distinctions
| Topic | Real State | Primary Reference |
|---|---|---|
| Direct text ingest | Live | analytics-platform/src/document/document.controller.ts, analytics-platform/src/document/document.service.ts |
| Upload-based file processing | Live and verified | analytics-platform/src/document/document.service.ts, docs/architecture/FEATURE_FLAGS.md |
| OCR | Live and verified | analytics-platform/src/document/processors/ocr.processor.ts, docs/architecture/FEATURE_FLAGS.md |
| Google Docs / Drive auth boundary | Read-only, service-account/shared-access only | analytics-platform/src/modules/external/google-docs-drive/google-docs-drive.service.ts, docs/operations/GOOGLE_DOCS_DRIVE_READONLY.md |
| Embedding baseline | Gemini text-only, 768-dim | analytics-platform/src/ai/embedding/gemini-embedding.adapter.ts, analytics-platform/src/migrations/1772391946000-AlignVectorDimensionsTo768.ts |
| Gemini Embedding 2 | Adoption lane complete; full rerun pending as verification | docs/architecture/adr/ADR-0029-openclaw-operator-trust-and-deferred-integration-boundaries.md, docs/agent_ops/specs/task_gemini_embedding2_live_eval_stage0_spec.md |
| Redis | Installed dependency, not active runtime backend in core services | analytics-platform/package.json, analytics-platform/src/resilience/rate-limiter.service.ts |
Post-Wave Position
The original six-lane data-surface wave is now largely consumed:
| Lane | Outcome | Notes |
|---|---|---|
| Lane 1 — PostgreSQL | COMPLETE + deployed | First-class provider is live in shipped source |
| Lane 2 — Google Docs / Drive RO | COMPLETE + deployed | Read-only connector with explicit service-account/shared-access auth boundaries |
| Lane 3 — JSON + HTML | COMPLETE + deployed | First-class processors in the current upload path |
| Lane 4 — PPTX | COMPLETE + deployed | .pptx only; shipped via the verified upload path |
| Lane 5 — MySQL | COMPLETE + deployed | First-class provider is live in shipped source |
| Lane 6 — Gemini Embedding 2 | COMPLETE + verified (migration lane) | Current verified 001 baseline remains live until the corpus-refresh rerun completes |
There is no automatic next-wave connector plan carried forward from the original March 12 kickoff packets. Any new connector/document-expansion work now requires a fresh Stage 0 and explicit GO.
Remaining Opportunities (Not Automatically Active)
- Google Slides read-only ingestion, if a real product/use-case need appears
- Structured CSV parsing, if raw-text CSV stops being sufficient
- SQL Server / Oracle / SQLite connector expansion
- External search/vector/graph/time-series connector work
- Any new embedding-family research beyond the current verified baseline
These are possible future slices, not active roadmap commitments.
Bottom Line
Jorvis now has:
- a strong warehouse SQL path,
- first-class PostgreSQL and MySQL support,
- real Google Sheets plus Google Docs / Drive read-only ingestion,
- first-class JSON / HTML / PPTX / PDF / DOCX / XLSX processors within the verified upload path,
- a mature voice stack,
- and a current verified embedding baseline that remains gemini-embedding-001 until the rerun completes.
The most important truths new agents should carry forward are:
- the original six-lane expansion wave is already consumed,
- Lane 6 migration work is complete, but the corpus-refresh rerun is still the next validation step,
- future connector or ingestion expansion needs a fresh Stage 0,
- upload-based file support is now part of the verified current product surface.