
Architecture

DataSpoc’s three products --- Pipe, Lens, and ML --- never import each other’s code. They communicate exclusively through Parquet files in a shared bucket following a strict directory convention.

The bucket structure is the interface between all products. This is the sacred contract --- it must never change without versioning.

```
bucket/
  .dataspoc/
    manifest.json                     # Catalog (Pipe writes, Lens reads)
    state/<pipeline>/state.json       # Incremental bookmarks (Pipe only)
    logs/<pipeline>/<timestamp>.json  # Execution logs (Pipe only)
  raw/<source>/<table>/               # Raw data (Pipe writes)
    dt=YYYY-MM-DD/
      *.parquet
  curated/<domain>/<table>/           # Cleaned data (Pipe transforms)
    dt=YYYY-MM-DD/
      *.parquet
  gold/<domain>/<table>/              # Analyst aggregations (Lens transforms)
    *.parquet
  ml/models/<model>/                  # ML artifacts (ML writes, Lens reads)
    model.pkl
    features.json
    metrics.json
  ml/predictions/<model>/             # ML predictions (ML writes, Lens reads)
    *.parquet
```

The manifest is the catalog of everything in the bucket. Pipe writes it on every run. Lens reads it to discover tables, schemas, and partitions. ML reads it to find training data.
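As a minimal sketch of how a consumer might use the manifest without importing Pipe, the snippet below writes and reads a hypothetical `manifest.json`. The field names (`version`, `tables`, `domain`, `path`, and so on) are illustrative assumptions; the actual schema is whatever Pipe emits.

```python
import json
from pathlib import Path

# Hypothetical manifest shape -- the real schema is defined by Pipe.
manifest = {
    "version": 1,
    "tables": [
        {
            "layer": "curated",
            "domain": "sales",
            "name": "orders",
            "path": "curated/sales/orders/",
            "partitions": ["dt=2024-01-01"],
            "schema": {"order_id": "int64", "amount": "double"},
        }
    ],
}

path = Path("bucket/.dataspoc/manifest.json")
path.parent.mkdir(parents=True, exist_ok=True)
path.write_text(json.dumps(manifest, indent=2))

# A reader (Lens or ML) discovers tables by reading the file,
# never by calling into Pipe's code:
catalog = json.loads(path.read_text())
tables = {f"{t['domain']}.{t['name']}": t["path"] for t in catalog["tables"]}
```

Because discovery happens through the file alone, any product (or an external tool) can enumerate tables with nothing but a JSON parser.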

| Layer          | Who writes      | Who reads | Tool                       |
|----------------|-----------------|-----------|----------------------------|
| Raw            | Data Engineers  | Data Engineers | Pipe (ingest)         |
| Curated        | Data Engineers  | Analysts  | Pipe (transform)           |
| Gold           | Analysts        | Everyone  | Lens (SQL transforms)      |
| ML Models      | Data Scientists | DS, API   | ML (train)                 |
| ML Predictions | Data Scientists | Analysts  | ML (predict), Lens (query) |

Pipe writes raw data exactly as extracted from the source. One directory per source, one subdirectory per table, partitioned by date.
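The raw-layer path convention can be captured in a small helper; the `raw_partition` function name and signature are hypothetical, but the directory layout follows the bucket structure above.

```python
from datetime import date
from pathlib import Path

def raw_partition(bucket: Path, source: str, table: str, run_date: date) -> Path:
    """Build a raw-layer partition directory: raw/<source>/<table>/dt=YYYY-MM-DD/."""
    return bucket / "raw" / source / table / f"dt={run_date.isoformat()}"

# e.g. bucket/raw/crm/customers/dt=2024-01-15
p = raw_partition(Path("bucket"), "crm", "customers", date(2024, 1, 15))
```

Keeping path construction in one place means every writer produces partitions that every reader can glob predictably.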

Pipe transforms clean and normalize raw data: deduplication, type casting, null handling. The curated layer is organized by business domain rather than by source system.
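The three curation steps can be sketched over plain dicts (a real Pipe transform would operate on Parquet via a dataframe library; the `curate` function and the business key `order_id` are illustrative assumptions):

```python
def curate(rows: list[dict]) -> list[dict]:
    """Deduplicate on a business key, cast types, and handle nulls."""
    seen: set = set()
    out: list[dict] = []
    for row in rows:
        key = row["order_id"]          # deduplication on the business key
        if key in seen:
            continue
        seen.add(key)
        out.append({
            "order_id": int(key),                   # type casting
            "amount": float(row["amount"] or 0.0),  # null handling
        })
    return out

raw_rows = [
    {"order_id": "1", "amount": "19.90"},
    {"order_id": "1", "amount": "19.90"},   # duplicate row
    {"order_id": "2", "amount": None},      # null amount
]
curated = curate(raw_rows)
```

The output is keyed and typed consistently regardless of which source system produced the raw rows, which is what lets curated tables be organized by domain instead of by source.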

Lens transforms create analyst-ready aggregations. Revenue summaries, KPI tables, dashboard feeds. Readable by everyone in the organization.

ML reads from curated or gold layers, trains models, and writes artifacts and predictions back to the bucket. Lens can then query predictions like any other table.
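A sketch of the ML write path, following the bucket layout above. The file names (`metrics.json`, `features.json`, `model.pkl`) come from the directory convention; the model name `churn` and the metric keys are made up for illustration.

```python
import json
from pathlib import Path

# Artifacts land under ml/models/<model>/ per the bucket convention.
model_dir = Path("bucket/ml/models/churn")
model_dir.mkdir(parents=True, exist_ok=True)

(model_dir / "metrics.json").write_text(json.dumps({"auc": 0.91}))
(model_dir / "features.json").write_text(json.dumps(["tenure", "spend"]))
# model.pkl would be serialized here after training.

# Predictions land under ml/predictions/<model>/ as *.parquet,
# which is what lets Lens query them like any other table.
pred_dir = Path("bucket/ml/predictions/churn")
pred_dir.mkdir(parents=True, exist_ok=True)
```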

DataSpoc never implements authentication. Access control is handled entirely by cloud IAM.

| Bucket               | Access       | Users                 |
|----------------------|--------------|-----------------------|
| s3://company-bronze  | DE only      | Data Engineers        |
| s3://company-finance | Finance team | Analysts, CFO         |
| s3://company-hr      | HR team      | HR Analysts           |
| s3://company-product | Product team | PMs, Product Analysts |

One bucket per permission boundary. Your cloud provider enforces who can read and write.

These rules ensure the products stay decoupled and the bucket contract remains the only integration point:

  1. Pipe never imports Lens or ML code --- communication is via bucket only
  2. Lens never imports Pipe or ML code --- reads bucket, calls ML via subprocess
  3. ML never imports Pipe or Lens code --- reads and writes Parquet in bucket
  4. Platform never imports Pipe, Lens, or ML code --- calls ML via subprocess
  5. All CLI messages in English
  6. All repos use Python 3.10+, Typer, Pydantic, pytest, uv
  7. No secrets in any repo --- environment variables or cloud IAM only
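Rules 2 and 4 say Lens and Platform invoke ML via subprocess rather than importing it. A hedged sketch of that boundary is below; the `ml predict churn` command line is an assumption about the CLI, so a stand-in command is used to keep the example runnable.

```python
import subprocess

# Lens triggering a prediction run without importing any ML code.
# The real invocation would be something like ["ml", "predict", "churn"];
# a Python one-liner stands in for it here.
result = subprocess.run(
    ["python", "-c", "print('predict churn')"],
    capture_output=True,
    text=True,
    check=True,  # surface non-zero exit codes as exceptions
)
```

Because the only coupling is a process boundary plus Parquet files in the bucket, either side can be rewritten or redeployed independently.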