
Architecture

DataSpoc’s three products --- Pipe, Lens, and ML --- never import each other’s code. They communicate exclusively through Parquet files in a shared bucket following a strict directory convention.

The bucket structure is the interface between all products. This is the sacred contract --- it must never change without versioning.

```
bucket/
  .dataspoc/
    manifest.json                     # Catalog (Pipe writes, Lens reads)
    state/<pipeline>/state.json       # Incremental bookmarks (Pipe only)
    logs/<pipeline>/<timestamp>.json  # Execution logs (Pipe only)
  raw/<source>/<table>/               # Raw data (Pipe writes)
    dt=YYYY-MM-DD/
      *.parquet
  curated/<domain>/<table>/           # Cleaned data (Pipe transforms)
    dt=YYYY-MM-DD/
      *.parquet
  gold/<domain>/<table>/              # Analyst aggregations (Lens transforms)
    *.parquet
  ml/models/<model>/                  # ML artifacts (ML writes, Lens reads)
    model.pkl
    features.json
    metrics.json
  ml/predictions/<model>/             # ML predictions (ML writes, Lens reads)
    *.parquet
```

The manifest is the catalog of everything in the bucket. Pipe writes it on every run. Lens reads it to discover tables, schemas, and partitions. ML reads it to find training data.
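As a minimal sketch of how a consumer might use the manifest without importing Pipe, the snippet below writes and reads a hypothetical `manifest.json`. The field names (`version`, `tables`, `domain`, `path`, and so on) are illustrative assumptions; the actual schema is whatever Pipe emits.

```python
import json
from pathlib import Path

# Hypothetical manifest shape -- the real schema is defined by Pipe.
manifest = {
    "version": 1,
    "tables": [
        {
            "layer": "curated",
            "domain": "sales",
            "name": "orders",
            "path": "curated/sales/orders/",
            "partitions": ["dt=2024-01-01"],
            "schema": {"order_id": "int64", "amount": "double"},
        }
    ],
}

path = Path("bucket/.dataspoc/manifest.json")
path.parent.mkdir(parents=True, exist_ok=True)
path.write_text(json.dumps(manifest, indent=2))

# A reader (Lens or ML) discovers tables by reading the file,
# never by calling into Pipe's code:
catalog = json.loads(path.read_text())
tables = {f"{t['domain']}.{t['name']}": t["path"] for t in catalog["tables"]}
```

Because discovery happens through the file alone, any product (or an external tool) can enumerate tables with nothing but a JSON parser.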

| Layer          | Who writes      | Who reads | Tool                       |
|----------------|-----------------|-----------|----------------------------|
| Raw            | Data Engineers  | Data Engineers | Pipe (ingest)         |
| Curated        | Data Engineers  | Analysts  | Pipe (transform)           |
| Gold           | Analysts        | Everyone  | Lens (SQL transforms)      |
| ML Models      | Data Scientists | DS, API   | ML (train)                 |
| ML Predictions | Data Scientists | Analysts  | ML (predict), Lens (query) |

Pipe writes raw data exactly as extracted from the source. One directory per source, one subdirectory per table, partitioned by date.
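The raw-layer path convention can be captured in a small helper; the `raw_partition` function name and signature are hypothetical, but the directory layout follows the bucket structure above.

```python
from datetime import date
from pathlib import Path

def raw_partition(bucket: Path, source: str, table: str, run_date: date) -> Path:
    """Build a raw-layer partition directory: raw/<source>/<table>/dt=YYYY-MM-DD/."""
    return bucket / "raw" / source / table / f"dt={run_date.isoformat()}"

# e.g. bucket/raw/crm/customers/dt=2024-01-15
p = raw_partition(Path("bucket"), "crm", "customers", date(2024, 1, 15))
```

Keeping path construction in one place means every writer produces partitions that every reader can glob predictably.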

Pipe transforms clean and normalize raw data: deduplication, type casting, null handling. The curated layer is organized by business domain rather than by source system.
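The three curation steps can be sketched over plain dicts (a real Pipe transform would operate on Parquet via a dataframe library; the `curate` function and the business key `order_id` are illustrative assumptions):

```python
def curate(rows: list[dict]) -> list[dict]:
    """Deduplicate on a business key, cast types, and handle nulls."""
    seen: set = set()
    out: list[dict] = []
    for row in rows:
        key = row["order_id"]          # deduplication on the business key
        if key in seen:
            continue
        seen.add(key)
        out.append({
            "order_id": int(key),                   # type casting
            "amount": float(row["amount"] or 0.0),  # null handling
        })
    return out

raw_rows = [
    {"order_id": "1", "amount": "19.90"},
    {"order_id": "1", "amount": "19.90"},   # duplicate row
    {"order_id": "2", "amount": None},      # null amount
]
curated = curate(raw_rows)
```

The output is keyed and typed consistently regardless of which source system produced the raw rows, which is what lets curated tables be organized by domain instead of by source.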

Lens transforms create analyst-ready aggregations. Revenue summaries, KPI tables, dashboard feeds. Readable by everyone in the organization.

ML reads from curated or gold layers, trains models, and writes artifacts and predictions back to the bucket. Lens can then query predictions like any other table.
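A sketch of the ML write path, following the bucket layout above. The file names (`metrics.json`, `features.json`, `model.pkl`) come from the directory convention; the model name `churn` and the metric keys are made up for illustration.

```python
import json
from pathlib import Path

# Artifacts land under ml/models/<model>/ per the bucket convention.
model_dir = Path("bucket/ml/models/churn")
model_dir.mkdir(parents=True, exist_ok=True)

(model_dir / "metrics.json").write_text(json.dumps({"auc": 0.91}))
(model_dir / "features.json").write_text(json.dumps(["tenure", "spend"]))
# model.pkl would be serialized here after training.

# Predictions land under ml/predictions/<model>/ as *.parquet,
# which is what lets Lens query them like any other table.
pred_dir = Path("bucket/ml/predictions/churn")
pred_dir.mkdir(parents=True, exist_ok=True)
```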

DataSpoc never implements authentication. Access control is handled entirely by cloud IAM.

| Bucket               | Access       | Users                 |
|----------------------|--------------|-----------------------|
| s3://company-bronze  | DE only      | Data Engineers        |
| s3://company-finance | Finance team | Analysts, CFO         |
| s3://company-hr      | HR team      | HR Analysts           |
| s3://company-product | Product team | PMs, Product Analysts |

One bucket per permission boundary. Your cloud provider enforces who can read and write.

These rules ensure the products stay decoupled and the bucket contract remains the only integration point:

  1. Pipe never imports Lens or ML code --- communication is via bucket only
  2. Lens never imports Pipe or ML code --- reads bucket, calls ML via subprocess
  3. ML never imports Pipe or Lens code --- reads and writes Parquet in bucket
  4. Platform never imports Pipe, Lens, or ML code --- calls ML via subprocess
  5. All CLI messages in English
  6. All repos use Python 3.10+, Typer, Pydantic, pytest, uv
  7. No secrets in any repo --- environment variables or cloud IAM only
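Rules 2 and 4 say Lens and Platform invoke ML via subprocess rather than importing it. A hedged sketch of that boundary is below; the `ml predict churn` command line is an assumption about the CLI, so a stand-in command is used to keep the example runnable.

```python
import subprocess

# Lens triggering a prediction run without importing any ML code.
# The real invocation would be something like ["ml", "predict", "churn"];
# a Python one-liner stands in for it here.
result = subprocess.run(
    ["python", "-c", "print('predict churn')"],
    capture_output=True,
    text=True,
    check=True,  # surface non-zero exit codes as exceptions
)
```

Because the only coupling is a process boundary plus Parquet files in the bucket, either side can be rewritten or redeployed independently.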