DataSpoc Platform

DataSpoc is a data platform built for both humans and AI agents. Its three CLI tools, connected only by Parquet files in your cloud bucket, turn your data sources into a queryable data lake.

Three Products, One Platform

Pipe --- Ingestion (Open-Source)

Pipe connects to 400+ data sources and writes Parquet files to your bucket. It handles incremental extraction, schema detection, and partitioning out of the box.

Apache 2.0 license
github.com/dataspoclab/dataspoc-pipe
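
To make "incremental extraction" and "partitioning" concrete, here is a minimal sketch of the general idea in plain Python. It is not Pipe's implementation: the rows, the `updated_at` bookmark column, the `orders` table name, and the `dt=` directory layout are all assumptions chosen for illustration.

```python
from datetime import datetime

# Hypothetical source rows; a real source would be a database or an API.
ROWS = [
    {"id": 1, "updated_at": "2024-05-01T09:00:00"},
    {"id": 2, "updated_at": "2024-05-01T17:30:00"},
    {"id": 3, "updated_at": "2024-05-02T08:15:00"},
]


def extract_incremental(rows, bookmark):
    """Return rows newer than the bookmark, plus the advanced bookmark."""
    # ISO 8601 timestamps in one fixed format compare correctly as strings.
    new_rows = [r for r in rows if bookmark is None or r["updated_at"] > bookmark]
    new_bookmark = max((r["updated_at"] for r in new_rows), default=bookmark)
    return new_rows, new_bookmark


def partition_path(table, row):
    """Build a Hive-style partition directory from the row's timestamp."""
    day = datetime.fromisoformat(row["updated_at"]).date()
    return f"{table}/dt={day.isoformat()}"


# First run: no bookmark yet, so every row is extracted.
first_batch, bookmark = extract_incremental(ROWS, None)

# Second run: nothing changed at the source, so nothing is extracted.
second_batch, bookmark = extract_incremental(ROWS, bookmark)

# Distinct partition directories the first batch would be written under.
paths = sorted({partition_path("orders", r) for r in first_batch})
```

The bookmark is what keeps repeated runs cheap: each run only has to move rows that changed since the previous one, and date-based directories let a query engine skip files outside the requested range.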

Lens --- Query (Open-Source)

Lens mounts your bucket as a SQL database. Query with SQL, explore in Jupyter or Marimo notebooks, or ask questions in natural language with AI.

Apache 2.0 license
github.com/dataspoclab/dataspoc-lens

ML --- AutoML (Commercial)

ML reads Parquet from the bucket, runs automated feature engineering, trains models, and writes predictions back as Parquet for Lens to query.

How They Connect

Source ──► [Pipe] ──► Parquet in Bucket ──► [Lens] ──► SQL / Jupyter / AI
                                              │
                                           [ML] ──► train / predict
                                              │
                                           [MCP] ──► Claude / Cursor / Windsurf

All communication between products happens through Parquet files in a shared bucket. Pipe writes, Lens reads, ML reads and writes. No product imports code from another.
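
The file-only contract described above can be sketched as two functions that share nothing except a location in the bucket. This is an illustration under stated assumptions, not DataSpoc code: a temporary directory stands in for the cloud bucket, JSON Lines stands in for Parquet so the sketch runs with only the standard library, and the `raw/orders` path and both function names are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def producer(bucket: Path) -> None:
    """Stand-in for the writing side: knows only where to put files."""
    out = bucket / "raw" / "orders"
    out.mkdir(parents=True, exist_ok=True)
    rows = [{"id": 1, "amount": 20.0}, {"id": 2, "amount": 35.5}]
    with open(out / "part-0000.jsonl", "w") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")


def consumer(bucket: Path) -> float:
    """Stand-in for the reading side: knows only where to look for files."""
    total = 0.0
    for path in sorted((bucket / "raw" / "orders").glob("*.jsonl")):
        with open(path) as f:
            total += sum(json.loads(line)["amount"] for line in f)
    return total


# The temporary directory plays the role of the shared bucket.
with tempfile.TemporaryDirectory() as tmp:
    bucket = Path(tmp)
    producer(bucket)          # writer runs first and exits
    total = consumer(bucket)  # reader runs later, with no reference to the writer
```

Because neither function calls the other, either side can be replaced, upgraded, or run on a different machine as long as the files keep the same location and schema.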

Key Metrics

Metric	Value
Data sources supported	400+
Time to first query	15 minutes
Cost to start	$0

Three Ways to Use It

Terminal --- Use dataspoc-pipe run and dataspoc-lens shell from any command line
Python --- Import LensClient or PipeClient in your scripts and agents
MCP for AI agents --- Connect Claude Desktop, Claude Code, Cursor, or Windsurf directly to your data lake

GitHub

dataspoc-pipe --- Ingestion CLI
dataspoc-lens --- Query CLI