Open Source — Apache 2.0

The data platform for humans and AI agents.

Every data team starts the same way: 3 months setting up Airflow, dbt, and a warehouse before anyone runs a query. DataSpoc is the shortcut. Three CLI tools. One pip install. Your data stays in your bucket. Your AI agent queries it via MCP.

pip install dataspoc-pipe dataspoc-lens

An AI agent for every role.

DataSpoc ships with AGENT.md — a skill file that teaches AI agents how to use your data platform. Drop it into Claude, Cursor, or any MCP client and watch your team accelerate.

DE Agent

Data Engineer Agent

Ingests data from any source. Monitors pipelines. Detects failures and retries. Adds new sources when you ask. Your always-on data engineer that never takes PTO.

# Agent reads AGENT.md, connects via MCP
"Add our Stripe API as a source
 and schedule it every 6 hours"
→ dataspoc-pipe add stripe
→ dataspoc-pipe run stripe
→ dataspoc-pipe schedule install
MCP SDK AGENT.md
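
The "detects failures and retries" behavior can be pictured as a small wrapper around the `dataspoc-pipe run stripe` command shown above. This is only a sketch under stated assumptions, not DataSpoc's own retry logic: `run_with_retries`, its parameters, and the exponential backoff policy are hypothetical names and choices made for illustration, and the only thing taken from the page is the command itself.

```python
import subprocess
import time
from typing import Callable, Optional, Sequence


def _run(command: Sequence[str]) -> int:
    """Run the command and return its exit code."""
    return subprocess.run(list(command)).returncode


def run_with_retries(
    command: Sequence[str],
    attempts: int = 3,
    base_delay: float = 2.0,
    runner: Optional[Callable[[Sequence[str]], int]] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Retry a command until it exits with 0.

    Returns the number of attempts used. Raises RuntimeError when every
    attempt fails. The backoff doubles after each failure (assumed policy).
    """
    runner = runner or _run
    for attempt in range(1, attempts + 1):
        if runner(command) == 0:
            return attempt
        if attempt < attempts:
            # Wait base_delay, then 2x, then 4x, ... between attempts.
            sleep(base_delay * 2 ** (attempt - 1))
    raise RuntimeError(f"{' '.join(command)} failed after {attempts} attempts")


# Hypothetical usage, requires dataspoc-pipe on PATH:
# run_with_retries(["dataspoc-pipe", "run", "stripe"])
```

The `runner` and `sleep` parameters exist so the wrapper can be exercised without the CLI installed or any real waiting.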

DA Agent

Data Analyst Agent

Explores your data lake. Answers business questions in plain English. Builds reports. Refreshes cache before querying. Your analyst that works at 3am without complaining.

# Agent reads AGENT.md, connects via MCP
"Which customers are at risk of
 churning? Export the list as CSV"
→ cache_refresh_stale()
→ ask("customers with churn risk")
→ query("SELECT ...") → export
MCP SDK AGENT.md

ML Agent

ML Engineer Agent

Trains models on your lake data. Generates predictions. Explains results. Monitors drift. Your ML engineer that turns "can we predict X?" into a model in minutes.

# Agent reads AGENT.md, connects via MCP
"Train a churn model on our
 customer data and explain it"
→ ml train --target churn --from customers
→ ml explain --model churn
→ ml predict --model churn --from new
MCP SDK AGENT.md

AGENT.md + MCP + SDK

Every DataSpoc repo ships with an AGENT.md — a skill file that documents every function, pattern, and constraint. AI agents read it and know exactly what to do. No custom integration code. No prompt engineering. Just drop the file and go.

Sound familiar?

These are the stories we hear every week from data teams.

"2 months just to move CSVs"

You spent 2 months setting up Airflow, debugging Docker containers, and writing DAGs — just to move CSV files to S3. The business still has no dashboard.

"The warehouse costs more than the insights"

Your data warehouse bill hit $4k/month. The CFO asks what it produces. You look at the dashboards. Three people use them.

"Every AI tool needs a custom wrapper"

You want Claude to query your data. So you build a custom API, a vector store, a retrieval pipeline... just to answer "what were last month's sales?"

"Analysts wait days for a query"

Your analyst has a question. They file a ticket. The data engineer writes a query. Three days later, the answer is "42." The moment has passed.

What if your data platform were just a pip install?

The old way is expensive, slow, and fragile. There is a simpler path.

BEFORE

Airflow

+ dbt

+ Snowflake

+ Looker

+ custom API for AI agents

6 months + $50k/year

AFTER

pip install dataspoc-pipe

pip install dataspoc-lens

Ingest, query, AI — done.

15 minutes + $0

How it works

Three steps. No infrastructure to provision, no accounts to create, no YAML to debug.

1. Pipe ingests

Connect any source. Data lands as Parquet in your bucket.

$ dataspoc-pipe add my-postgres
$ dataspoc-pipe run my-postgres
# → Parquet files in s3://bucket/raw/

2. Lens queries

Ask questions in SQL or plain English. Instant results.

$ dataspoc-lens ask "top 10 customers by revenue"
# → SQL generated, results displayed

3. Agents connect

One command turns your data lake into an MCP server for AI.

$ dataspoc-lens mcp
# → Claude, Cursor, any agent queries your data
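
To make the "agents connect" step concrete: several MCP clients, Claude Desktop among them, register local servers through a JSON file with a top-level `mcpServers` key. The sketch below builds such an entry for the `dataspoc-lens mcp` command. It assumes the server talks over stdio and needs no extra arguments or environment variables, neither of which this page confirms. The entry name and the `mcp_client_config` helper are hypothetical, and the file's location depends on the client.

```python
import json


def mcp_client_config(server_name: str = "dataspoc-lens") -> dict:
    """Build an MCP client config entry that launches `dataspoc-lens mcp`.

    Assumption: the server uses the stdio transport, so the client only
    needs the command and its arguments.
    """
    return {
        "mcpServers": {
            server_name: {
                "command": "dataspoc-lens",
                "args": ["mcp"],
            }
        }
    }


# Print the JSON so it can be merged into the client's config file.
print(json.dumps(mcp_client_config(), indent=2))
```

If the CLI lives in a virtual environment, the `command` value would likely need to be the full path to the executable, since desktop clients do not always inherit the shell's PATH.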

Three tools. One bucket.

Each tool does one job well. They connect through Parquet files in your cloud storage.

Pipe

Data Ingestion

"When I need data from a source, I want it in my bucket as Parquet — without managing infrastructure."

400+ Singer sources. Streaming and incremental. Auto-catalog. S3, GCS, Azure.

$ pip install dataspoc-pipe

Lens

Data Query Engine

"When I have a question about my data, I want to ask it in SQL or plain English — without spinning up a warehouse."

DuckDB-powered. Interactive shell, Jupyter, Marimo. AI queries via natural language. MCP server.

$ pip install dataspoc-lens

ML

AutoML

"When I need predictions, I want to train a model on my lake data — without being a data scientist."

Automated feature engineering, model selection, training, and prediction on Parquet data.

$ dataspoc-lens ml train

Built for your team

From the data engineer who builds pipelines to the CTO who signs off on the budget.

Data Engineer

Stop writing Airflow DAGs

One command to ingest from any source. No containers, no schedulers, no YAML. Just dataspoc-pipe run.

Data Analyst

Ask questions in English

Type your question. Get SQL + results. No ticket, no waiting, no context switching. Just dataspoc-lens ask.

Platform Team

One tool for humans and AI

Same CLI, same data, for analysts and AI agents. MCP-native. No infrastructure to manage, no API layer to build.

Founder / CTO

Data platform in 15 minutes

$0 to start. Open source. No vendor lock-in. Your data stays in your bucket. Scale when ready.

400+

Singer data sources

DuckDB

Powered query engine

Apache 2.0

Open source license

MCP

Native for AI agents

PyPI

pip install & go

Start in 5 minutes.
Not 5 months.

Four commands. That is it. Your data goes from source to queryable lake — for humans and AI agents — in the time it takes to make coffee.

$ pip install dataspoc-pipe dataspoc-lens
$ dataspoc-pipe add my-postgres
$ dataspoc-pipe run my-postgres
$ dataspoc-lens ask "top customers by revenue"
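
For teams that prefer to script the quickstart, the three commands that follow the pip install can be driven from Python. This is a minimal sketch, not an official DataSpoc script: the helper names (`plan`, `missing_tools`, `run_all`) are hypothetical, and it assumes the commands can run non-interactively, which may not hold if `dataspoc-pipe add` prompts for connection details. It defaults to a dry run that only prints what would be executed.

```python
import shlex
import shutil
import subprocess

# Command lines copied from the quickstart above (pip install excluded).
QUICKSTART = [
    "dataspoc-pipe add my-postgres",
    "dataspoc-pipe run my-postgres",
    'dataspoc-lens ask "top customers by revenue"',
]


def plan(commands=QUICKSTART):
    """Split each command line into an argv list, the way a shell would."""
    return [shlex.split(line) for line in commands]


def missing_tools(argvs):
    """Return the sorted names of executables that are not on PATH."""
    return sorted({argv[0] for argv in argvs if shutil.which(argv[0]) is None})


def run_all(argvs, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for argv in argvs:
        print("$", shlex.join(argv))
        if not dry_run:
            # check=True stops at the first failing step.
            subprocess.run(argv, check=True)


run_all(plan())  # dry run: prints the commands without executing them
```

Calling `missing_tools(plan())` before `run_all(plan(), dry_run=False)` gives a quick check that both CLIs are installed.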