startup · data-lake · cost-reduction · tutorial · getting-started

The $0 Data Lake for Startups: DataSpoc + S3 in 30 Minutes

Michael San Martim · 2026-04-28

You are the founding engineer at a seed-stage startup. Your CEO asks “how many users signed up last week?” and you SSH into the production server, open psql, write a query, paste the result into Slack. Next week the question is “what is our MRR from Stripe?” and now you are juggling two data sources.

This post builds a complete data platform in 30 minutes for under $5/month. You will ingest from Postgres and Stripe, query with SQL and AI, and connect Claude as an agent. By the end, anyone on your team can ask questions about your data without bugging you.

What You Will Build

Postgres ──→ [Pipe] ──→ S3 Bucket ──→ [Lens] ──→ SQL Shell
Stripe   ──→ [Pipe] ──┘                  ├──→ AI Ask
                                         ├──→ Jupyter
                                         └──→ MCP → Claude
  • Pipe extracts from Postgres and Stripe, writes Parquet to S3
  • Lens queries the Parquet files with DuckDB
  • MCP connects Claude (or Cursor, Windsurf) to your data lake

Total cost: S3 storage (~$2/month for startup-scale data) + DataSpoc ($0, open source).

Minute 0-5: Install Everything

Terminal window
# Install both tools
pip install dataspoc-pipe dataspoc-lens
# Install the Singer taps you need
dataspoc-pipe add tap-postgres
dataspoc-pipe add tap-stripe
# Verify
dataspoc-pipe --version
dataspoc-lens --version

Create an S3 bucket for your data lake:

Terminal window
aws s3 mb s3://mycompany-data-lake --region us-east-1

Minute 5-12: Ingest from Postgres

Create the pipeline config:

pipelines/postgres-app.yaml
source:
  tap: tap-postgres
  config:
    host: "${DATABASE_HOST}"
    port: 5432
    database: "${DATABASE_NAME}"
    user: "${DATABASE_READONLY_USER}"
    password: "${DATABASE_READONLY_PASSWORD}"
  tables:
    - name: users
      replication_method: incremental
      replication_key: updated_at
    - name: orders
      replication_method: incremental
      replication_key: updated_at
    - name: subscriptions
      replication_method: incremental
      replication_key: updated_at
target:
  bucket: "s3://mycompany-data-lake"
  prefix: "raw/postgres"
  format: parquet
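
The incremental settings are what keep later runs fast: after the first full sync, each run only pulls rows whose replication_key has advanced past the bookmark saved from the previous run. Conceptually, one run boils down to the sketch below (the real tap tracks the bookmark in its Singer state; :last_bookmark is just a placeholder here):

-- Conceptual shape of one incremental run for the users table
-- :last_bookmark = the highest updated_at value seen on the previous run
-- (on the very first run there is no bookmark, so the tap does a full sync)
SELECT *
FROM users
WHERE updated_at > :last_bookmark
ORDER BY updated_at;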

Create a read-only database user (never use your app’s credentials for ETL):

-- Run once on your Postgres database
CREATE USER dataspoc_readonly WITH PASSWORD 'secure-random-password';
GRANT CONNECT ON DATABASE myapp TO dataspoc_readonly;
GRANT USAGE ON SCHEMA public TO dataspoc_readonly;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO dataspoc_readonly;
ALTER DEFAULT PRIVILEGES IN SCHEMA public GRANT SELECT ON TABLES TO dataspoc_readonly;

Set the environment variables and run:

Terminal window
export DATABASE_HOST="your-rds-endpoint.amazonaws.com"
export DATABASE_NAME="myapp"
export DATABASE_READONLY_USER="dataspoc_readonly"
export DATABASE_READONLY_PASSWORD="secure-random-password"
dataspoc-pipe run postgres-app
[09:05:01] Starting pipeline: postgres-app
[09:05:02] Extracting: users (2,340 rows)
[09:05:03] → raw/postgres/users/users_20260428.parquet (180 KB)
[09:05:03] Extracting: orders (8,920 rows)
[09:05:05] → raw/postgres/orders/orders_20260428.parquet (720 KB)
[09:05:05] Extracting: subscriptions (1,890 rows)
[09:05:06] → raw/postgres/subscriptions/subscriptions_20260428.parquet (95 KB)
[09:05:06] Pipeline complete: 3 tables, 13,150 rows

Minute 12-20: Ingest from Stripe

pipelines/stripe.yaml
source:
  tap: tap-stripe
  config:
    api_key: "${STRIPE_API_KEY}"
    # Use a restricted key with read-only access
    start_date: "2025-01-01"
  streams:
    - charges
    - customers
    - invoices
    - subscriptions
    - balance_transactions
target:
  bucket: "s3://mycompany-data-lake"
  prefix: "raw/stripe"
  format: parquet

Get a restricted API key from the Stripe Dashboard (Developers > API keys > Create restricted key, with read-only access):

Terminal window
export STRIPE_API_KEY="rk_live_..."
dataspoc-pipe run stripe
[09:12:01] Starting pipeline: stripe
[09:12:03] Extracting: charges (4,560 records)
[09:12:08] Extracting: customers (2,120 records)
[09:12:11] Extracting: invoices (3,890 records)
[09:12:15] Extracting: subscriptions (1,450 records)
[09:12:18] Extracting: balance_transactions (12,340 records)
[09:12:22] Pipeline complete: 5 streams, 24,360 records

Minute 20-25: Connect Lens and Query

Register the bucket with Lens:

Terminal window
dataspoc-lens add-bucket s3://mycompany-data-lake --name datalake
dataspoc-lens tables
raw.postgres.users
raw.postgres.orders
raw.postgres.subscriptions
raw.stripe.charges
raw.stripe.customers
raw.stripe.invoices
raw.stripe.subscriptions
raw.stripe.balance_transactions

Eight tables from two sources. Now query:

Terminal window
# How many users signed up last week?
dataspoc-lens query "
SELECT COUNT(*) AS signups
FROM raw.postgres.users
WHERE created_at >= CURRENT_DATE - INTERVAL 7 DAY
"
# What is our MRR?
dataspoc-lens query "
SELECT
SUM(plan_amount / 100.0) AS mrr_dollars
FROM raw.stripe.subscriptions
WHERE status = 'active'
"
# Revenue by day this month
dataspoc-lens query "
SELECT
DATE_TRUNC('day', created::TIMESTAMP) AS day,
SUM(amount / 100.0) AS revenue
FROM raw.stripe.charges
WHERE status = 'succeeded'
AND created::TIMESTAMP >= DATE_TRUNC('month', CURRENT_DATE)
GROUP BY 1
ORDER BY 1
"

Or use AI (set up a provider or use Ollama for free):

Terminal window
dataspoc-lens setup-ai # installs Ollama + local model
dataspoc-lens ask "what is our MRR?"
dataspoc-lens ask "how many users signed up each day this week?"
dataspoc-lens ask "what is the average revenue per user?"

Minute 25-30: Connect Claude via MCP

Add the MCP server config for Claude Desktop or any MCP-compatible tool:

{
  "mcpServers": {
    "dataspoc-lens": {
      "command": "dataspoc-lens",
      "args": ["mcp"],
      "env": {
        "AWS_PROFILE": "default"
      }
    }
  }
}

Now in Claude Desktop, type:

What tables do we have in the data lake?

Claude calls the MCP server, discovers your tables, and responds. Follow up with:

What is our MRR and how has it changed month over month?

Claude writes SQL, executes it against your Parquet files, and gives you a formatted answer with analysis. Your CEO can now ask Claude directly instead of pinging you on Slack.
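
If you want to sanity-check what the agent is doing, the same question is answerable from the shell. Here is a sketch of the kind of query involved, reusing the charges columns from the revenue query earlier (note this is a revenue roll-up, a proxy for the MRR trend, rather than strict subscription MRR):

Terminal window
# Month-over-month revenue and the change versus the previous month
dataspoc-lens query "
SELECT
  DATE_TRUNC('month', created::TIMESTAMP) AS month,
  SUM(amount / 100.0) AS revenue,
  SUM(amount / 100.0)
    - LAG(SUM(amount / 100.0)) OVER (ORDER BY DATE_TRUNC('month', created::TIMESTAMP)) AS change_vs_prior_month
FROM raw.stripe.charges
WHERE status = 'succeeded'
GROUP BY 1
ORDER BY 1
"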

Schedule It

Add a cron job to keep data fresh:

Terminal window
crontab -e
# Extract from Postgres every hour
0 * * * * cd /opt/dataspoc && dataspoc-pipe run postgres-app >> /var/log/dataspoc/postgres.log 2>&1
# Extract from Stripe every 6 hours (API rate limits)
0 */6 * * * cd /opt/dataspoc && dataspoc-pipe run stripe >> /var/log/dataspoc/stripe.log 2>&1
# Refresh Lens cache after extractions
5 * * * * dataspoc-lens cache refresh-stale >> /var/log/dataspoc/cache.log 2>&1

Cost Breakdown

Component            Monthly Cost   Notes
DataSpoc Pipe        $0             Open source
DataSpoc Lens        $0             Open source
S3 storage (1 GB)    $0.023         Parquet is highly compressed
S3 requests          ~$1-2          PUT/GET for extraction and queries
Ollama (local AI)    $0             Runs on your machine
Total                < $3/month

Compare that to alternatives:

Alternative                 Monthly Cost
Snowflake                   $400+ (minimum warehouse)
BigQuery                    $0 (free tier) then $5/TB queried
Fivetran + Snowflake        $500+
Airbyte Cloud + Redshift    $400+
DataSpoc + S3               < $5

What You Get

For under $5/month, your startup now has:

  • Automated data ingestion from Postgres and Stripe (hourly)
  • SQL analytics over cloud Parquet (no warehouse to manage)
  • AI-powered queries (free with Ollama, or use Claude/OpenAI)
  • MCP integration for Claude Desktop, Cursor, Windsurf
  • Incremental extraction (only new data is fetched)
  • No infrastructure to manage (S3 + CLI tools)

Growth Path: When to Scale

This setup handles a startup from seed through Series A comfortably. Here is when to consider upgrades:

Milestone            Action
Data > 10 GB         Add more S3 storage ($0.23/month per 10 GB)
Need transforms      Add SQL transform files (see our dbt alternative post)
5+ data sources      Add more Pipe pipelines (same pattern)
Need dashboards      Connect Metabase/Superset to DuckDB via Lens
Team > 5 analysts    Consider Lens cache on shared server
Data > 1 TB          Consider partitioning strategy, still works with DuckDB
Need real-time       That is when you look at Kafka + streaming, not yet

The key insight: you can always migrate to a heavier stack later. Parquet on S3 is the universal data format. If you outgrow DataSpoc, your data works with Spark, Trino, Athena, or Snowflake external tables.
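
Nothing about the files is DataSpoc-specific, either: any engine that reads Parquet from S3 can query them directly. As a quick sketch, here is the same users count with plain DuckDB (assuming a recent DuckDB build and AWS credentials discoverable through the standard credential chain):

Terminal window
# Read the raw Parquet with DuckDB directly, no DataSpoc in the loop
duckdb -c "
INSTALL httpfs; LOAD httpfs;
CREATE SECRET (TYPE s3, PROVIDER credential_chain);
SELECT COUNT(*) AS users
FROM read_parquet('s3://mycompany-data-lake/raw/postgres/users/*.parquet');
"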

Adding More Sources Later

DataSpoc Pipe supports 400+ Singer taps. Add any source in minutes:

Terminal window
# HubSpot CRM
dataspoc-pipe add tap-hubspot
# Google Analytics
dataspoc-pipe add tap-google-analytics
# Salesforce
dataspoc-pipe add tap-salesforce
# MySQL (if you have a second database)
dataspoc-pipe add tap-mysql
# Google Sheets (for manual data)
dataspoc-pipe add tap-google-sheets

Each source gets its own pipeline YAML and runs on its own schedule. All data lands in the same S3 bucket, queryable through the same Lens interface.
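
For example, a HubSpot pipeline would mirror the Stripe one above; only the tap, its config keys, and the S3 prefix change. A sketch (the exact config keys and stream names depend on the tap, so treat these as placeholders):

pipelines/hubspot.yaml
source:
  tap: tap-hubspot
  config:
    # Config keys vary by tap; check the tap's documentation for the exact names
    api_key: "${HUBSPOT_API_KEY}"
    start_date: "2025-01-01"
  streams:
    - companies
    - contacts
    - deals
target:
  bucket: "s3://mycompany-data-lake"
  prefix: "raw/hubspot"
  format: parquet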

The Complete Setup Script

Save this and run it to set up everything at once:

#!/bin/bash
# setup-data-lake.sh -- Complete startup data lake setup
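# Assumes pipelines/postgres-app.yaml and pipelines/stripe.yaml already exist
# and that the DATABASE_* and STRIPE_API_KEY environment variables are exported
# (see the earlier sections).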
set -e
echo "=== DataSpoc Data Lake Setup ==="
# Install tools
echo "[1/6] Installing DataSpoc..."
pip install dataspoc-pipe dataspoc-lens
# Install taps
echo "[2/6] Installing source connectors..."
dataspoc-pipe add tap-postgres
dataspoc-pipe add tap-stripe
# Create S3 bucket
echo "[3/6] Creating S3 bucket..."
aws s3 mb s3://mycompany-data-lake --region us-east-1
# Run initial extraction
echo "[4/6] Running initial extraction..."
dataspoc-pipe run postgres-app
dataspoc-pipe run stripe
# Configure Lens
echo "[5/6] Configuring query engine..."
dataspoc-lens add-bucket s3://mycompany-data-lake --name datalake
# Setup local AI
echo "[6/6] Setting up AI queries..."
dataspoc-lens setup-ai
echo ""
echo "=== Setup Complete ==="
echo "Tables available:"
dataspoc-lens tables
echo ""
echo "Try: dataspoc-lens ask 'how many users signed up this week?'"

Thirty minutes. Under $5/month. A complete data platform that grows with you. Stop SSHing into production to answer business questions.
