startup · data-lake · cost-reduction · tutorial · getting-started

The $0 Data Lake for Startups: DataSpoc + S3 in 30 Minutes

Michael San Martim · 2026-04-28

You are the founding engineer at a seed-stage startup. Your CEO asks “how many users signed up last week?” and you SSH into the production server, open psql, write a query, paste the result into Slack. Next week the question is “what is our MRR from Stripe?” and now you are juggling two data sources.

This post builds a complete data platform in 30 minutes for under $5/month. You will ingest from Postgres and Stripe, query with SQL and AI, and connect Claude as an agent. By the end, anyone on your team can ask questions about your data without bugging you.

What You Will Build

Postgres ──→ [Pipe] ──→ S3 Bucket ──→ [Lens] ──→ SQL Shell
Stripe   ──→ [Pipe] ──┘                  ├──→ AI Ask
                                         ├──→ Jupyter
                                         └──→ MCP → Claude
  • Pipe extracts from Postgres and Stripe, writes Parquet to S3
  • Lens queries the Parquet files with DuckDB
  • MCP connects Claude (or Cursor, Windsurf) to your data lake

Total cost: S3 storage (~$2/month for startup-scale data) + DataSpoc ($0, open source).

Minute 0-5: Install Everything

Terminal window
# Install both tools
pip install dataspoc-pipe dataspoc-lens
# Install the Singer taps you need
dataspoc-pipe add tap-postgres
dataspoc-pipe add tap-stripe
# Verify
dataspoc-pipe --version
dataspoc-lens --version

Create an S3 bucket for your data lake:

Terminal window
aws s3 mb s3://mycompany-data-lake --region us-east-1

Minute 5-12: Ingest from Postgres

Create the pipeline config:

pipelines/postgres-app.yaml
source:
  tap: tap-postgres
  config:
    host: "${DATABASE_HOST}"
    port: 5432
    database: "${DATABASE_NAME}"
    user: "${DATABASE_READONLY_USER}"
    password: "${DATABASE_READONLY_PASSWORD}"
  tables:
    - name: users
      replication_method: incremental
      replication_key: updated_at
    - name: orders
      replication_method: incremental
      replication_key: updated_at
    - name: subscriptions
      replication_method: incremental
      replication_key: updated_at
target:
  bucket: "s3://mycompany-data-lake"
  prefix: "raw/postgres"
  format: parquet
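
The incremental settings are what keep later runs fast: after the first full sync, each run only pulls rows whose replication_key has advanced past the bookmark saved from the previous run. Conceptually, one run boils down to the sketch below (the real tap tracks the bookmark in its Singer state; :last_bookmark is just a placeholder here):

-- Conceptual shape of one incremental run for the users table
-- :last_bookmark = the highest updated_at value seen on the previous run
-- (on the very first run there is no bookmark, so the tap does a full sync)
SELECT *
FROM users
WHERE updated_at > :last_bookmark
ORDER BY updated_at;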

Create a read-only database user (never use your app’s credentials for ETL):

-- Run once on your Postgres database
CREATE USER dataspoc_readonly WITH PASSWORD 'secure-random-password';
GRANT CONNECT ON DATABASE myapp TO dataspoc_readonly;
GRANT USAGE ON SCHEMA public TO dataspoc_readonly;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO dataspoc_readonly;
ALTER DEFAULT PRIVILEGES IN SCHEMA public GRANT SELECT ON TABLES TO dataspoc_readonly;

Set the environment variables and run:

Terminal window
export DATABASE_HOST="your-rds-endpoint.amazonaws.com"
export DATABASE_NAME="myapp"
export DATABASE_READONLY_USER="dataspoc_readonly"
export DATABASE_READONLY_PASSWORD="secure-random-password"
dataspoc-pipe run postgres-app
[09:05:01] Starting pipeline: postgres-app
[09:05:02] Extracting: users (2,340 rows)
[09:05:03] → raw/postgres/users/users_20260428.parquet (180 KB)
[09:05:03] Extracting: orders (8,920 rows)
[09:05:05] → raw/postgres/orders/orders_20260428.parquet (720 KB)
[09:05:05] Extracting: subscriptions (1,890 rows)
[09:05:06] → raw/postgres/subscriptions/subscriptions_20260428.parquet (95 KB)
[09:05:06] Pipeline complete: 3 tables, 13,150 rows

Minute 12-20: Ingest from Stripe

pipelines/stripe.yaml
source:
  tap: tap-stripe
  config:
    api_key: "${STRIPE_API_KEY}"
    # Use a restricted key with read-only access
    start_date: "2025-01-01"
  streams:
    - charges
    - customers
    - invoices
    - subscriptions
    - balance_transactions
target:
  bucket: "s3://mycompany-data-lake"
  prefix: "raw/stripe"
  format: parquet

Get a restricted API key from the Stripe Dashboard (Developers > API keys > Create restricted key, with read-only access):

Terminal window
export STRIPE_API_KEY="rk_live_..."
dataspoc-pipe run stripe
[09:12:01] Starting pipeline: stripe
[09:12:03] Extracting: charges (4,560 records)
[09:12:08] Extracting: customers (2,120 records)
[09:12:11] Extracting: invoices (3,890 records)
[09:12:15] Extracting: subscriptions (1,450 records)
[09:12:18] Extracting: balance_transactions (12,340 records)
[09:12:22] Pipeline complete: 5 streams, 24,360 records

Minute 20-25: Connect Lens and Query

Register the bucket with Lens:

Terminal window
dataspoc-lens add-bucket s3://mycompany-data-lake --name datalake
dataspoc-lens tables
raw.postgres.users
raw.postgres.orders
raw.postgres.subscriptions
raw.stripe.charges
raw.stripe.customers
raw.stripe.invoices
raw.stripe.subscriptions
raw.stripe.balance_transactions

Eight tables from two sources. Now query:

Terminal window
# How many users signed up last week?
dataspoc-lens query "
SELECT COUNT(*) AS signups
FROM raw.postgres.users
WHERE created_at >= CURRENT_DATE - INTERVAL 7 DAY
"
# What is our MRR?
dataspoc-lens query "
SELECT
SUM(plan_amount / 100.0) AS mrr_dollars
FROM raw.stripe.subscriptions
WHERE status = 'active'
"
# Revenue by day this month
dataspoc-lens query "
SELECT
DATE_TRUNC('day', created::TIMESTAMP) AS day,
SUM(amount / 100.0) AS revenue
FROM raw.stripe.charges
WHERE status = 'succeeded'
AND created::TIMESTAMP >= DATE_TRUNC('month', CURRENT_DATE)
GROUP BY 1
ORDER BY 1
"

Or use AI (set up a provider or use Ollama for free):

Terminal window
dataspoc-lens setup-ai # installs Ollama + local model
dataspoc-lens ask "what is our MRR?"
dataspoc-lens ask "how many users signed up each day this week?"
dataspoc-lens ask "what is the average revenue per user?"

Minute 25-30: Connect Claude via MCP

Add the MCP server config for Claude Desktop or any MCP-compatible tool:

{
  "mcpServers": {
    "dataspoc-lens": {
      "command": "dataspoc-lens",
      "args": ["mcp"],
      "env": {
        "AWS_PROFILE": "default"
      }
    }
  }
}

Now in Claude Desktop, type:

What tables do we have in the data lake?

Claude calls the MCP server, discovers your tables, and responds. Follow up with:

What is our MRR and how has it changed month over month?

Claude writes SQL, executes it against your Parquet files, and gives you a formatted answer with analysis. Your CEO can now ask Claude directly instead of pinging you on Slack.
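
If you want to sanity-check what the agent is doing, the same question is answerable from the shell. Here is a sketch of the kind of query involved, reusing the charges columns from the revenue query earlier (note this is a revenue roll-up, a proxy for the MRR trend, rather than strict subscription MRR):

Terminal window
# Month-over-month revenue and the change versus the previous month
dataspoc-lens query "
SELECT
  DATE_TRUNC('month', created::TIMESTAMP) AS month,
  SUM(amount / 100.0) AS revenue,
  SUM(amount / 100.0)
    - LAG(SUM(amount / 100.0)) OVER (ORDER BY DATE_TRUNC('month', created::TIMESTAMP)) AS change_vs_prior_month
FROM raw.stripe.charges
WHERE status = 'succeeded'
GROUP BY 1
ORDER BY 1
"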

Schedule It

Add a cron job to keep data fresh:

Terminal window
crontab -e
# Extract from Postgres every hour
0 * * * * cd /opt/dataspoc && dataspoc-pipe run postgres-app >> /var/log/dataspoc/postgres.log 2>&1
# Extract from Stripe every 6 hours (API rate limits)
0 */6 * * * cd /opt/dataspoc && dataspoc-pipe run stripe >> /var/log/dataspoc/stripe.log 2>&1
# Refresh Lens cache after extractions
5 * * * * dataspoc-lens cache refresh-stale >> /var/log/dataspoc/cache.log 2>&1

Cost Breakdown

Component            Monthly Cost   Notes
DataSpoc Pipe        $0             Open source
DataSpoc Lens        $0             Open source
S3 storage (1 GB)    $0.023         Parquet is highly compressed
S3 requests          ~$1-2          PUT/GET for extraction and queries
Ollama (local AI)    $0             Runs on your machine
Total                < $3/month

Compare that to alternatives:

Alternative                 Monthly Cost
Snowflake                   $400+ (minimum warehouse)
BigQuery                    $0 (free tier) then $5/TB queried
Fivetran + Snowflake        $500+
Airbyte Cloud + Redshift    $400+
DataSpoc + S3               < $5

What You Get

For under $5/month, your startup now has:

  • Automated data ingestion from Postgres and Stripe (hourly)
  • SQL analytics over cloud Parquet (no warehouse to manage)
  • AI-powered queries (free with Ollama, or use Claude/OpenAI)
  • MCP integration for Claude Desktop, Cursor, Windsurf
  • Incremental extraction (only new data is fetched)
  • No infrastructure to manage (S3 + CLI tools)

Growth Path: When to Scale

This setup handles a startup from seed through Series A comfortably. Here is when to consider upgrades:

Milestone            Action
Data > 10 GB         Add more S3 storage ($0.23/month per 10 GB)
Need transforms      Add SQL transform files (see our dbt alternative post)
5+ data sources      Add more Pipe pipelines (same pattern)
Need dashboards      Connect Metabase/Superset to DuckDB via Lens
Team > 5 analysts    Consider Lens cache on shared server
Data > 1 TB          Consider partitioning strategy, still works with DuckDB
Need real-time       That is when you look at Kafka + streaming, not yet

The key insight: you can always migrate to a heavier stack later. Parquet on S3 is the universal data format. If you outgrow DataSpoc, your data works with Spark, Trino, Athena, or Snowflake external tables.
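
Nothing about the files is DataSpoc-specific, either: any engine that reads Parquet from S3 can query them directly. As a quick sketch, here is the same users count with plain DuckDB (assuming a recent DuckDB build and AWS credentials discoverable through the standard credential chain):

Terminal window
# Read the raw Parquet with DuckDB directly, no DataSpoc in the loop
duckdb -c "
INSTALL httpfs; LOAD httpfs;
CREATE SECRET (TYPE s3, PROVIDER credential_chain);
SELECT COUNT(*) AS users
FROM read_parquet('s3://mycompany-data-lake/raw/postgres/users/*.parquet');
"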

Adding More Sources Later

DataSpoc Pipe supports 400+ Singer taps. Add any source in minutes:

Terminal window
# HubSpot CRM
dataspoc-pipe add tap-hubspot
# Google Analytics
dataspoc-pipe add tap-google-analytics
# Salesforce
dataspoc-pipe add tap-salesforce
# MySQL (if you have a second database)
dataspoc-pipe add tap-mysql
# Google Sheets (for manual data)
dataspoc-pipe add tap-google-sheets

Each source gets its own pipeline YAML and runs on its own schedule. All data lands in the same S3 bucket, queryable through the same Lens interface.
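
For example, a HubSpot pipeline would mirror the Stripe one above; only the tap, its config keys, and the S3 prefix change. A sketch (the exact config keys and stream names depend on the tap, so treat these as placeholders):

pipelines/hubspot.yaml
source:
  tap: tap-hubspot
  config:
    # Config keys vary by tap; check the tap's documentation for the exact names
    api_key: "${HUBSPOT_API_KEY}"
    start_date: "2025-01-01"
  streams:
    - companies
    - contacts
    - deals
target:
  bucket: "s3://mycompany-data-lake"
  prefix: "raw/hubspot"
  format: parquet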

The Complete Setup Script

Save this and run it to set up everything at once:

#!/bin/bash
# setup-data-lake.sh -- Complete startup data lake setup
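# Assumes pipelines/postgres-app.yaml and pipelines/stripe.yaml already exist
# and that the DATABASE_* and STRIPE_API_KEY environment variables are exported
# (see the earlier sections).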
set -e
echo "=== DataSpoc Data Lake Setup ==="
# Install tools
echo "[1/6] Installing DataSpoc..."
pip install dataspoc-pipe dataspoc-lens
# Install taps
echo "[2/6] Installing source connectors..."
dataspoc-pipe add tap-postgres
dataspoc-pipe add tap-stripe
# Create S3 bucket
echo "[3/6] Creating S3 bucket..."
aws s3 mb s3://mycompany-data-lake --region us-east-1
# Run initial extraction
echo "[4/6] Running initial extraction..."
dataspoc-pipe run postgres-app
dataspoc-pipe run stripe
# Configure Lens
echo "[5/6] Configuring query engine..."
dataspoc-lens add-bucket s3://mycompany-data-lake --name datalake
# Setup local AI
echo "[6/6] Setting up AI queries..."
dataspoc-lens setup-ai
echo ""
echo "=== Setup Complete ==="
echo "Tables available:"
dataspoc-lens tables
echo ""
echo "Try: dataspoc-lens ask 'how many users signed up this week?'"

Thirty minutes. Under $5/month. A complete data platform that grows with you. Stop SSHing into production to answer business questions.
