The $0 Data Lake for Startups: DataSpoc + S3 in 30 Minutes
You are the founding engineer at a seed-stage startup. Your CEO asks “how many users signed up last week?” and you SSH into the production server, open psql, write a query, paste the result into Slack. Next week the question is “what is our MRR from Stripe?” and now you are juggling two data sources.
This post builds a complete data platform in 30 minutes for under $5/month. You will ingest from Postgres and Stripe, query with SQL and AI, and connect Claude as an agent. By the end, anyone on your team can ask questions about your data without bugging you.
What You Will Build
```
Postgres ──→ [Pipe] ──→ S3 Bucket ──→ [Lens] ──→ SQL Shell
Stripe   ──→ [Pipe] ──┘                  │
                                         ├──→ AI Ask
                                         ├──→ Jupyter
                                         └──→ MCP → Claude
```

- Pipe extracts from Postgres and Stripe, writes Parquet to S3
- Lens queries the Parquet files with DuckDB
- MCP connects Claude (or Cursor, Windsurf) to your data lake
Total cost: S3 storage (~$2/month for startup-scale data) + DataSpoc ($0, open source).
Minute 0-5: Install Everything
```bash
# Install both tools
pip install dataspoc-pipe dataspoc-lens

# Install the Singer taps you need
dataspoc-pipe add tap-postgres
dataspoc-pipe add tap-stripe

# Verify
dataspoc-pipe --version
dataspoc-lens --version
```

Create an S3 bucket for your data lake:
```bash
aws s3 mb s3://mycompany-data-lake --region us-east-1
```

Minute 5-12: Ingest from Postgres
Create the pipeline config:
```yaml
source:
  tap: tap-postgres
  config:
    host: "${DATABASE_HOST}"
    port: 5432
    database: "${DATABASE_NAME}"
    user: "${DATABASE_READONLY_USER}"
    password: "${DATABASE_READONLY_PASSWORD}"

  tables:
    - name: users
      replication_method: incremental
      replication_key: updated_at

    - name: orders
      replication_method: incremental
      replication_key: updated_at

    - name: subscriptions
      replication_method: incremental
      replication_key: updated_at

target:
  bucket: "s3://mycompany-data-lake"
  prefix: "raw/postgres"
  format: parquet
```

Create a read-only database user (never use your app's credentials for ETL):
```sql
-- Run once on your Postgres database
CREATE USER dataspoc_readonly WITH PASSWORD 'secure-random-password';
GRANT CONNECT ON DATABASE myapp TO dataspoc_readonly;
GRANT USAGE ON SCHEMA public TO dataspoc_readonly;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO dataspoc_readonly;
ALTER DEFAULT PRIVILEGES IN SCHEMA public GRANT SELECT ON TABLES TO dataspoc_readonly;
```

Set the environment variables and run:
```bash
export DATABASE_HOST="your-rds-endpoint.amazonaws.com"
export DATABASE_NAME="myapp"
export DATABASE_READONLY_USER="dataspoc_readonly"
export DATABASE_READONLY_PASSWORD="secure-random-password"

dataspoc-pipe run postgres-app
```

```
[09:05:01] Starting pipeline: postgres-app
[09:05:02] Extracting: users (2,340 rows)
[09:05:03] → raw/postgres/users/users_20260428.parquet (180 KB)
[09:05:03] Extracting: orders (8,920 rows)
[09:05:05] → raw/postgres/orders/orders_20260428.parquet (720 KB)
[09:05:05] Extracting: subscriptions (1,890 rows)
[09:05:06] → raw/postgres/subscriptions/subscriptions_20260428.parquet (95 KB)
[09:05:06] Pipeline complete: 3 tables, 13,150 rows
```
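If you want to sanity-check one of these files before Lens is set up, any DuckDB shell can read them directly. A minimal sketch, assuming you have DuckDB installed with the httpfs extension available and AWS credentials in your environment; the file name is simply the one from the log above:

```sql
-- Optional: inspect a raw Parquet file with plain DuckDB (not required for the setup).
-- Assumes the httpfs extension and AWS credentials in the environment.
INSTALL httpfs;
LOAD httpfs;

SELECT COUNT(*) AS row_count
FROM read_parquet('s3://mycompany-data-lake/raw/postgres/users/users_20260428.parquet');
```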
Minute 12-20: Ingest from Stripe

Create the Stripe pipeline config:

```yaml
source:
  tap: tap-stripe
  config:
    api_key: "${STRIPE_API_KEY}"  # Use a restricted key with read-only access
    start_date: "2025-01-01"

  streams:
    - charges
    - customers
    - invoices
    - subscriptions
    - balance_transactions

target:
  bucket: "s3://mycompany-data-lake"
  prefix: "raw/stripe"
  format: parquet
```

Get a restricted API key from the Stripe Dashboard (Settings > API Keys > Create Restricted Key with read-only access):
```bash
export STRIPE_API_KEY="rk_live_..."

dataspoc-pipe run stripe
```

```
[09:12:01] Starting pipeline: stripe
[09:12:03] Extracting: charges (4,560 records)
[09:12:08] Extracting: customers (2,120 records)
[09:12:11] Extracting: invoices (3,890 records)
[09:12:15] Extracting: subscriptions (1,450 records)
[09:12:18] Extracting: balance_transactions (12,340 records)
[09:12:22] Pipeline complete: 5 streams, 24,360 records
```

Minute 20-25: Connect Lens and Query
Register the bucket with Lens:
```bash
dataspoc-lens add-bucket s3://mycompany-data-lake --name datalake
dataspoc-lens tables
```

```
raw.postgres.users
raw.postgres.orders
raw.postgres.subscriptions
raw.stripe.charges
raw.stripe.customers
raw.stripe.invoices
raw.stripe.subscriptions
raw.stripe.balance_transactions
```

Eight tables from two sources. Now query:
```bash
# How many users signed up last week?
dataspoc-lens query "
  SELECT COUNT(*) AS signups
  FROM raw.postgres.users
  WHERE created_at >= CURRENT_DATE - INTERVAL 7 DAY
"

# What is our MRR?
dataspoc-lens query "
  SELECT SUM(plan_amount / 100.0) AS mrr_dollars
  FROM raw.stripe.subscriptions
  WHERE status = 'active'
"

# Revenue by day this month
dataspoc-lens query "
  SELECT
    DATE_TRUNC('day', created::TIMESTAMP) AS day,
    SUM(amount / 100.0) AS revenue
  FROM raw.stripe.charges
  WHERE status = 'succeeded'
    AND created::TIMESTAMP >= DATE_TRUNC('month', CURRENT_DATE)
  GROUP BY 1
  ORDER BY 1
"
```

Or use AI (set up a provider or use Ollama for free):
```bash
dataspoc-lens setup-ai   # installs Ollama + local model

dataspoc-lens ask "what is our MRR?"
dataspoc-lens ask "how many users signed up each day this week?"
dataspoc-lens ask "what is the average revenue per user?"
```

Minute 25-30: Connect Claude via MCP
Add the MCP server config for Claude Desktop or any MCP-compatible tool:
{ "mcpServers": { "dataspoc-lens": { "command": "dataspoc-lens", "args": ["mcp"], "env": { "AWS_PROFILE": "default" } } }}Now in Claude Desktop, type:
```
What tables do we have in the data lake?
```

Claude calls the MCP server, discovers your tables, and responds. Follow up with:
```
What is our MRR and how has it changed month over month?
```

Claude writes SQL, executes it against your Parquet files, and gives you a formatted answer with analysis. Your CEO can now ask Claude directly instead of pinging you on Slack.
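If you are curious what that looks like under the hood, here is a rough sketch of the kind of query an agent might write for that question. It approximates month-over-month MRR growth from the start month of currently active subscriptions, and the column names (plan_amount, status, created) are the same assumptions used in the manual queries above:

```sql
-- A sketch of the SQL an agent might generate for "MRR month over month".
-- Approximates new MRR by the month each currently active subscription started.
WITH monthly AS (
  SELECT
    DATE_TRUNC('month', created::TIMESTAMP) AS month,
    SUM(plan_amount / 100.0) AS new_mrr
  FROM raw.stripe.subscriptions
  WHERE status = 'active'
  GROUP BY 1
)
SELECT
  month,
  new_mrr,
  new_mrr - LAG(new_mrr) OVER (ORDER BY month) AS change_from_prior_month
FROM monthly
ORDER BY month;
```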
Schedule It
Add a few cron jobs to keep data fresh:
```bash
crontab -e

# Extract from Postgres every hour
0 * * * * cd /opt/dataspoc && dataspoc-pipe run postgres-app >> /var/log/dataspoc/postgres.log 2>&1

# Extract from Stripe every 6 hours (API rate limits)
0 */6 * * * cd /opt/dataspoc && dataspoc-pipe run stripe >> /var/log/dataspoc/stripe.log 2>&1

# Refresh Lens cache after extractions
5 * * * * dataspoc-lens cache refresh-stale >> /var/log/dataspoc/cache.log 2>&1
```

Cost Breakdown
| Component | Monthly Cost | Notes |
|---|---|---|
| DataSpoc Pipe | $0 | Open source |
| DataSpoc Lens | $0 | Open source |
| S3 storage (1 GB) | $0.023 | Parquet is highly compressed |
| S3 requests | ~$1-2 | PUT/GET for extraction and queries |
| Ollama (local AI) | $0 | Runs on your machine |
| Total | < $3/month | |
Compare that to alternatives:
| Alternative | Monthly Cost |
|---|---|
| Snowflake | $400+ (minimum warehouse) |
| BigQuery | $0 (free tier) then $5/TB queried |
| Fivetran + Snowflake | $500+ |
| Airbyte Cloud + Redshift | $400+ |
| DataSpoc + S3 | < $5 |
What You Get
For under $5/month, your startup now has:
- Automated data ingestion from Postgres and Stripe (hourly)
- SQL analytics over cloud Parquet (no warehouse to manage)
- AI-powered queries (free with Ollama, or use Claude/OpenAI)
- MCP integration for Claude Desktop, Cursor, Windsurf
- Incremental extraction (only new data is fetched; see the sketch after this list)
- No infrastructure to manage (S3 + CLI tools)
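Incremental extraction is worth a moment: because each table declares replication_key: updated_at, a run only pulls rows changed since the bookmark saved by the previous run. Conceptually it boils down to a query like the one below; this is illustrative only, since the real query and bookmark storage are handled by tap-postgres:

```sql
-- Conceptual sketch of what an incremental run asks Postgres for.
-- The bookmark value is whatever timestamp the previous run recorded (illustrative here).
SELECT *
FROM users
WHERE updated_at > '2026-04-28 09:05:06'
ORDER BY updated_at;
```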
Growth Path: When to Scale
This setup handles a startup from seed through Series A comfortably. Here is when to consider upgrades:
| Milestone | Action |
|---|---|
| Data > 10 GB | Add more S3 storage ($0.23/month per 10 GB) |
| Need transforms | Add SQL transform files (see our dbt alternative post; a minimal example follows this table) |
| 5+ data sources | Add more Pipe pipelines (same pattern) |
| Need dashboards | Connect Metabase/Superset to DuckDB via Lens |
| Team > 5 analysts | Consider Lens cache on shared server |
| Data > 1 TB | Consider partitioning strategy, still works with DuckDB |
| Need real-time | That is when you look at Kafka + streaming — not yet |
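As a taste of the transform step mentioned above, a transform can be as small as one SQL file that rolls raw Stripe charges up into a daily revenue table. The file name and the way it gets scheduled are assumptions for illustration; the query itself only touches raw.stripe.charges columns already used earlier:

```sql
-- transforms/daily_revenue.sql (hypothetical file name and location)
-- Roll raw Stripe charges up into one row per day of successful revenue.
SELECT
  DATE_TRUNC('day', created::TIMESTAMP) AS day,
  COUNT(*) AS charge_count,
  SUM(amount / 100.0) AS revenue_dollars
FROM raw.stripe.charges
WHERE status = 'succeeded'
GROUP BY 1
ORDER BY 1;
```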
The key insight: you can always migrate to a heavier stack later. Parquet on S3 is the universal data format. If you outgrow DataSpoc, your data works with Spark, Trino, Athena, or Snowflake external tables.
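For example, if you later adopt Athena, pointing it at the same files is a single DDL statement. A sketch, with an illustrative column subset rather than the full tap-stripe schema:

```sql
-- Athena-style external table over the Parquet files DataSpoc already wrote.
-- Column list is a small illustrative subset of the charges schema.
CREATE EXTERNAL TABLE stripe_charges (
  id      string,
  amount  bigint,
  status  string,
  created timestamp
)
STORED AS PARQUET
LOCATION 's3://mycompany-data-lake/raw/stripe/charges/';
```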
Adding More Sources Later
DataSpoc Pipe supports 400+ Singer taps. Add any source in minutes:
```bash
# HubSpot CRM
dataspoc-pipe add tap-hubspot

# Google Analytics
dataspoc-pipe add tap-google-analytics

# Salesforce
dataspoc-pipe add tap-salesforce

# MySQL (if you have a second database)
dataspoc-pipe add tap-mysql

# Google Sheets (for manual data)
dataspoc-pipe add tap-google-sheets
```

Each source gets its own pipeline YAML and runs on its own schedule. All data lands in the same S3 bucket, queryable through the same Lens interface.
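Once a second source lands, cross-source joins are just SQL. A sketch, assuming a hypothetical raw.hubspot.contacts table with an email column; the actual stream names and columns depend on the tap:

```sql
-- Hypothetical cross-source query after adding tap-hubspot.
-- raw.hubspot.contacts and its email column are assumptions, not a guaranteed schema.
SELECT
  c.email,
  COUNT(ch.id) AS charge_count,
  SUM(ch.amount / 100.0) AS lifetime_revenue
FROM raw.hubspot.contacts AS c
JOIN raw.stripe.customers AS cu ON cu.email = c.email
JOIN raw.stripe.charges   AS ch ON ch.customer = cu.id
WHERE ch.status = 'succeeded'
GROUP BY 1
ORDER BY lifetime_revenue DESC
LIMIT 20;
```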
The Complete Setup Script
Save this and run it to set up everything at once:
```bash
#!/bin/bash
# setup-data-lake.sh -- Complete startup data lake setup
set -e

echo "=== DataSpoc Data Lake Setup ==="

# Install tools
echo "[1/6] Installing DataSpoc..."
pip install dataspoc-pipe dataspoc-lens

# Install taps
echo "[2/6] Installing source connectors..."
dataspoc-pipe add tap-postgres
dataspoc-pipe add tap-stripe

# Create S3 bucket
echo "[3/6] Creating S3 bucket..."
aws s3 mb s3://mycompany-data-lake --region us-east-1

# Run initial extraction
echo "[4/6] Running initial extraction..."
dataspoc-pipe run postgres-app
dataspoc-pipe run stripe

# Configure Lens
echo "[5/6] Configuring query engine..."
dataspoc-lens add-bucket s3://mycompany-data-lake --name datalake

# Setup local AI
echo "[6/6] Setting up AI queries..."
dataspoc-lens setup-ai

echo ""
echo "=== Setup Complete ==="
echo "Tables available:"
dataspoc-lens tables
echo ""
echo "Try: dataspoc-lens ask 'how many users signed up this week?'"
```

Thirty minutes. Under $5/month. A complete data platform that grows with you. Stop SSHing into production to answer business questions.