duckdb · snowflake · cost-reduction · data-warehouse

We Replaced Our $8k/month Snowflake with DuckDB and Parquet

Michael San Martim · 2026-04-23

This is the story of how we went from $8,000/month on Snowflake to effectively $0/month (about $59/month in S3 costs) — without losing anything that mattered to our team of 12 analysts.

The Problem

Our Snowflake bill was growing 20% month-over-month. We had:

  • 2.3 TB of data across 47 tables
  • 12 analysts running queries
  • 90% of queries were simple: aggregations, filters, joins on 3-4 tables
  • Average query scanned less than 50 GB
  • The warehouse was idle 18 hours/day

We were paying enterprise data warehouse prices for workloads that a laptop could handle.

The Discovery

Parquet files on S3 + DuckDB = a free data warehouse for our scale.

DuckDB can:

  • Read Parquet directly from S3 (no COPY INTO, no staging)
  • Execute complex SQL with joins, window functions, CTEs
  • Process 50 GB in seconds on a single machine
  • Cache results for repeated queries
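
To make the S3 piece concrete, here is a minimal sketch in plain DuckDB SQL. It assumes the httpfs extension is available, that credentials come from the environment, and the bucket path is illustrative:

-- Enable reading directly from S3 (no COPY INTO, no staging)
INSTALL httpfs;
LOAD httpfs;
SET s3_region = 'us-east-1';  -- region of the (illustrative) bucket

-- Query Parquet files in place, straight from the bucket
SELECT product_category, SUM(amount) AS revenue
FROM read_parquet('s3://company-lake/curated/finance/revenue/*.parquet')
GROUP BY product_category;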

The missing piece was tooling. We needed:

  • Table discovery (what’s in the bucket?)
  • Schema management (what columns exist?)
  • A query interface for analysts
  • AI integration for business users

DataSpoc Lens provided all of that.

The Migration

Step 1: Export from Snowflake to Parquet

We already had our raw data in S3 (Snowflake was loading from there). For curated tables, we exported:

Terminal window
pip install dataspoc-pipe
# We used Pipe to extract our Snowflake curated tables to Parquet
dataspoc-pipe add snowflake \
--account xy12345.us-east-1 \
--database ANALYTICS \
--schema CURATED \
--tables revenue,customers,products,orders,sessions \
--destination s3://company-lake

One-time migration took 15 minutes for 2.3 TB.
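
If you would rather not add another tool for the export, Snowflake's native Parquet unload does the same job. A sketch, assuming an external stage (here called my_s3_stage, a hypothetical name) already points at the bucket; repeat per table:

-- Unload one curated table to Parquet on the existing S3 stage
COPY INTO @my_s3_stage/curated/finance/revenue/
FROM ANALYTICS.CURATED.REVENUE
FILE_FORMAT = (TYPE = PARQUET)
HEADER = TRUE;  -- keep the real column names in the Parquet output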

Step 2: Point Lens at the Bucket

Terminal window
pip install dataspoc-lens
dataspoc-lens add-bucket s3://company-lake
dataspoc-lens discover

Output:

Discovered 47 tables in s3://company-lake:
raw/postgres/orders (1.2M rows, 340 MB)
raw/postgres/customers (89K rows, 12 MB)
raw/postgres/products (2.4K rows, 1.1 MB)
curated/finance/revenue (4.2M rows, 890 MB)
curated/product/sessions (12M rows, 2.1 GB)
...
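
Discovery like this can also be approximated with ordinary DuckDB if you want to sanity-check a bucket by hand. A rough sketch against one of the discovered tables:

-- List the Parquet files behind one table
SELECT * FROM glob('s3://company-lake/raw/postgres/orders/*.parquet');

-- Inspect that table's columns and types without downloading the data
DESCRIBE SELECT *
FROM read_parquet('s3://company-lake/raw/postgres/orders/*.parquet');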

Step 3: Query Like Snowflake

The SQL is essentially identical — only the fully qualified table names change. Every query we had in Snowflake ran in DuckDB without any other modification:

Snowflake:

-- snowflake worksheet
SELECT
  DATE_TRUNC('month', order_date) AS month,
  product_category,
  SUM(amount) AS revenue,
  COUNT(DISTINCT customer_id) AS unique_customers
FROM analytics.curated.revenue
WHERE order_date >= '2024-01-01'
GROUP BY 1, 2
ORDER BY 1, 2;

DataSpoc Lens (same SQL):

Terminal window
dataspoc-lens query "
SELECT
  DATE_TRUNC('month', order_date) AS month,
  product_category,
  SUM(amount) AS revenue,
  COUNT(DISTINCT customer_id) AS unique_customers
FROM curated.finance.revenue
WHERE order_date >= '2024-01-01'
GROUP BY 1, 2
ORDER BY 1, 2
"

Same results. Zero cost. 1.2 seconds instead of 3.4 seconds (Snowflake XS warehouse).
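
If you want to reproduce the warehouse-style table names in a bare DuckDB session, a view over the Parquet prefix gets you most of the way there. A sketch using a two-part name instead of the three-part one above:

-- One-time setup: expose the Parquet prefix under a friendly name
CREATE SCHEMA IF NOT EXISTS finance;
CREATE OR REPLACE VIEW finance.revenue AS
  SELECT * FROM read_parquet('s3://company-lake/curated/finance/revenue/*.parquet');

-- After that, the query reads almost exactly like the warehouse version
SELECT
  DATE_TRUNC('month', order_date) AS month,
  product_category,
  SUM(amount) AS revenue
FROM finance.revenue
WHERE order_date >= '2024-01-01'
GROUP BY 1, 2
ORDER BY 1, 2;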

Step 4: Give Analysts an Interactive Shell

Terminal window
dataspoc-lens shell
DataSpoc Lens Shell (DuckDB 0.10.1)
Connected to: s3://company-lake (47 tables)
Type .tables to list, .schema <table> to inspect, .quit to exit
lens> .tables
┌──────────────────────────┬──────┬────────┐
│ table                    │ rows │ size   │
├──────────────────────────┼──────┼────────┤
│ curated.finance.revenue  │ 4.2M │ 890 MB │
│ curated.product.sessions │ 12M  │ 2.1 GB │
│ raw.postgres.orders      │ 1.2M │ 340 MB │
│ raw.postgres.customers   │ 89K  │ 12 MB  │
│ ...                      │      │        │
└──────────────────────────┴──────┴────────┘
lens> SELECT COUNT(*) FROM curated.finance.revenue WHERE order_date >= '2024-01-01';
┌─────────┐
│ count   │
├─────────┤
│ 1847293 │
└─────────┘
(0.4s)

Analysts who knew SQL were productive immediately. No retraining.
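
If an analyst ends up in a plain DuckDB client instead of the Lens shell, the same inspection workflow is available through standard SQL statements. A sketch, assuming the finance.revenue view from the sketch in Step 3:

SHOW ALL TABLES;            -- list tables and views across the catalog
DESCRIBE finance.revenue;   -- column names and types
SUMMARIZE finance.revenue;  -- quick per-column stats (scans the data, so slower over S3)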

Step 5: AI Ask for Business Users

For the CEO and PMs who don’t write SQL:

Terminal window
dataspoc-lens ask "What was our MRR growth last quarter?"
Based on curated.finance.revenue:
MRR grew from $1.23M (Jul 2024) to $1.47M (Sep 2024), a 19.5% increase
over Q3. The strongest month was September (+8.2% MoM), driven by
Enterprise plan upgrades.
SQL used:
SELECT DATE_TRUNC('month', order_date) AS month,
       SUM(amount) AS mrr
FROM curated.finance.revenue
WHERE order_date >= '2024-07-01' AND order_date < '2024-10-01'
  AND revenue_type = 'recurring'
GROUP BY 1 ORDER BY 1

The AI shows its work — the exact SQL query — so analysts can verify.

Performance Comparison

We benchmarked our 20 most common queries:

Query Type                   | Snowflake (XS) | DataSpoc Lens (DuckDB)
Simple aggregation (1 table) | 1.8s           | 0.3s
Join 2 tables + group by     | 3.2s           | 1.1s
Window function over 1M rows | 4.1s           | 1.8s
Full table scan 12M rows     | 6.7s           | 4.2s
Complex CTE with 4 joins     | 8.3s           | 3.9s

DuckDB was faster for every single query at our scale. The secret: no network round-trip to a cloud warehouse, Parquet’s columnar format means DuckDB only reads the columns it needs, and S3 reads are parallelized.
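
The column-pruning effect is easy to check yourself: ask DuckDB for the query plan and the Parquet scan lists only the columns the query actually touches. A sketch (exact plan output varies by version):

-- The scan's projection list should contain only order_date and amount;
-- the other columns in the files are never fetched from S3
EXPLAIN
SELECT DATE_TRUNC('month', order_date) AS month, SUM(amount) AS revenue
FROM read_parquet('s3://company-lake/curated/finance/revenue/*.parquet')
GROUP BY 1;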

The Cost Comparison

Cost Center           | Snowflake | DataSpoc Lens
Compute (warehouse)   | $6,200/mo | $0
Storage (managed)     | $1,800/mo | $0 (already paying for S3)
S3 storage            | $47/mo    | $47/mo (same data)
S3 GET requests       | n/a       | $12/mo
DataSpoc Lens license | n/a       | $0 (open source)
Total                 | $8,047/mo | $59/mo

Annual savings: ($8,047 - $59) × 12 = $95,856.

What We Gained

Beyond cost savings:

  1. No warehouse management. No sizing, no auto-suspend tuning, no credit monitoring.
  2. Instant start. No cold-start warehouse spin-up (15-45 seconds on Snowflake).
  3. Local development. Analysts can query the same data locally with dataspoc-lens shell.
  4. AI-native. MCP server means Claude, GPT, and internal agents query our data directly.
  5. Portable. Not locked into any vendor. Parquet is an open format.

What We Lost (and honest caveats)

This approach does NOT replace Snowflake if you have:

Requirement              | Snowflake                | DataSpoc Lens
50+ concurrent analysts  | Handles it               | Single-user per process
Petabyte-scale data      | Distributed compute      | Single-machine (limit ~100 GB effective)
Fine-grained RBAC        | Native roles + policies  | IAM-based (bucket level)
Time travel / versioning | Built-in                 | You manage Parquet versions
Semi-structured (JSON)   | VARIANT type             | DuckDB JSON functions (works, less elegant)
Governance / lineage     | Native + partners        | DIY
Data sharing             | Secure shares            | Share the bucket (less governed)

Our rule of thumb: if your data fits in a single machine’s memory (up to ~100 GB actively queried), you probably don’t need a cloud warehouse. That covers 80% of startups and mid-market companies.

The Migration Playbook

For teams considering this:

  1. Audit your Snowflake usage. Query ACCOUNT_USAGE.QUERY_HISTORY for data volumes scanned and concurrency (see the example query after the command summary below).
  2. If 90% of queries scan < 50 GB: you’re a candidate.
  3. Export curated tables to Parquet on S3. Use DataSpoc Pipe or COPY INTO with Parquet format.
  4. Set up Lens. pip install dataspoc-lens && dataspoc-lens add-bucket s3://your-bucket
  5. Run your top 20 queries. Validate correctness and performance.
  6. Migrate analysts gradually. Keep Snowflake alive for 1 month as fallback.
  7. Turn off the warehouse.
Terminal window
# The full migration in 4 commands
pip install dataspoc-pipe dataspoc-lens
dataspoc-pipe add snowflake --account YOUR_ACCOUNT --tables ALL --destination s3://your-lake
dataspoc-pipe run
dataspoc-lens add-bucket s3://your-lake
dataspoc-lens shell # You're done
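
For step 1, the audit boils down to one query against Snowflake's own metadata. A hedged sketch; the 90-day window and 50 GB threshold are simply the ones we cared about:

-- Share of recent SELECTs that scanned under 50 GB, plus the p90 scan size
SELECT
  COUNT_IF(BYTES_SCANNED < 50 * 1024 * 1024 * 1024) / COUNT(*) AS pct_under_50gb,
  APPROX_PERCENTILE(BYTES_SCANNED, 0.9) AS p90_bytes_scanned
FROM SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY
WHERE START_TIME >= DATEADD('day', -90, CURRENT_TIMESTAMP())
  AND QUERY_TYPE = 'SELECT';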

One Year Later

12 months after the migration:

  • Saved $95K in Snowflake costs
  • Query performance improved (no cold starts)
  • Analysts are happier (instant shell, AI Ask)
  • Zero incidents related to the migration
  • Added MCP server so product bots query data autonomously

The $8K/month was buying us things we didn’t need. For teams at our scale, the cloud data warehouse is a solved problem — and the solution is free.
