
DuckDB vs Spark for Data Lake Queries: When Each Wins

Michael San Martim · 2026-04-24

DataSpoc Lens uses DuckDB to query Parquet files in cloud storage. Databricks uses Spark. Both can query data lakes, but they are built for fundamentally different scales and use cases. This post gives you the data to choose the right one.

The Core Difference

DuckDB is an in-process analytical database. It runs inside your Python process, your CLI, or your laptop. Zero infrastructure. One process. One machine.

Spark is a distributed computation engine. It runs across a cluster of machines. Requires infrastructure: a cluster manager (YARN, Kubernetes, or Databricks), driver nodes, worker nodes, configuration.

DuckDB:
Your laptop → DuckDB (in-process) → Parquet in S3 → Results
Spark:
Your laptop → Spark Driver → Cluster Manager → N Worker Nodes → Parquet in S3 → Shuffle → Results

Setup Comparison

DuckDB (via DataSpoc Lens)

pip install dataspoc-lens

from dataspoc_lens import LensClient

lens = LensClient()
df = lens.query("SELECT region, SUM(amount) FROM curated_sales GROUP BY region")
print(df)

Time to first query: under 30 seconds (install + run).

Spark (via PySpark)

pip install pyspark
# Also need: Java 8+, Hadoop config, S3 credentials config

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("data-lake-query") \
    .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:3.3.4") \
    .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") \
    .config("spark.hadoop.fs.s3a.aws.credentials.provider",
            "com.amazonaws.auth.DefaultAWSCredentialsProviderChain") \
    .config("spark.driver.memory", "4g") \
    .config("spark.sql.parquet.enableVectorizedReader", "true") \
    .getOrCreate()
df = spark.read.parquet("s3a://my-data-lake/curated/sales/")
df.createOrReplaceTempView("curated_sales")
result = spark.sql("SELECT region, SUM(amount) FROM curated_sales GROUP BY region")
result.show()

Time to first query: 5-15 minutes (install Java, configure Hadoop, debug S3 auth, wait for Spark startup).

On Databricks, setup is faster (managed service), but you are paying $0.15-0.75 per DBU-hour from the moment you start a cluster.

Benchmark: Same Queries, Different Scales

We ran 5 queries across 4 dataset sizes on Parquet files in S3. DuckDB ran on a single machine (8 cores, 32GB RAM). Spark ran on a 3-node cluster (8 cores, 32GB RAM each = 24 cores, 96GB total).

Query 1: Simple Aggregation

SELECT region, SUM(amount) as total, COUNT(*) as orders
FROM sales
GROUP BY region
ORDER BY total DESC
Dataset Size     | DuckDB | Spark | Winner
1M rows (200MB)  | 0.3s   | 8.2s  | DuckDB (27x)
10M rows (2GB)   | 1.8s   | 9.5s  | DuckDB (5x)
100M rows (20GB) | 14s    | 18s   | DuckDB (1.3x)
1B rows (200GB)  | OOM    | 45s   | Spark

Query 2: Join + Aggregation

SELECT c.segment, DATE_TRUNC('month', s.sale_date) as month,
       SUM(s.amount) as revenue, COUNT(DISTINCT s.customer_id) as customers
FROM sales s
JOIN customers c ON s.customer_id = c.customer_id
GROUP BY c.segment, month
ORDER BY month, revenue DESC
Dataset Size | DuckDB | Spark | Winner
1M rows      | 0.5s   | 12s   | DuckDB (24x)
10M rows     | 3.2s   | 14s   | DuckDB (4x)
100M rows    | 28s    | 22s   | Spark (1.3x)
1B rows      | OOM    | 68s   | Spark

Query 3: Window Function

SELECT customer_id, sale_date, amount,
       SUM(amount) OVER (PARTITION BY customer_id ORDER BY sale_date) as running_total,
       ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY sale_date DESC) as recency_rank
FROM sales
WHERE sale_date >= '2025-01-01'
Dataset Size | DuckDB | Spark | Winner
1M rows      | 0.8s   | 15s   | DuckDB (19x)
10M rows     | 6.5s   | 18s   | DuckDB (3x)
100M rows    | 52s    | 35s   | Spark (1.5x)
1B rows      | OOM    | 120s  | Spark

Query 4: Full Table Scan with Complex Filter

SELECT * FROM sales
WHERE amount > 100
  AND region IN ('NA', 'EU')
  AND sale_date BETWEEN '2025-06-01' AND '2025-12-31'
  AND product LIKE '%Pro%'
ORDER BY amount DESC
LIMIT 1000
Dataset Size | DuckDB | Spark | Winner
1M rows      | 0.2s   | 7s    | DuckDB (35x)
10M rows     | 0.9s   | 8s    | DuckDB (9x)
100M rows    | 6s     | 12s   | DuckDB (2x)
1B rows      | 45s    | 15s   | Spark (3x)

Query 5: Ad-Hoc Exploration

-- "What does this data look like?"
SELECT * FROM sales LIMIT 100
Dataset Size | DuckDB | Spark | Winner
Any size     | 0.1s   | 5-8s  | DuckDB (50-80x)

Spark’s overhead is constant: cluster initialization, job scheduling, task distribution. For ad-hoc exploration, DuckDB is always faster because there is no cluster to coordinate.
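The fixed-overhead pattern in these numbers can be captured with a rough linear model: total latency is a constant startup cost plus a per-row scan cost. The coefficients below are eyeballed from the Query 1 table and are purely illustrative.

```python
def est_seconds(rows, startup_s, scan_s_per_100m):
    """Crude latency model: fixed startup cost plus linear scan cost."""
    return startup_s + (rows / 100e6) * scan_s_per_100m

# Coefficients eyeballed from the Query 1 results:
# DuckDB: ~0.1s startup, ~14s per 100M rows
# Spark:  ~8s startup,  ~10s per 100M rows
duckdb_est = est_seconds(1e6, 0.1, 14)   # ~0.24s (measured: 0.3s)
spark_est = est_seconds(1e6, 8.0, 10)    # ~8.1s (measured: 8.2s)
print(duckdb_est, spark_est)
```

The model explains the crossover: at small row counts Spark's startup term dominates, while at large row counts its lower per-row cost (spread across the cluster) wins.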

Cost Comparison

DuckDB (via DataSpoc Lens)

Software: $0 (open-source)
Compute: Your laptop (free) or a VM ($50-150/month)
Storage: S3/GCS/Azure ($0.02/GB/month)
Monthly cost for 500GB data lake: ~$60

Spark on Databricks

Software: Databricks DBU pricing ($0.15-0.75/DBU-hour)
Compute: Cluster instances (e.g., 3x i3.xlarge = $0.94/hour)
Storage: S3/GCS/Azure ($0.02/GB/month)
Cluster uptime: Even "serverless" has startup cost
Monthly cost for 500GB data lake with moderate query load:
Cluster: $200-800/month (depending on uptime)
DBUs: $150-500/month
Storage: $10/month
Total: ~$360-1,310/month

Spark on EMR

EMR fee: $0.015-0.27/hour per instance
EC2 instances: 3x m5.xlarge = $0.576/hour
Storage: $10/month
Monthly cost (8 hours/day usage): ~$300-500/month

When DuckDB Wins

1. Data Fits in Memory (< 100GB)

DuckDB processes data in-memory on a single machine. For datasets under 100GB (which covers the vast majority of company data lakes), it is usually faster than Spark because there is no cluster overhead.

from dataspoc_lens import LensClient

lens = LensClient()
# 50GB dataset? DuckDB handles it fine on a 64GB machine
df = lens.query("""
    SELECT category, DATE_TRUNC('week', event_date) as week,
           COUNT(*) as events, COUNT(DISTINCT user_id) as users
    FROM curated_events
    GROUP BY category, week
    ORDER BY week DESC, events DESC
""")

2. Ad-Hoc Exploration and Development

When you are exploring data, writing queries iteratively, and checking results, DuckDB’s sub-second response time changes how you work:

# Exploration loop — instant feedback
lens.query("SELECT * FROM curated_sales LIMIT 5") # 0.1s
lens.query("SELECT DISTINCT region FROM curated_sales") # 0.2s
lens.query("SELECT region, COUNT(*) FROM curated_sales GROUP BY 1") # 0.3s
# vs. Spark: 8s, 8s, 9s for the same queries

3. CI/CD and Testing

Running analytical queries in CI pipelines requires fast startup and zero infrastructure:

# GitHub Actions — test your SQL transformations
- name: Test data transformations
  run: |
    pip install dataspoc-lens
    python -c "
    from dataspoc_lens import LensClient
    lens = LensClient()
    result = lens.query('SELECT COUNT(*) as cnt FROM curated_sales')
    assert result['cnt'].iloc[0] > 0, 'Sales table is empty'
    "

4. AI Agent Integration

DuckDB starts instantly, which is critical for AI agents that need to execute multiple queries in a conversation:

# Agent loop: 5 queries in sequence
# DuckDB: 0.3s + 0.2s + 0.5s + 0.3s + 0.4s = 1.7s total
# Spark: 8s + 8s + 9s + 8s + 9s = 42s total
# Users will not wait 42 seconds for each agent response
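Spelled out as code, with the per-query latencies taken from the comment above:

```python
# Per-query latencies (seconds) from the agent-loop example above
duckdb_latencies = [0.3, 0.2, 0.5, 0.3, 0.4]
spark_latencies = [8, 8, 9, 8, 9]

# The user waits for the sum, since agent tool calls run sequentially
duckdb_total = sum(duckdb_latencies)  # ~1.7s for the whole agent turn
spark_total = sum(spark_latencies)    # 42s
print(duckdb_total, spark_total)
```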

When Spark Wins

1. Data Exceeds Single-Machine Memory (> 200GB)

When your dataset is larger than what fits in RAM, Spark distributes the work across a cluster:

# 2TB dataset — Spark distributes across 20 nodes
spark.sql("""
    SELECT user_id, SUM(amount) as lifetime_value
    FROM massive_events  -- 2TB, 10 billion rows
    GROUP BY user_id
    HAVING SUM(amount) > 1000
""").write.parquet("s3://bucket/gold/high_value_users/")

2. Production ETL Pipelines

Spark’s fault tolerance (RDD lineage, speculative execution, task retry) makes it reliable for scheduled production jobs:

# Daily pipeline that processes yesterday's data
# If a node dies mid-job, Spark retries the failed tasks
spark.sql("""
    INSERT OVERWRITE TABLE gold_daily_metrics
    SELECT DATE(event_time) as day, COUNT(*) as events
    FROM raw_events
    WHERE DATE(event_time) = CURRENT_DATE - 1
    GROUP BY day
""")

3. Machine Learning at Scale

Spark MLlib handles distributed training on massive datasets:

from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.feature import VectorAssembler

# feature_cols and training_data are assumed to be defined upstream
# Train on 1 billion rows — distributed across cluster
assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
rf = RandomForestClassifier(numTrees=100, maxDepth=10)
pipeline = Pipeline(stages=[assembler, rf])
model = pipeline.fit(training_data)  # distributed training

4. Streaming

Spark Structured Streaming handles real-time data ingestion and processing:

stream = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "kafka:9092") \
    .option("subscribe", "events") \
    .load()

# Process in real-time, write to data lake
stream.writeStream \
    .format("parquet") \
    .option("path", "s3://bucket/raw/events/") \
    .option("checkpointLocation", "s3://bucket/checkpoints/events/") \
    .start()

DuckDB does not do streaming. It is a batch query engine.

DataSpoc’s Bet: DuckDB for 90% of Teams

Most companies have data lakes under 100GB. Most analytical queries are ad-hoc: someone asks a question, an analyst writes SQL, the answer comes back. For this workload, DuckDB is strictly better than Spark:

  • Faster for datasets under 100GB
  • Cheaper by 10-50x (no cluster to run)
  • Simpler to operate (no JVM, no YARN, no cluster config)
  • Better for AI agents (instant startup, sub-second queries)

from dataspoc_lens import LensClient
lens = LensClient()
# This is all you need. No cluster. No JVM. No config.
df = lens.query("SELECT region, SUM(amount) FROM curated_sales GROUP BY region")
print(df)

If you outgrow DuckDB — your lake passes 200GB, you need streaming, you need distributed ML — Spark is there. The data is still Parquet in a cloud bucket. Switching query engines does not require migrating data.

That is the beauty of the open format: Parquet does not care what reads it. Start with DuckDB. Graduate to Spark if and when you need it. Most teams never will.

Quick Decision Framework

Is your data lake > 200GB?
  → Yes: Consider Spark
  → No: Use DuckDB (DataSpoc Lens)

Do you need real-time streaming?
  → Yes: Spark Structured Streaming
  → No: DuckDB

Are you building production ETL that runs 24/7?
  → Yes: Spark (fault tolerance matters)
  → No: DuckDB

Are you doing ad-hoc analysis or AI agent queries?
  → Yes: DuckDB (speed matters)
  → No: Either works

What is your infrastructure budget?
  → $0-100/month: DuckDB (no cluster needed)
  → $500+/month: Spark is viable
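The framework reduces to a few lines of code. This is a toy helper encoding the questions above, purely illustrative and not a DataSpoc API:

```python
def choose_engine(lake_gb, needs_streaming=False, always_on_etl=False):
    """Toy encoding of the decision framework above; illustrative only."""
    if needs_streaming or always_on_etl or lake_gb > 200:
        return "spark"
    return "duckdb"

print(choose_engine(50))                        # duckdb
print(choose_engine(2000))                      # spark
print(choose_engine(50, needs_streaming=True))  # spark
```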

For 90% of data teams, the answer is DuckDB. DataSpoc Lens makes it effortless.
