DuckDB vs Spark for Data Lake Queries: When Each Wins
DataSpoc Lens uses DuckDB to query Parquet files in cloud storage. Databricks uses Spark. Both can query data lakes, but they are built for fundamentally different scales and use cases. This post gives you the data to choose the right one.
The Core Difference
DuckDB is an in-process analytical database. It runs inside your Python process, your CLI, or your laptop. Zero infrastructure. One process. One machine.
Spark is a distributed computation engine. It runs across a cluster of machines. Requires infrastructure: a cluster manager (YARN, Kubernetes, or Databricks), driver nodes, worker nodes, configuration.
DuckDB: Your laptop → DuckDB (in-process) → Parquet in S3 → Results
Spark: Your laptop → Spark Driver → Cluster Manager → N Worker Nodes → Parquet in S3 → Shuffle → Results

Setup Comparison
DuckDB (via DataSpoc Lens)
```bash
pip install dataspoc-lens
```

```python
from dataspoc_lens import LensClient

lens = LensClient()
df = lens.query("SELECT region, SUM(amount) FROM curated_sales GROUP BY region")
print(df)
```

Time to first query: under 30 seconds (install + run).
Spark (via PySpark)
```bash
pip install pyspark
# Also need: Java 8+, Hadoop config, S3 credentials config
```

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("data-lake-query") \
    .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:3.3.4") \
    .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") \
    .config("spark.hadoop.fs.s3a.aws.credentials.provider", "com.amazonaws.auth.DefaultAWSCredentialsProviderChain") \
    .config("spark.driver.memory", "4g") \
    .config("spark.sql.parquet.enableVectorizedReader", "true") \
    .getOrCreate()

df = spark.read.parquet("s3a://my-data-lake/curated/sales/")
df.createOrReplaceTempView("curated_sales")

result = spark.sql("SELECT region, SUM(amount) FROM curated_sales GROUP BY region")
result.show()
```

Time to first query: 5-15 minutes (install Java, configure Hadoop, debug S3 auth, wait for Spark startup).
On Databricks, setup is faster (managed service), but you are paying $0.15-0.75 per DBU-hour from the moment you start a cluster.
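To see how those DBU charges accumulate, here is a rough cost sketch in Python. All rates and the cluster shape are illustrative assumptions based on the ranges quoted above, not actual pricing:

```python
# Rough Databricks cost model: DBU charges plus underlying EC2 instances.
# Every rate here is an illustrative assumption, not a quote.
def monthly_databricks_cost(hours_per_day, nodes=3, dbu_per_node_hour=1.0,
                            dbu_rate=0.40, ec2_rate_per_node=0.31):
    hours = hours_per_day * 30
    dbu_cost = hours * nodes * dbu_per_node_hour * dbu_rate
    ec2_cost = hours * nodes * ec2_rate_per_node
    return round(dbu_cost + ec2_cost, 2)

# A 3-node cluster used 8 hours a day, every day:
print(monthly_databricks_cost(8))  # -> 511.2, i.e. ~$500/month before storage
```

Tweak the rates for your region and instance types; the point is that the meter runs for every hour the cluster is up, whether or not queries are running.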
Benchmark: Same Queries, Different Scales
We ran 5 queries across 4 dataset sizes on Parquet files in S3. DuckDB ran on a single machine (8 cores, 32GB RAM). Spark ran on a 3-node cluster (8 cores, 32GB RAM each = 24 cores, 96GB total).
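Timings like the ones below can be reproduced with a small wall-clock harness. This is a sketch, not the exact benchmark script; `run_query` stands in for either engine's query call:

```python
import time

def time_query(run_query, repeats=3):
    """Run a query several times and report the best wall-clock time.

    `run_query` is any zero-argument callable, e.g.
    lambda: lens.query(sql) or lambda: spark.sql(sql).collect().
    Taking the best of several runs reduces noise from caching and JIT warmup.
    """
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        run_query()
        best = min(best, time.perf_counter() - start)
    return best

# Example with a trivial stand-in workload:
elapsed = time_query(lambda: sum(range(1_000_000)))
print(f"{elapsed:.3f}s")
```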
Query 1: Simple Aggregation
```sql
SELECT region, SUM(amount) AS total, COUNT(*) AS orders
FROM sales
GROUP BY region
ORDER BY total DESC
```

| Dataset Size | DuckDB | Spark | Winner |
|---|---|---|---|
| 1M rows (200MB) | 0.3s | 8.2s | DuckDB (27x) |
| 10M rows (2GB) | 1.8s | 9.5s | DuckDB (5x) |
| 100M rows (20GB) | 14s | 18s | DuckDB (1.3x) |
| 1B rows (200GB) | OOM | 45s | Spark |
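The OOM row is predictable in advance. A very rough heuristic, sketched below, is to compare on-disk Parquet size against a small multiple of RAM. The 2x factor is an assumption: DuckDB streams scans and can spill some operators to disk, so the real limit depends heavily on the query shape.

```python
def duckdb_likely_ok(parquet_gb, ram_gb, factor=2.0):
    """Very rough heuristic: DuckDB streams scans, so data can exceed RAM,
    but joins and aggregations with large intermediate state start to
    struggle once on-disk size reaches a small multiple of RAM.

    `factor` is an assumption, not a DuckDB guarantee.
    """
    return parquet_gb <= ram_gb * factor

print(duckdb_likely_ok(20, 32))   # the 100M-row, 20GB case on a 32GB machine
print(duckdb_likely_ok(200, 32))  # the 1B-row, 200GB case that OOMs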
Query 2: Join + Aggregation
```sql
SELECT
    c.segment,
    DATE_TRUNC('month', s.sale_date) AS month,
    SUM(s.amount) AS revenue,
    COUNT(DISTINCT s.customer_id) AS customers
FROM sales s
JOIN customers c ON s.customer_id = c.customer_id
GROUP BY c.segment, month
ORDER BY month, revenue DESC
```

| Dataset Size | DuckDB | Spark | Winner |
|---|---|---|---|
| 1M rows | 0.5s | 12s | DuckDB (24x) |
| 10M rows | 3.2s | 14s | DuckDB (4x) |
| 100M rows | 28s | 22s | Spark (1.3x) |
| 1B rows | OOM | 68s | Spark |
Query 3: Window Function
```sql
SELECT
    customer_id,
    sale_date,
    amount,
    SUM(amount) OVER (PARTITION BY customer_id ORDER BY sale_date) AS running_total,
    ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY sale_date DESC) AS recency_rank
FROM sales
WHERE sale_date >= '2025-01-01'
```

| Dataset Size | DuckDB | Spark | Winner |
|---|---|---|---|
| 1M rows | 0.8s | 15s | DuckDB (19x) |
| 10M rows | 6.5s | 18s | DuckDB (3x) |
| 100M rows | 52s | 35s | Spark (1.5x) |
| 1B rows | OOM | 120s | Spark |
Query 4: Full Table Scan with Complex Filter
```sql
SELECT *
FROM sales
WHERE amount > 100
  AND region IN ('NA', 'EU')
  AND sale_date BETWEEN '2025-06-01' AND '2025-12-31'
  AND product LIKE '%Pro%'
ORDER BY amount DESC
LIMIT 1000
```

| Dataset Size | DuckDB | Spark | Winner |
|---|---|---|---|
| 1M rows | 0.2s | 7s | DuckDB (35x) |
| 10M rows | 0.9s | 8s | DuckDB (9x) |
| 100M rows | 6s | 12s | DuckDB (2x) |
| 1B rows | 45s | 15s | Spark (3x) |
Query 5: Ad-Hoc Exploration
```sql
-- "What does this data look like?"
SELECT * FROM sales LIMIT 100
```

| Dataset Size | DuckDB | Spark | Winner |
|---|---|---|---|
| Any size | 0.1s | 5-8s | DuckDB (50-80x) |
Spark’s overhead is constant: cluster initialization, job scheduling, task distribution. For ad-hoc exploration, DuckDB is always faster because there is no cluster to coordinate.
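This pattern of constant overhead plus per-row throughput explains the crossovers in the tables above. A toy latency model makes it concrete; all constants here are rough fits to the Query 1 numbers, not measurements:

```python
def duckdb_t(rows, rate=7e6):
    """Toy model: no startup cost, ~7M rows/s single-machine throughput."""
    return rows / rate

def spark_t(rows, overhead=8.0, rate=1e7):
    """Toy model: ~8s fixed cluster/scheduling cost, ~10M rows/s after that."""
    return overhead + rows / rate

# Spark starts to win where the curves cross:
# rows / duckdb_rate = overhead + rows / spark_rate
crossover = 8.0 / (1 / 7e6 - 1 / 1e7)
print(f"{crossover:,.0f} rows")  # roughly 187 million rows for these constants
```

That crossover lands between the 100M-row and 1B-row benchmarks, which matches what the tables show: below it, Spark's fixed overhead dominates; above it, distributed throughput wins.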
Cost Comparison
DuckDB (via DataSpoc Lens)
- Software: $0 (open-source)
- Compute: your laptop (free) or a VM ($50-150/month)
- Storage: S3/GCS/Azure ($0.02/GB/month)

Monthly cost for a 500GB data lake: ~$60

Spark on Databricks

- Software: Databricks DBU pricing ($0.15-0.75/DBU-hour)
- Compute: cluster instances (e.g., 3x i3.xlarge = $0.94/hour)
- Storage: S3/GCS/Azure ($0.02/GB/month)
- Cluster uptime: even "serverless" has startup cost

Monthly cost for a 500GB data lake with moderate query load:

- Cluster: $200-800/month (depending on uptime)
- DBUs: $150-500/month
- Storage: $10/month
- Total: ~$360-1,310/month

Spark on EMR

- EMR fee: $0.015-0.27/hour per instance
- EC2 instances: 3x m5.xlarge = $0.576/hour
- Storage: $10/month

Monthly cost (8 hours/day usage): ~$300-500

When DuckDB Wins
1. Data Fits in Memory (< 100GB)
DuckDB processes data in-memory on a single machine. For datasets under 100GB (which covers the vast majority of company data lakes), it is faster than Spark because there is no cluster overhead.
```python
from dataspoc_lens import LensClient

lens = LensClient()

# 50GB dataset? DuckDB handles it fine on a 64GB machine
df = lens.query("""
    SELECT
        category,
        DATE_TRUNC('week', event_date) AS week,
        COUNT(*) AS events,
        COUNT(DISTINCT user_id) AS users
    FROM curated_events
    GROUP BY category, week
    ORDER BY week DESC, events DESC
""")
```

2. Ad-Hoc Exploration and Development
When you are exploring data, writing queries iteratively, and checking results, DuckDB’s sub-second response time changes how you work:
```python
# Exploration loop — instant feedback
lens.query("SELECT * FROM curated_sales LIMIT 5")                    # 0.1s
lens.query("SELECT DISTINCT region FROM curated_sales")              # 0.2s
lens.query("SELECT region, COUNT(*) FROM curated_sales GROUP BY 1")  # 0.3s

# vs. Spark: 8s, 8s, 9s for the same queries
```

3. CI/CD and Testing
Running analytical queries in CI pipelines requires fast startup and zero infrastructure:
```yaml
# GitHub Actions — test your SQL transformations
- name: Test data transformations
  run: |
    pip install dataspoc-lens
    python -c "
    from dataspoc_lens import LensClient
    lens = LensClient()
    result = lens.query('SELECT COUNT(*) as cnt FROM curated_sales')
    assert result['cnt'].iloc[0] > 0, 'Sales table is empty'
    "
```

4. AI Agent Integration
DuckDB starts instantly, which is critical for AI agents that need to execute multiple queries in a conversation:
```python
# Agent loop: 5 queries in sequence
# DuckDB: 0.3s + 0.2s + 0.5s + 0.3s + 0.4s = 1.7s total
# Spark:  8s + 8s + 9s + 8s + 9s = 42s total

# Users will not wait 42 seconds for each agent response
```

When Spark Wins
1. Data Exceeds Single-Machine Memory (> 200GB)
When your dataset is larger than what fits in RAM, Spark distributes the work across a cluster:
```python
# 2TB dataset — Spark distributes across 20 nodes
spark.sql("""
    SELECT user_id, SUM(amount) AS lifetime_value
    FROM massive_events  -- 2TB, 10 billion rows
    GROUP BY user_id
    HAVING lifetime_value > 1000
""").write.parquet("s3://bucket/gold/high_value_users/")
```

2. Production ETL Pipelines
Spark’s fault tolerance (RDD lineage, speculative execution, task retry) makes it reliable for scheduled production jobs:
```python
# Daily pipeline that processes yesterday's data
# If a node dies mid-job, Spark retries the failed tasks
spark.sql("""
    INSERT OVERWRITE TABLE gold_daily_metrics
    SELECT DATE(event_time) AS day, COUNT(*) AS events
    FROM raw_events
    WHERE DATE(event_time) = CURRENT_DATE - 1
    GROUP BY day
""")
```

3. Machine Learning at Scale
Spark MLlib handles distributed training on massive datasets:
```python
from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.feature import VectorAssembler

# Train on 1 billion rows — distributed across the cluster
assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
rf = RandomForestClassifier(numTrees=100, maxDepth=10)
pipeline = Pipeline(stages=[assembler, rf])
model = pipeline.fit(training_data)  # distributed training
```

4. Streaming
Spark Structured Streaming handles real-time data ingestion and processing:
```python
stream = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "kafka:9092") \
    .option("subscribe", "events") \
    .load()

# Process in real time, write to the data lake
stream.writeStream \
    .format("parquet") \
    .option("path", "s3://bucket/raw/events/") \
    .option("checkpointLocation", "s3://bucket/checkpoints/events/") \
    .start()
```

DuckDB does not do streaming. It is a batch query engine.
DataSpoc’s Bet: DuckDB for 90% of Teams
Most companies have data lakes under 100GB. Most analytical queries are ad-hoc: someone asks a question, an analyst writes SQL, the answer comes back. For this workload, DuckDB is strictly better than Spark:
- Faster for datasets under 100GB
- Cheaper by 10-50x (no cluster to run)
- Simpler to operate (no JVM, no YARN, no cluster config)
- Better for AI agents (instant startup, sub-second queries)
```python
from dataspoc_lens import LensClient

lens = LensClient()

# This is all you need. No cluster. No JVM. No config.
df = lens.query("SELECT region, SUM(amount) FROM curated_sales GROUP BY region")
print(df)
```

If you outgrow DuckDB — your lake passes 200GB, you need streaming, you need distributed ML — Spark is there. The data is still Parquet in a cloud bucket. Switching query engines does not require migrating data.
That is the beauty of the open format: Parquet does not care what reads it. Start with DuckDB. Graduate to Spark if and when you need it. Most teams never will.
Quick Decision Framework
Is your data lake > 200GB?
→ Yes: Consider Spark
→ No: Use DuckDB (DataSpoc Lens)

Do you need real-time streaming?
→ Yes: Spark Structured Streaming
→ No: DuckDB

Are you building production ETL that runs 24/7?
→ Yes: Spark (fault tolerance matters)
→ No: DuckDB

Are you doing ad-hoc analysis or AI agent queries?
→ Yes: DuckDB (speed matters)
→ No: Either works

What is your infrastructure budget?
→ $0-100/month: DuckDB (no cluster needed)
→ $500+/month: Spark is viable

For 90% of data teams, the answer is DuckDB. DataSpoc Lens makes it effortless.
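The framework above can be encoded directly. This is a sketch whose thresholds simply mirror the numbers in this post:

```python
def choose_engine(lake_gb, needs_streaming=False, prod_etl_24x7=False):
    """Encode the decision framework above: >200GB, streaming, or
    always-on fault-tolerant ETL push toward Spark; everything else
    defaults to DuckDB."""
    if needs_streaming or prod_etl_24x7 or lake_gb > 200:
        return "Spark"
    return "DuckDB"

print(choose_engine(50))                        # -> DuckDB
print(choose_engine(500))                       # -> Spark
print(choose_engine(50, needs_streaming=True))  # -> Spark
```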