DuckDB vs Spark for Data Lake Queries: When Each Wins
DataSpoc Lens uses DuckDB to query Parquet files in cloud storage. Databricks uses Spark. Both can query data lakes, but they are built for fundamentally different scales and use cases. This post gives you the data to choose the right one.
The Core Difference
DuckDB is an in-process analytical database. It runs inside your Python process, your CLI, or your laptop. Zero infrastructure. One process. One machine.
Spark is a distributed computation engine. It runs across a cluster of machines. Requires infrastructure: a cluster manager (YARN, Kubernetes, or Databricks), driver nodes, worker nodes, configuration.
DuckDB: Your laptop → DuckDB (in-process) → Parquet in S3 → Results
Spark: Your laptop → Spark Driver → Cluster Manager → N Worker Nodes → Parquet in S3 → Shuffle → Results

Setup Comparison
DuckDB (via DataSpoc Lens)
```bash
pip install dataspoc-lens
```

```python
from dataspoc_lens import LensClient

lens = LensClient()
df = lens.query("SELECT region, SUM(amount) FROM curated_sales GROUP BY region")
print(df)
```

Time to first query: under 30 seconds (install + run).
Spark (via PySpark)
```bash
pip install pyspark
# Also need: Java 8+, Hadoop config, S3 credentials config
```

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("data-lake-query") \
    .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:3.3.4") \
    .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") \
    .config("spark.hadoop.fs.s3a.aws.credentials.provider", "com.amazonaws.auth.DefaultAWSCredentialsProviderChain") \
    .config("spark.driver.memory", "4g") \
    .config("spark.sql.parquet.enableVectorizedReader", "true") \
    .getOrCreate()

df = spark.read.parquet("s3a://my-data-lake/curated/sales/")
df.createOrReplaceTempView("curated_sales")

result = spark.sql("SELECT region, SUM(amount) FROM curated_sales GROUP BY region")
result.show()
```

Time to first query: 5-15 minutes (install Java, configure Hadoop, debug S3 auth, wait for Spark startup).
On Databricks, setup is faster (managed service), but you are paying $0.15-0.75 per DBU-hour from the moment you start a cluster.
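To see how those DBU charges accumulate, here is a rough cost sketch in Python. All rates and the cluster shape are illustrative assumptions based on the ranges quoted above, not actual pricing:

```python
# Rough Databricks cost model: DBU charges plus underlying EC2 instances.
# Every rate here is an illustrative assumption, not a quote.
def monthly_databricks_cost(hours_per_day, nodes=3, dbu_per_node_hour=1.0,
                            dbu_rate=0.40, ec2_rate_per_node=0.31):
    hours = hours_per_day * 30
    dbu_cost = hours * nodes * dbu_per_node_hour * dbu_rate
    ec2_cost = hours * nodes * ec2_rate_per_node
    return round(dbu_cost + ec2_cost, 2)

# A 3-node cluster used 8 hours a day, every day:
print(monthly_databricks_cost(8))  # -> 511.2, i.e. ~$500/month before storage
```

Tweak the rates for your region and instance types; the point is that the meter runs for every hour the cluster is up, whether or not queries are running.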
Benchmark: Same Queries, Different Scales
We ran 5 queries across 4 dataset sizes on Parquet files in S3. DuckDB ran on a single machine (8 cores, 32GB RAM). Spark ran on a 3-node cluster (8 cores, 32GB RAM each = 24 cores, 96GB total).
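Timings like the ones below can be reproduced with a small wall-clock harness. This is a sketch, not the exact benchmark script; `run_query` stands in for either engine's query call:

```python
import time

def time_query(run_query, repeats=3):
    """Run a query several times and report the best wall-clock time.

    `run_query` is any zero-argument callable, e.g.
    lambda: lens.query(sql) or lambda: spark.sql(sql).collect().
    Taking the best of several runs reduces noise from caching and JIT warmup.
    """
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        run_query()
        best = min(best, time.perf_counter() - start)
    return best

# Example with a trivial stand-in workload:
elapsed = time_query(lambda: sum(range(1_000_000)))
print(f"{elapsed:.3f}s")
```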
Query 1: Simple Aggregation
```sql
SELECT region, SUM(amount) AS total, COUNT(*) AS orders
FROM sales
GROUP BY region
ORDER BY total DESC
```

| Dataset Size | DuckDB | Spark | Winner |
|---|---|---|---|
| 1M rows (200MB) | 0.3s | 8.2s | DuckDB (27x) |
| 10M rows (2GB) | 1.8s | 9.5s | DuckDB (5x) |
| 100M rows (20GB) | 14s | 18s | DuckDB (1.3x) |
| 1B rows (200GB) | OOM | 45s | Spark |
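The OOM row is predictable in advance. A very rough heuristic, sketched below, is to compare on-disk Parquet size against a small multiple of RAM. The 2x factor is an assumption: DuckDB streams scans and can spill some operators to disk, so the real limit depends heavily on the query shape.

```python
def duckdb_likely_ok(parquet_gb, ram_gb, factor=2.0):
    """Very rough heuristic: DuckDB streams scans, so data can exceed RAM,
    but joins and aggregations with large intermediate state start to
    struggle once on-disk size reaches a small multiple of RAM.

    `factor` is an assumption, not a DuckDB guarantee.
    """
    return parquet_gb <= ram_gb * factor

print(duckdb_likely_ok(20, 32))   # the 100M-row, 20GB case on a 32GB machine
print(duckdb_likely_ok(200, 32))  # the 1B-row, 200GB case that OOMs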
Query 2: Join + Aggregation
```sql
SELECT
    c.segment,
    DATE_TRUNC('month', s.sale_date) AS month,
    SUM(s.amount) AS revenue,
    COUNT(DISTINCT s.customer_id) AS customers
FROM sales s
JOIN customers c ON s.customer_id = c.customer_id
GROUP BY c.segment, month
ORDER BY month, revenue DESC
```

| Dataset Size | DuckDB | Spark | Winner |
|---|---|---|---|
| 1M rows | 0.5s | 12s | DuckDB (24x) |
| 10M rows | 3.2s | 14s | DuckDB (4x) |
| 100M rows | 28s | 22s | Spark (1.3x) |
| 1B rows | OOM | 68s | Spark |
Query 3: Window Function
```sql
SELECT
    customer_id,
    sale_date,
    amount,
    SUM(amount) OVER (PARTITION BY customer_id ORDER BY sale_date) AS running_total,
    ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY sale_date DESC) AS recency_rank
FROM sales
WHERE sale_date >= '2025-01-01'
```

| Dataset Size | DuckDB | Spark | Winner |
|---|---|---|---|
| 1M rows | 0.8s | 15s | DuckDB (19x) |
| 10M rows | 6.5s | 18s | DuckDB (3x) |
| 100M rows | 52s | 35s | Spark (1.5x) |
| 1B rows | OOM | 120s | Spark |
Query 4: Full Table Scan with Complex Filter
```sql
SELECT *
FROM sales
WHERE amount > 100
  AND region IN ('NA', 'EU')
  AND sale_date BETWEEN '2025-06-01' AND '2025-12-31'
  AND product LIKE '%Pro%'
ORDER BY amount DESC
LIMIT 1000
```

| Dataset Size | DuckDB | Spark | Winner |
|---|---|---|---|
| 1M rows | 0.2s | 7s | DuckDB (35x) |
| 10M rows | 0.9s | 8s | DuckDB (9x) |
| 100M rows | 6s | 12s | DuckDB (2x) |
| 1B rows | 45s | 15s | Spark (3x) |
Query 5: Ad-Hoc Exploration
```sql
-- "What does this data look like?"
SELECT * FROM sales LIMIT 100
```

| Dataset Size | DuckDB | Spark | Winner |
|---|---|---|---|
| Any size | 0.1s | 5-8s | DuckDB (50-80x) |
Spark’s overhead is constant: cluster initialization, job scheduling, task distribution. For ad-hoc exploration, DuckDB is always faster because there is no cluster to coordinate.
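This pattern of constant overhead plus per-row throughput explains the crossovers in the tables above. A toy latency model makes it concrete; all constants here are rough fits to the Query 1 numbers, not measurements:

```python
def duckdb_t(rows, rate=7e6):
    """Toy model: no startup cost, ~7M rows/s single-machine throughput."""
    return rows / rate

def spark_t(rows, overhead=8.0, rate=1e7):
    """Toy model: ~8s fixed cluster/scheduling cost, ~10M rows/s after that."""
    return overhead + rows / rate

# Spark starts to win where the curves cross:
# rows / duckdb_rate = overhead + rows / spark_rate
crossover = 8.0 / (1 / 7e6 - 1 / 1e7)
print(f"{crossover:,.0f} rows")  # roughly 187 million rows for these constants
```

That crossover lands between the 100M-row and 1B-row benchmarks, which matches what the tables show: below it, Spark's fixed overhead dominates; above it, distributed throughput wins.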
Cost Comparison
DuckDB (via DataSpoc Lens)
- Software: $0 (open-source)
- Compute: your laptop (free) or a VM ($50-150/month)
- Storage: S3/GCS/Azure ($0.02/GB/month)

Monthly cost for a 500GB data lake: ~$60

Spark on Databricks

- Software: Databricks DBU pricing ($0.15-0.75/DBU-hour)
- Compute: cluster instances (e.g., 3x i3.xlarge = $0.94/hour)
- Storage: S3/GCS/Azure ($0.02/GB/month)
- Cluster uptime: even "serverless" has startup cost

Monthly cost for a 500GB data lake with moderate query load:

- Cluster: $200-800/month (depending on uptime)
- DBUs: $150-500/month
- Storage: $10/month
- Total: ~$360-1,310/month

Spark on EMR

- EMR fee: $0.015-0.27/hour per instance
- EC2 instances: 3x m5.xlarge = $0.576/hour
- Storage: $10/month

Monthly cost (8 hours/day usage): ~$300-500

When DuckDB Wins
1. Data Fits in Memory (< 100GB)
DuckDB processes data in-memory on a single machine. For datasets under 100GB (which covers the vast majority of company data lakes), it is faster than Spark because there is no cluster overhead.
```python
from dataspoc_lens import LensClient

lens = LensClient()

# 50GB dataset? DuckDB handles it fine on a 64GB machine
df = lens.query("""
    SELECT
        category,
        DATE_TRUNC('week', event_date) AS week,
        COUNT(*) AS events,
        COUNT(DISTINCT user_id) AS users
    FROM curated_events
    GROUP BY category, week
    ORDER BY week DESC, events DESC
""")
```

2. Ad-Hoc Exploration and Development
When you are exploring data, writing queries iteratively, and checking results, DuckDB’s sub-second response time changes how you work:
```python
# Exploration loop — instant feedback
lens.query("SELECT * FROM curated_sales LIMIT 5")                    # 0.1s
lens.query("SELECT DISTINCT region FROM curated_sales")              # 0.2s
lens.query("SELECT region, COUNT(*) FROM curated_sales GROUP BY 1")  # 0.3s

# vs. Spark: 8s, 8s, 9s for the same queries
```

3. CI/CD and Testing
Running analytical queries in CI pipelines requires fast startup and zero infrastructure:
```yaml
# GitHub Actions — test your SQL transformations
- name: Test data transformations
  run: |
    pip install dataspoc-lens
    python -c "
    from dataspoc_lens import LensClient
    lens = LensClient()
    result = lens.query('SELECT COUNT(*) as cnt FROM curated_sales')
    assert result['cnt'].iloc[0] > 0, 'Sales table is empty'
    "
```

4. AI Agent Integration
DuckDB starts instantly, which is critical for AI agents that need to execute multiple queries in a conversation:
```python
# Agent loop: 5 queries in sequence
# DuckDB: 0.3s + 0.2s + 0.5s + 0.3s + 0.4s = 1.7s total
# Spark:  8s + 8s + 9s + 8s + 9s = 42s total

# Users will not wait 42 seconds for each agent response
```

When Spark Wins
1. Data Exceeds Single-Machine Memory (> 200GB)
When your dataset is larger than what fits in RAM, Spark distributes the work across a cluster:
```python
# 2TB dataset — Spark distributes across 20 nodes
spark.sql("""
    SELECT user_id, SUM(amount) AS lifetime_value
    FROM massive_events  -- 2TB, 10 billion rows
    GROUP BY user_id
    HAVING lifetime_value > 1000
""").write.parquet("s3://bucket/gold/high_value_users/")
```

2. Production ETL Pipelines
Spark’s fault tolerance (RDD lineage, speculative execution, task retry) makes it reliable for scheduled production jobs:
```python
# Daily pipeline that processes yesterday's data
# If a node dies mid-job, Spark retries the failed tasks
spark.sql("""
    INSERT OVERWRITE TABLE gold_daily_metrics
    SELECT DATE(event_time) AS day, COUNT(*) AS events
    FROM raw_events
    WHERE DATE(event_time) = CURRENT_DATE - 1
    GROUP BY day
""")
```

3. Machine Learning at Scale
Spark MLlib handles distributed training on massive datasets:
```python
from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.feature import VectorAssembler

# Train on 1 billion rows — distributed across the cluster
assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
rf = RandomForestClassifier(numTrees=100, maxDepth=10)
pipeline = Pipeline(stages=[assembler, rf])
model = pipeline.fit(training_data)  # distributed training
```

4. Streaming
Spark Structured Streaming handles real-time data ingestion and processing:
```python
stream = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "kafka:9092") \
    .option("subscribe", "events") \
    .load()

# Process in real time, write to the data lake
stream.writeStream \
    .format("parquet") \
    .option("path", "s3://bucket/raw/events/") \
    .option("checkpointLocation", "s3://bucket/checkpoints/events/") \
    .start()
```

DuckDB does not do streaming. It is a batch query engine.
DataSpoc’s Bet: DuckDB for 90% of Teams
Most companies have data lakes under 100GB. Most analytical queries are ad-hoc: someone asks a question, an analyst writes SQL, the answer comes back. For this workload, DuckDB is strictly better than Spark:
- Faster for datasets under 100GB
- Cheaper by 10-50x (no cluster to run)
- Simpler to operate (no JVM, no YARN, no cluster config)
- Better for AI agents (instant startup, sub-second queries)
```python
from dataspoc_lens import LensClient

lens = LensClient()

# This is all you need. No cluster. No JVM. No config.
df = lens.query("SELECT region, SUM(amount) FROM curated_sales GROUP BY region")
print(df)
```

If you outgrow DuckDB — your lake passes 200GB, you need streaming, you need distributed ML — Spark is there. The data is still Parquet in a cloud bucket. Switching query engines does not require migrating data.
That is the beauty of the open format: Parquet does not care what reads it. Start with DuckDB. Graduate to Spark if and when you need it. Most teams never will.
Quick Decision Framework
Is your data lake > 200GB?
→ Yes: Consider Spark
→ No: Use DuckDB (DataSpoc Lens)

Do you need real-time streaming?
→ Yes: Spark Structured Streaming
→ No: DuckDB

Are you building production ETL that runs 24/7?
→ Yes: Spark (fault tolerance matters)
→ No: DuckDB

Are you doing ad-hoc analysis or AI agent queries?
→ Yes: DuckDB (speed matters)
→ No: Either works

What is your infrastructure budget?
→ $0-100/month: DuckDB (no cluster needed)
→ $500+/month: Spark is viable

For 90% of data teams, the answer is DuckDB. DataSpoc Lens makes it effortless.
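The framework above can be encoded directly. This is a sketch whose thresholds simply mirror the numbers in this post:

```python
def choose_engine(lake_gb, needs_streaming=False, prod_etl_24x7=False):
    """Encode the decision framework above: >200GB, streaming, or
    always-on fault-tolerant ETL push toward Spark; everything else
    defaults to DuckDB."""
    if needs_streaming or prod_etl_24x7 or lake_gb > 200:
        return "Spark"
    return "DuckDB"

print(choose_engine(50))                        # -> DuckDB
print(choose_engine(500))                       # -> Spark
print(choose_engine(50, needs_streaming=True))  # -> Spark
```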