jupyter · notebook · duckdb · data-science · python

Jupyter Notebook on Your Data Lake in 60 Seconds

Michael San Martim · 2026-04-27

You want to explore your data lake in a Jupyter notebook. Here is what that normally looks like:

  1. Install JupyterLab
  2. Install DuckDB and the Python bindings
  3. Figure out which S3 bucket has your data
  4. Install boto3 and configure AWS credentials
  5. Write the DuckDB S3 configuration boilerplate
  6. Manually discover what tables exist
  7. Write the Parquet read statements for each table
  8. Hope you got the paths right

That is 20 minutes of setup before you write a single analytical query. Every time.
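For the record, steps 4 through 8 alone look something like this (a sketch; the region and paths are placeholders you would have to discover yourself):

import os
import duckdb

# Steps 4-5: credentials and the DuckDB S3 boilerplate, by hand
conn = duckdb.connect()
conn.execute("INSTALL httpfs; LOAD httpfs;")
conn.execute("SET s3_region = 'us-east-1';")
conn.execute(f"SET s3_access_key_id = '{os.environ['AWS_ACCESS_KEY_ID']}';")
conn.execute(f"SET s3_secret_access_key = '{os.environ['AWS_SECRET_ACCESS_KEY']}';")

# Steps 6-8: guess a path per table and hope you got it right
conn.sql(
    "SELECT * FROM read_parquet('s3://company-data-lake/raw/postgres/orders/*.parquet') LIMIT 5"
).show()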

The DataSpoc Lens Way

Terminal window
dataspoc-lens add-bucket company s3://company-data-lake
dataspoc-lens notebook

That is it. JupyterLab opens in your browser with every table in your data lake already mounted as a DuckDB view. SQL magic works out of the box. The conn variable is ready for pandas.

What Happens Under the Hood

When you run dataspoc-lens notebook, Lens:

  1. Reads the manifest from s3://company-data-lake/.dataspoc/manifest.json
  2. Creates a DuckDB connection with S3 credentials from your environment
  3. Mounts every Parquet path as a named view
  4. Generates a startup script (~/.dataspoc/jupyter_startup.py)
  5. Launches JupyterLab with the startup script pre-loaded
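The manifest from step 1 is a small index mapping table names to Parquet globs. Its exact schema is Lens-internal, but conceptually it carries something like the structure below (field names are illustrative), which makes step 3 a simple loop:

import duckdb

conn = duckdb.connect()

# Illustrative manifest shape (assumption: the real schema is Lens-internal)
manifest = {
    "tables": [
        {"name": "raw__postgres__orders",
         "path": "s3://company-data-lake/raw/postgres/orders/*.parquet"},
        {"name": "curated__sales__customers",
         "path": "s3://company-data-lake/curated/sales/customers/*.parquet"},
    ]
}

# Step 3: one CREATE VIEW per manifest entry
for table in manifest["tables"]:
    conn.execute(
        f"CREATE VIEW {table['name']} AS "
        f"SELECT * FROM read_parquet('{table['path']}')"
    )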

The startup script looks like this (auto-generated, you never edit it):

# ~/.dataspoc/jupyter_startup.py (auto-generated)
import duckdb
conn = duckdb.connect()
# S3 configuration (from environment)
conn.execute("INSTALL httpfs; LOAD httpfs;")
conn.execute("SET s3_region = 'us-east-1';")
# Mount all tables from manifest
conn.execute("""
CREATE VIEW raw__postgres__orders AS
SELECT * FROM read_parquet('s3://company-data-lake/raw/postgres/orders/*.parquet')
""")
conn.execute("""
CREATE VIEW raw__postgres__customers AS
SELECT * FROM read_parquet('s3://company-data-lake/raw/postgres/customers/*.parquet')
""")
conn.execute("""
CREATE VIEW curated__sales__customers AS
SELECT * FROM read_parquet('s3://company-data-lake/curated/sales/customers/*.parquet')
""")
# ... one view per table in your lake
print(f"DataSpoc Lens: {len(conn.execute('SHOW TABLES').fetchall())} tables mounted")
print("Use 'conn' for queries or %%sql magic for inline SQL")

Example Notebook Cells

Cell 1: See What is Available

# All your tables are already here
conn.sql("SHOW TABLES").show()
┌───────────────────────────────┐
│             name              │
├───────────────────────────────┤
│ raw__postgres__orders         │
│ raw__postgres__customers      │
│ raw__postgres__order_items    │
│ curated__sales__customers     │
│ raw__google_sheets__campaigns │
└───────────────────────────────┘

Cell 2: SQL Magic

%%sql
SELECT
    DATE_TRUNC('month', created_at) as month,
    COUNT(*) as orders,
    SUM(amount) as revenue
FROM raw__postgres__orders
WHERE created_at >= '2026-01-01'
GROUP BY month
ORDER BY month
┌────────────┬────────┬─────────┐
│   month    │ orders │ revenue │
├────────────┼────────┼─────────┤
│ 2026-01-01 │  3,421 │ 284,500 │
│ 2026-02-01 │  3,892 │ 312,100 │
│ 2026-03-01 │  4,156 │ 341,800 │
│ 2026-04-01 │  2,847 │ 228,400 │
└────────────┴────────┴─────────┘

Cell 3: Pandas DataFrame

df = conn.sql("""
SELECT
customer_id,
COUNT(*) as order_count,
SUM(amount) as total_spend,
MAX(created_at) as last_order
FROM raw__postgres__orders
GROUP BY customer_id
""").df()
df.describe()
order_count total_spend
count 3201.000000 3201.000000
mean 14.340000 832.450000
std 8.230000 456.780000
min 1.000000 12.990000
max 89.000000 8945.000000

Cell 4: Matplotlib Chart

import matplotlib.pyplot as plt

monthly = conn.sql("""
    SELECT
        DATE_TRUNC('month', created_at) as month,
        SUM(amount) as revenue
    FROM raw__postgres__orders
    WHERE created_at >= '2025-01-01'
    GROUP BY month
    ORDER BY month
""").df()
plt.figure(figsize=(12, 5))
plt.bar(monthly["month"].dt.strftime("%b %Y"), monthly["revenue"])
plt.title("Monthly Revenue")
plt.ylabel("Revenue ($)")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

Cell 5: Join Multiple Sources

%%sql
SELECT
    c.channel,
    COUNT(DISTINCT o.customer_id) as customers,
    SUM(o.amount) as revenue,
    ROUND(SUM(o.amount) / COUNT(DISTINCT o.customer_id), 2) as avg_revenue_per_customer
FROM raw__postgres__orders o
JOIN raw__postgres__customers cu ON o.customer_id = cu.id
JOIN raw__google_sheets__campaigns c ON cu.utm_source = c.channel
GROUP BY c.channel
ORDER BY revenue DESC

The Marimo Alternative

Prefer reactive notebooks? Lens also supports Marimo:

Terminal window
dataspoc-lens notebook --runtime marimo

This launches a Marimo notebook with the same pre-mounted tables:

import marimo as mo

# conn is pre-configured by the same startup script:
# a DuckDB connection with every table already mounted as a view

# Marimo's built-in SQL support
result = mo.sql("""
    SELECT channel, SUM(amount) as revenue
    FROM raw__postgres__orders
    GROUP BY channel
""", connection=conn)

# Reactive: change the query and dependent cells update automatically
mo.ui.table(result)

Marimo advantages:

  • Reactive execution (cells re-run when dependencies change)
  • Built-in UI widgets (sliders, dropdowns)
  • Reproducible (no hidden state)
  • Exports to Python scripts
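As a taste of the reactive model, here is a minimal sketch combining a widget with the pre-mounted views, reusing the connection= pattern from above (cell boundaries shown as comments):

import marimo as mo

# Cell 1: a slider; moving it re-runs every cell that reads min_amount
min_amount = mo.ui.slider(0, 500, value=100, label="Minimum order amount")
min_amount

# Cell 2: reads min_amount.value, so it re-executes on every slider change
result = mo.sql(f"""
    SELECT customer_id, COUNT(*) as orders, SUM(amount) as total_spend
    FROM raw__postgres__orders
    WHERE amount >= {min_amount.value}
    GROUP BY customer_id
""", connection=conn)
mo.ui.table(result)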

Docker Demo: Full Self-Contained Environment

Want to try it without installing anything? Use the Docker demo:

Terminal window
docker run -it --rm \
  -p 8888:8888 \
  -e AWS_ACCESS_KEY_ID \
  -e AWS_SECRET_ACCESS_KEY \
  -e AWS_DEFAULT_REGION \
  dataspoc/lens-notebook:latest \
  --bucket s3://company-data-lake

This gives you:

  • JupyterLab running on http://localhost:8888
  • DuckDB with all tables from your bucket mounted
  • %%sql magic pre-configured
  • pandas, matplotlib, seaborn, plotly pre-installed
  • conn variable ready to use

Docker Compose for Team Use

docker-compose.yaml
version: "3.8"
services:
notebook:
image: dataspoc/lens-notebook:latest
ports:
- "8888:8888"
environment:
- AWS_ACCESS_KEY_ID
- AWS_SECRET_ACCESS_KEY
- AWS_DEFAULT_REGION
- DATASPOC_BUCKET=s3://company-data-lake
volumes:
- ./notebooks:/home/jovyan/notebooks
Terminal window
docker compose up

Share the docker-compose.yaml with your team. Everyone gets the same environment, same tables, same SQL views. No “works on my machine” issues.

Tips

Multiple buckets:

Terminal window
dataspoc-lens add-bucket sales s3://company-sales
dataspoc-lens add-bucket marketing s3://company-marketing
dataspoc-lens notebook
# Both buckets' tables are mounted

Refresh tables without restarting:

# In a notebook cell
from dataspoc_lens import LensClient
lens = LensClient()
lens.refresh() # Re-reads manifest, mounts new tables

Use with VS Code Jupyter:

Terminal window
dataspoc-lens notebook --no-browser --ip 0.0.0.0
# Then connect VS Code to the running Jupyter server

The 60-Second Summary

Step             Command                                        Time
Connect bucket   dataspoc-lens add-bucket company s3://...      2s
Launch notebook  dataspoc-lens notebook                         5s
Start querying   %%sql SELECT ...                               0s (tables are already there)

No boto3. No credential configuration. No Parquet path guessing. No DuckDB setup boilerplate.

Just open the notebook and write SQL.
