jupyter · notebook · duckdb · data-science · python

Jupyter Notebook on Your Data Lake in 60 Seconds

Michael San Martim · 2026-04-27

You want to explore your data lake in a Jupyter notebook. Here is what that normally looks like:

  1. Install JupyterLab
  2. Install DuckDB and the Python bindings
  3. Figure out which S3 bucket has your data
  4. Install boto3 and configure AWS credentials
  5. Write the DuckDB S3 configuration boilerplate
  6. Manually discover what tables exist
  7. Write the Parquet read statements for each table
  8. Hope you got the paths right

That is 20 minutes of setup before you write a single analytical query. Every time.
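For the record, steps 4 through 8 alone look something like this (a sketch; the region and paths are placeholders you would have to discover yourself):

import os
import duckdb

# Steps 4-5: credentials and the DuckDB S3 boilerplate, by hand
conn = duckdb.connect()
conn.execute("INSTALL httpfs; LOAD httpfs;")
conn.execute("SET s3_region = 'us-east-1';")
conn.execute(f"SET s3_access_key_id = '{os.environ['AWS_ACCESS_KEY_ID']}';")
conn.execute(f"SET s3_secret_access_key = '{os.environ['AWS_SECRET_ACCESS_KEY']}';")

# Steps 6-8: guess a path per table and hope you got it right
conn.sql(
    "SELECT * FROM read_parquet('s3://company-data-lake/raw/postgres/orders/*.parquet') LIMIT 5"
).show()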

The DataSpoc Lens Way

Terminal window
dataspoc-lens add-bucket company s3://company-data-lake
dataspoc-lens notebook

That is it. JupyterLab opens in your browser with every table in your data lake already mounted as a DuckDB view. SQL magic works out of the box. The conn variable is ready for pandas.

What Happens Under the Hood

When you run dataspoc-lens notebook, Lens:

  1. Reads the manifest from s3://company-data-lake/.dataspoc/manifest.json
  2. Creates a DuckDB connection with S3 credentials from your environment
  3. Mounts every Parquet path as a named view
  4. Generates a startup script (~/.dataspoc/jupyter_startup.py)
  5. Launches JupyterLab with the startup script pre-loaded
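The manifest from step 1 is a small index mapping table names to Parquet globs. Its exact schema is Lens-internal, but conceptually it carries something like the structure below (field names are illustrative), which makes step 3 a simple loop:

import duckdb

conn = duckdb.connect()

# Illustrative manifest shape (assumption: the real schema is Lens-internal)
manifest = {
    "tables": [
        {"name": "raw__postgres__orders",
         "path": "s3://company-data-lake/raw/postgres/orders/*.parquet"},
        {"name": "curated__sales__customers",
         "path": "s3://company-data-lake/curated/sales/customers/*.parquet"},
    ]
}

# Step 3: one CREATE VIEW per manifest entry
for table in manifest["tables"]:
    conn.execute(
        f"CREATE VIEW {table['name']} AS "
        f"SELECT * FROM read_parquet('{table['path']}')"
    )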

The startup script looks like this (auto-generated, you never edit it):

# ~/.dataspoc/jupyter_startup.py (auto-generated)
import duckdb
conn = duckdb.connect()
# S3 configuration (from environment)
conn.execute("INSTALL httpfs; LOAD httpfs;")
conn.execute("SET s3_region = 'us-east-1';")
# Mount all tables from manifest
conn.execute("""
CREATE VIEW raw__postgres__orders AS
SELECT * FROM read_parquet('s3://company-data-lake/raw/postgres/orders/*.parquet')
""")
conn.execute("""
CREATE VIEW raw__postgres__customers AS
SELECT * FROM read_parquet('s3://company-data-lake/raw/postgres/customers/*.parquet')
""")
conn.execute("""
CREATE VIEW curated__sales__customers AS
SELECT * FROM read_parquet('s3://company-data-lake/curated/sales/customers/*.parquet')
""")
# ... one view per table in your lake
print(f"DataSpoc Lens: {len(conn.execute('SHOW TABLES').fetchall())} tables mounted")
print("Use 'conn' for queries or %%sql magic for inline SQL")

Example Notebook Cells

Cell 1: See What is Available

# All your tables are already here
conn.sql("SHOW TABLES").show()
┌───────────────────────────────┐
│             name              │
├───────────────────────────────┤
│ raw__postgres__orders         │
│ raw__postgres__customers      │
│ raw__postgres__order_items    │
│ curated__sales__customers     │
│ raw__google_sheets__campaigns │
└───────────────────────────────┘

Cell 2: SQL Magic

%%sql
SELECT
    DATE_TRUNC('month', created_at) as month,
    COUNT(*) as orders,
    SUM(amount) as revenue
FROM raw__postgres__orders
WHERE created_at >= '2026-01-01'
GROUP BY month
ORDER BY month
┌────────────┬────────┬─────────┐
│   month    │ orders │ revenue │
├────────────┼────────┼─────────┤
│ 2026-01-01 │  3,421 │ 284,500 │
│ 2026-02-01 │  3,892 │ 312,100 │
│ 2026-03-01 │  4,156 │ 341,800 │
│ 2026-04-01 │  2,847 │ 228,400 │
└────────────┴────────┴─────────┘

Cell 3: Pandas DataFrame

df = conn.sql("""
SELECT
customer_id,
COUNT(*) as order_count,
SUM(amount) as total_spend,
MAX(created_at) as last_order
FROM raw__postgres__orders
GROUP BY customer_id
""").df()
df.describe()
order_count total_spend
count 3201.000000 3201.000000
mean 14.340000 832.450000
std 8.230000 456.780000
min 1.000000 12.990000
max 89.000000 8945.000000

Cell 4: Matplotlib Chart

import matplotlib.pyplot as plt

monthly = conn.sql("""
    SELECT
        DATE_TRUNC('month', created_at) as month,
        SUM(amount) as revenue
    FROM raw__postgres__orders
    WHERE created_at >= '2025-01-01'
    GROUP BY month
    ORDER BY month
""").df()
plt.figure(figsize=(12, 5))
plt.bar(monthly["month"].dt.strftime("%b %Y"), monthly["revenue"])
plt.title("Monthly Revenue")
plt.ylabel("Revenue ($)")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

Cell 5: Join Multiple Sources

%%sql
SELECT
    c.channel,
    COUNT(DISTINCT o.customer_id) as customers,
    SUM(o.amount) as revenue,
    ROUND(SUM(o.amount) / COUNT(DISTINCT o.customer_id), 2) as avg_revenue_per_customer
FROM raw__postgres__orders o
JOIN raw__postgres__customers cu ON o.customer_id = cu.id
JOIN raw__google_sheets__campaigns c ON cu.utm_source = c.channel
GROUP BY c.channel
ORDER BY revenue DESC

The Marimo Alternative

Prefer reactive notebooks? Lens also supports Marimo:

Terminal window
dataspoc-lens notebook --runtime marimo

This launches a Marimo notebook with the same pre-mounted tables:

import marimo as mo

# conn is pre-configured by the same startup script:
# a DuckDB connection with every table already mounted as a view

# Marimo's built-in SQL support
result = mo.sql("""
    SELECT channel, SUM(amount) as revenue
    FROM raw__postgres__orders
    GROUP BY channel
""", connection=conn)

# Reactive: change the query and dependent cells update automatically
mo.ui.table(result)

Marimo advantages:

  • Reactive execution (cells re-run when dependencies change)
  • Built-in UI widgets (sliders, dropdowns)
  • Reproducible (no hidden state)
  • Exports to Python scripts
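As a taste of the reactive model, here is a minimal sketch combining a widget with the pre-mounted views, reusing the connection= pattern from above (cell boundaries shown as comments):

import marimo as mo

# Cell 1: a slider; moving it re-runs every cell that reads min_amount
min_amount = mo.ui.slider(0, 500, value=100, label="Minimum order amount")
min_amount

# Cell 2: reads min_amount.value, so it re-executes on every slider change
result = mo.sql(f"""
    SELECT customer_id, COUNT(*) as orders, SUM(amount) as total_spend
    FROM raw__postgres__orders
    WHERE amount >= {min_amount.value}
    GROUP BY customer_id
""", connection=conn)
mo.ui.table(result)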

Docker Demo: Full Self-Contained Environment

Want to try it without installing anything? Use the Docker demo:

Terminal window
docker run -it --rm \
  -p 8888:8888 \
  -e AWS_ACCESS_KEY_ID \
  -e AWS_SECRET_ACCESS_KEY \
  -e AWS_DEFAULT_REGION \
  dataspoc/lens-notebook:latest \
  --bucket s3://company-data-lake

This gives you:

  • JupyterLab running on http://localhost:8888
  • DuckDB with all tables from your bucket mounted
  • %%sql magic pre-configured
  • pandas, matplotlib, seaborn, plotly pre-installed
  • conn variable ready to use

Docker Compose for Team Use

docker-compose.yaml
version: "3.8"
services:
notebook:
image: dataspoc/lens-notebook:latest
ports:
- "8888:8888"
environment:
- AWS_ACCESS_KEY_ID
- AWS_SECRET_ACCESS_KEY
- AWS_DEFAULT_REGION
- DATASPOC_BUCKET=s3://company-data-lake
volumes:
- ./notebooks:/home/jovyan/notebooks
Terminal window
docker compose up

Share the docker-compose.yaml with your team. Everyone gets the same environment, same tables, same SQL views. No “works on my machine” issues.

Tips

Multiple buckets:

Terminal window
dataspoc-lens add-bucket sales s3://company-sales
dataspoc-lens add-bucket marketing s3://company-marketing
dataspoc-lens notebook
# Both buckets' tables are mounted

Refresh tables without restarting:

# In a notebook cell
from dataspoc_lens import LensClient
lens = LensClient()
lens.refresh() # Re-reads manifest, mounts new tables

Use with VS Code Jupyter:

Terminal window
dataspoc-lens notebook --no-browser --ip 0.0.0.0
# Then connect VS Code to the running Jupyter server

The 60-Second Summary

Step             Command                                        Time
Connect bucket   dataspoc-lens add-bucket company s3://...      2s
Launch notebook  dataspoc-lens notebook                         5s
Start querying   %%sql SELECT ...                               0s (tables are already there)

No boto3. No credential configuration. No Parquet path guessing. No DuckDB setup boilerplate.

Just open the notebook and write SQL.
