Jupyter Notebook on Your Data Lake in 60 Seconds
You want to explore your data lake in a Jupyter notebook. Here is what that normally looks like:
- Install JupyterLab
- Install DuckDB and the Python bindings
- Figure out which S3 bucket has your data
- Install boto3 and configure AWS credentials
- Write the DuckDB S3 configuration boilerplate
- Manually discover what tables exist
- Write the Parquet read statements for each table
- Hope you got the paths right
That is 20 minutes of setup before you write a single analytical query. Every time.
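For contrast, here is roughly what that checklist turns into when written by hand. This is a sketch, not anyone's exact setup: the region, credentials, and paths are placeholders you would have to hunt down yourself.

```python
# The boilerplate Lens replaces. Every value here -- region, credentials,
# bucket layout, table paths -- is something you must discover by hand.
import duckdb

conn = duckdb.connect()
conn.execute("INSTALL httpfs; LOAD httpfs;")
conn.execute("SET s3_region = 'us-east-1';")
conn.execute("SET s3_access_key_id = '...';")
conn.execute("SET s3_secret_access_key = '...';")

# One read_parquet per table, assuming you guessed the path correctly
orders = conn.sql(
    "SELECT * FROM read_parquet('s3://company-data-lake/raw/postgres/orders/*.parquet')"
)
```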
The DataSpoc Lens Way
```bash
dataspoc-lens add-bucket company s3://company-data-lake
dataspoc-lens notebook
```

That is it. JupyterLab opens in your browser with every table in your data lake already mounted as a DuckDB view. SQL magic works out of the box. The `conn` variable is ready for pandas.
What Happens Under the Hood
When you run `dataspoc-lens notebook`, Lens:
- Reads the manifest from `s3://company-data-lake/.dataspoc/manifest.json`
- Creates a DuckDB connection with S3 credentials from your environment
- Mounts every Parquet path as a named view (sketched below)
- Generates a startup script (`~/.dataspoc/jupyter_startup.py`)
- Launches JupyterLab with the startup script pre-loaded
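Conceptually, the mounting step is just a loop over manifest entries. The manifest schema is internal to Lens and not shown in this post, so the `tables`, `name`, and `path` keys below are assumptions for illustration:

```python
# Illustrative sketch only -- the real manifest schema is Lens-internal;
# the "tables"/"name"/"path" keys here are assumed, not documented.
import json
import duckdb

with open("manifest.json") as f:  # in reality fetched from the bucket's .dataspoc/ prefix
    manifest = json.load(f)

conn = duckdb.connect()
conn.execute("INSTALL httpfs; LOAD httpfs;")

for table in manifest["tables"]:
    # e.g. name="raw__postgres__orders", path="s3://company-data-lake/raw/postgres/orders"
    conn.execute(
        f"CREATE VIEW {table['name']} AS "
        f"SELECT * FROM read_parquet('{table['path']}/*.parquet')"
    )
```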
The startup script looks like this (auto-generated, you never edit it):
```python
# ~/.dataspoc/jupyter_startup.py (auto-generated)
import duckdb

conn = duckdb.connect()

# S3 configuration (from environment)
conn.execute("INSTALL httpfs; LOAD httpfs;")
conn.execute("SET s3_region = 'us-east-1';")

# Mount all tables from manifest
conn.execute("""
    CREATE VIEW raw__postgres__orders AS
    SELECT * FROM read_parquet('s3://company-data-lake/raw/postgres/orders/*.parquet')
""")
conn.execute("""
    CREATE VIEW raw__postgres__customers AS
    SELECT * FROM read_parquet('s3://company-data-lake/raw/postgres/customers/*.parquet')
""")
conn.execute("""
    CREATE VIEW curated__sales__customers AS
    SELECT * FROM read_parquet('s3://company-data-lake/curated/sales/customers/*.parquet')
""")
# ... one view per table in your lake

print(f"DataSpoc Lens: {len(conn.execute('SHOW TABLES').fetchall())} tables mounted")
print("Use 'conn' for queries or %%sql magic for inline SQL")
```

Example Notebook Cells
Cell 1: See What is Available
```python
# All your tables are already here
conn.sql("SHOW TABLES").show()
```

```
┌───────────────────────────────┐
│ name                          │
├───────────────────────────────┤
│ raw__postgres__orders         │
│ raw__postgres__customers      │
│ raw__postgres__order_items    │
│ curated__sales__customers     │
│ raw__google_sheets__campaigns │
└───────────────────────────────┘
```

Cell 2: SQL Magic
```sql
%%sql
SELECT
    DATE_TRUNC('month', created_at) AS month,
    COUNT(*) AS orders,
    SUM(amount) AS revenue
FROM raw__postgres__orders
WHERE created_at >= '2026-01-01'
GROUP BY month
ORDER BY month
```

```
┌────────────┬────────┬─────────┐
│ month      │ orders │ revenue │
├────────────┼────────┼─────────┤
│ 2026-01-01 │  3,421 │ 284,500 │
│ 2026-02-01 │  3,892 │ 312,100 │
│ 2026-03-01 │  4,156 │ 341,800 │
│ 2026-04-01 │  2,847 │ 228,400 │
└────────────┴────────┴─────────┘
```

Cell 3: Pandas DataFrame
```python
df = conn.sql("""
    SELECT
        customer_id,
        COUNT(*) AS order_count,
        SUM(amount) AS total_spend,
        MAX(created_at) AS last_order
    FROM raw__postgres__orders
    GROUP BY customer_id
""").df()

df.describe()
```

```
       order_count  total_spend
count  3201.000000  3201.000000
mean     14.340000   832.450000
std       8.230000   456.780000
min       1.000000    12.990000
max      89.000000  8945.000000
```

Cell 4: Matplotlib Chart
```python
import matplotlib.pyplot as plt

monthly = conn.sql("""
    SELECT
        DATE_TRUNC('month', created_at) AS month,
        SUM(amount) AS revenue
    FROM raw__postgres__orders
    WHERE created_at >= '2025-01-01'
    GROUP BY month
    ORDER BY month
""").df()

plt.figure(figsize=(12, 5))
plt.bar(monthly["month"].dt.strftime("%b %Y"), monthly["revenue"])
plt.title("Monthly Revenue")
plt.ylabel("Revenue ($)")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
```

Cell 5: Join Multiple Sources
```sql
%%sql
SELECT
    c.channel,
    COUNT(DISTINCT o.customer_id) AS customers,
    SUM(o.amount) AS revenue,
    ROUND(SUM(o.amount) / COUNT(DISTINCT o.customer_id), 2) AS avg_revenue_per_customer
FROM raw__postgres__orders o
JOIN raw__postgres__customers cu ON o.customer_id = cu.id
JOIN raw__google_sheets__campaigns c ON cu.utm_source = c.channel
GROUP BY c.channel
ORDER BY revenue DESC
```

The Marimo Alternative
Prefer reactive notebooks? Lens also supports Marimo:
```bash
dataspoc-lens notebook --runtime marimo
```

This launches a Marimo notebook with the same pre-mounted tables:
```python
import marimo as mo
import duckdb

# conn is pre-configured (same startup script)
conn = duckdb.connect()
# ... tables already mounted

# Marimo's built-in SQL support
result = mo.sql("""
    SELECT channel, SUM(amount) AS revenue
    FROM raw__postgres__orders
    GROUP BY channel
""", connection=conn)

# Reactive: change the query, chart updates automatically
mo.ui.table(result)
```

Marimo advantages:
- Reactive execution (cells re-run when dependencies change; see the sketch after this list)
- Built-in UI widgets (sliders, dropdowns)
- Reproducible (no hidden state)
- Exports to Python scripts
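To make the first point concrete, here is a minimal sketch of the reactive pattern, assuming the pre-mounted `conn` from the startup script. The slider bounds and threshold are arbitrary, and each commented "cell" would be its own marimo cell:

```python
import duckdb
import marimo as mo

conn = duckdb.connect()  # stand-in; in Lens the startup script provides conn with views mounted

# Cell 1: a slider widget; displaying it renders the control
min_amount = mo.ui.slider(start=0, stop=1000, value=100, label="Min order amount")
min_amount

# Cell 2: reads min_amount.value, so marimo re-runs this cell automatically
# whenever the slider moves -- no manual re-execution, no stale state
big_orders = conn.sql(f"""
    SELECT customer_id, SUM(amount) AS total_spend
    FROM raw__postgres__orders
    WHERE amount >= {min_amount.value}
    GROUP BY customer_id
""").df()
mo.ui.table(big_orders)
```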
Docker Demo: Full Self-Contained Environment
Want to try it without installing anything? Use the Docker demo:
```bash
docker run -it --rm \
  -p 8888:8888 \
  -e AWS_ACCESS_KEY_ID \
  -e AWS_SECRET_ACCESS_KEY \
  -e AWS_DEFAULT_REGION \
  dataspoc/lens-notebook:latest \
  --bucket s3://company-data-lake
```

This gives you:
- JupyterLab running on `http://localhost:8888`
- DuckDB with all tables from your bucket mounted
- `%%sql` magic pre-configured
- pandas, matplotlib, seaborn, plotly pre-installed
- `conn` variable ready to use (smoke test below)
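A quick way to verify everything wired up once JupyterLab opens. The table name here assumes the example lake from earlier in this post:

```python
# First cell in the Docker demo -- conn is injected by the startup script,
# so there is nothing to configure before querying
tables = conn.sql("SHOW TABLES").fetchall()
print(f"{len(tables)} tables mounted")

# Spot-check one table (name is illustrative)
conn.sql("SELECT COUNT(*) AS n FROM raw__postgres__orders").show()
```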
Docker Compose for Team Use
```yaml
version: "3.8"
services:
  notebook:
    image: dataspoc/lens-notebook:latest
    ports:
      - "8888:8888"
    environment:
      - AWS_ACCESS_KEY_ID
      - AWS_SECRET_ACCESS_KEY
      - AWS_DEFAULT_REGION
      - DATASPOC_BUCKET=s3://company-data-lake
    volumes:
      - ./notebooks:/home/jovyan/notebooks
```

```bash
docker compose up
```

Share the `docker-compose.yaml` with your team. Everyone gets the same environment, same tables, same SQL views. No “works on my machine” issues.
Tips
Multiple buckets:
```bash
dataspoc-lens add-bucket sales s3://company-sales
dataspoc-lens add-bucket marketing s3://company-marketing
dataspoc-lens notebook
# Both buckets' tables are mounted
```

Refresh tables without restarting:
```python
# In a notebook cell
from dataspoc_lens import LensClient

lens = LensClient()
lens.refresh()  # Re-reads manifest, mounts new tables
```

Use with VS Code Jupyter:
```bash
dataspoc-lens notebook --no-browser --ip 0.0.0.0
# Then connect VS Code to the running Jupyter server
```

The 60-Second Summary
| Step | Command | Time |
|---|---|---|
| Connect bucket | `dataspoc-lens add-bucket company s3://...` | 2s |
| Launch notebook | `dataspoc-lens notebook` | 5s |
| Start querying | `%%sql SELECT ...` | 0s (tables are already there) |
No boto3. No credential configuration. No Parquet path guessing. No DuckDB setup boilerplate.
Just open the notebook and write SQL.