Reactive Data Analysis with Marimo and DataSpoc Lens
Jupyter notebooks have a dirty secret: cells can run in any order. You can have df defined in cell 10, used in cell 3, and overwritten in cell 7. The state is invisible. Bugs are guaranteed.
Marimo fixes this. It is a reactive notebook where cells automatically re-execute when their dependencies change. Change a filter in cell 2, and every cell that depends on it updates instantly. Pair it with DataSpoc Lens and you get a live, reactive interface to your data lake.
What Is Marimo?
Marimo is a Python notebook where:
- Cells auto-execute when you change an upstream variable
- No hidden state — the execution order is determined by data flow, not cell position
- Pure Python — notebooks are
.pyfiles, not JSON blobs - Built-in UI elements — sliders, dropdowns, tables that are reactive by default
Think of it as a spreadsheet for Python: change a cell, everything downstream updates.
Quick Start
Launch Marimo with DataSpoc Lens:
dataspoc-lens notebook --marimoThis starts a Marimo server pre-configured with a DuckDB connection to your data lake. All your buckets and tables are available immediately.
Alternatively, install and launch manually:
pip install marimo dataspoc-lensmarimo editCell 1: Connect to Your Data Lake
import marimo as mofrom dataspoc_lens import LensClient
lens = LensClient()tables = lens.tables()mo.md(f"**Connected.** {len(tables)} tables available.")Output: Connected. 12 tables available.
Cell 2: Interactive Table Selector
table_dropdown = mo.ui.dropdown( options=tables, label="Select a table",)table_dropdownThis renders a dropdown in the notebook. When you select a different table, every cell that references table_dropdown re-executes automatically.
Cell 3: Show Schema (Reactive)
# This cell re-runs whenever table_dropdown.value changesselected = table_dropdown.valueif selected: schema = lens.schema(selected) schema_table = mo.ui.table( [{"Column": c["name"], "Type": c["type"]} for c in schema], label=f"Schema: {selected}", ) schema_tableSelect “raw.orders” from the dropdown — this cell instantly shows the columns. Select “raw.customers” — it updates. No re-run button needed.
Cell 4: Dynamic Query with Filters
# Build a query based on the selected tableif selected: limit_slider = mo.ui.slider( start=10, stop=1000, step=10, value=100, label="Row limit", ) mo.hstack([limit_slider])# This cell depends on both selected and limit_sliderif selected: sql = f"SELECT * FROM {selected} LIMIT {limit_slider.value}" result = lens.query(sql) mo.ui.table(result.to_pandas())Move the slider from 100 to 500 — the query re-runs and the table updates. The reactive graph looks like:
table_dropdown → selected → sql → result → table display ↑ limit_slider ──┘Cell 5: Aggregation with Date Range
if selected == "raw.orders": date_range = mo.ui.date_range( start="2026-01-01", stop="2026-04-26", label="Date range", ) date_rangeif selected == "raw.orders" and date_range.value: start, end = date_range.value agg_sql = f""" SELECT DATE_TRUNC('week', order_date) as week, COUNT(*) as orders, SUM(amount) as revenue, ROUND(AVG(amount), 2) as avg_order FROM raw.orders WHERE order_date BETWEEN DATE '{start}' AND DATE '{end}' GROUP BY 1 ORDER BY 1 """ agg_result = lens.query(agg_sql) df = agg_result.to_pandas()
# Reactive chart import altair as alt chart = alt.Chart(df).mark_bar().encode( x="week:T", y="revenue:Q", tooltip=["week", "orders", "revenue", "avg_order"], ).properties(width=600, height=300)
mo.vstack([ mo.ui.table(df), chart, ])Change the date range picker — the aggregation query re-runs, the table updates, the chart redraws. All automatic.
The Programmatic connect() API
For scripts and automation, use connect() to get a DuckDB connection directly:
import duckdbfrom dataspoc_lens import LensClient
lens = LensClient()conn = lens.connect() # Returns a duckdb.DuckDBPyConnection
# Use standard DuckDB APIresult = conn.execute(""" SELECT plan, COUNT(*) as users FROM raw.customers GROUP BY plan ORDER BY users DESC""").fetchdf()
print(result)This is useful in Marimo cells when you want raw DuckDB access:
# Marimo cell using connect()conn = lens.connect()
# Complex analytical queryfunnel = conn.execute(""" WITH signups AS ( SELECT user_id, MIN(created_at) as signup_date FROM raw.events WHERE event_type = 'signup' GROUP BY user_id ), activations AS ( SELECT user_id, MIN(timestamp) as activation_date FROM raw.events WHERE event_type = 'first_purchase' GROUP BY user_id ) SELECT DATE_TRUNC('week', s.signup_date) as cohort_week, COUNT(DISTINCT s.user_id) as signups, COUNT(DISTINCT a.user_id) as activated, ROUND(100.0 * COUNT(DISTINCT a.user_id) / COUNT(DISTINCT s.user_id), 1) as activation_rate FROM signups s LEFT JOIN activations a ON s.user_id = a.user_id GROUP BY 1 ORDER BY 1""").fetchdf()
mo.ui.table(funnel)Marimo vs. Jupyter: A Real Comparison
Consider this workflow: explore revenue by region, then drill into the top region.
In Jupyter:
- Cell 1: Load data — Run
- Cell 2: Group by region — Run
- Cell 3: Plot results — Run
- Cell 4: Filter to top region — Run
- Oh wait, you want to change the date range in Cell 1
- Re-run Cell 1 — manually
- Re-run Cell 2 — manually
- Re-run Cell 3 — manually
- Re-run Cell 4 — manually
- Did you forget a cell? Is the state stale? Who knows.
In Marimo:
- Cell 1: Load data with date picker — it runs
- Cell 2: Group by region — auto-runs
- Cell 3: Plot results — auto-runs
- Cell 4: Filter to top region — auto-runs
- Change the date picker in Cell 1
- Cells 2, 3, 4 all update automatically
- State is always consistent
| Feature | Jupyter | Marimo |
|---|---|---|
| Execution order | Manual (any order) | Automatic (data flow) |
| Hidden state | Yes (constant problem) | No (impossible by design) |
| UI widgets | ipywidgets (manual wiring) | Built-in, reactive |
| File format | JSON (bad diffs) | Pure Python (clean diffs) |
| Reproducibility | Run All and hope | Guaranteed by design |
| DataSpoc integration | dataspoc-lens notebook | dataspoc-lens notebook --marimo |
Tips for Effective Reactive Notebooks
- One output per cell — Marimo cells return their last expression. Keep cells focused.
- Use UI elements for parameters —
mo.ui.slider,mo.ui.dropdown,mo.ui.textare all reactive. Use them instead of hardcoded values. - Cache heavy queries — use
lens.cache_refresh(table)before exploration sessions to avoid repeated cloud reads. - Name variables clearly — since Marimo tracks dependencies by variable name, clear names make the reactive graph easier to follow.
- Use
mo.stop()— to prevent a cell from running until a condition is met:
mo.stop(not table_dropdown.value, "Select a table above to continue.")# Code below only runs when a table is selectedMarimo makes data exploration feel like using a dashboard, but with the full power of Python and SQL. Combined with DataSpoc Lens, you get a reactive interface to terabytes of Parquet data in the cloud — no infrastructure, no stale state, no manual re-runs.