
Reactive Data Analysis with Marimo and DataSpoc Lens

Michael San Martim · 2026-04-26

Jupyter notebooks have a dirty secret: cells can run in any order. You can have df defined in cell 10, used in cell 3, and overwritten in cell 7. The state is invisible. Bugs are guaranteed.

Marimo fixes this. It is a reactive notebook where cells automatically re-execute when their dependencies change. Change a filter in cell 2, and every cell that depends on it updates instantly. Pair it with DataSpoc Lens and you get a live, reactive interface to your data lake.

What Is Marimo?

Marimo is a Python notebook where:

  • Cells auto-execute when you change an upstream variable
  • No hidden state — the execution order is determined by data flow, not cell position
  • Pure Python — notebooks are .py files, not JSON blobs
  • Built-in UI elements — sliders, dropdowns, tables that are reactive by default

Think of it as a spreadsheet for Python: change a cell, everything downstream updates.
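The spreadsheet analogy can be made concrete with a toy dependency tracker in plain Python. This is a deliberately simplified sketch of the idea, not marimo's actual engine (marimo derives the dependency graph by statically analyzing each cell's code; there are no manual reads/writes declarations):

```python
# Toy sketch of reactive execution (not marimo's real engine):
# each "cell" declares which variables it reads and writes, and
# changing a variable re-runs every cell that reads it.
class ToyNotebook:
    def __init__(self):
        self.values = {}   # variable name -> current value
        self.cells = []    # (reads, writes, fn) triples

    def cell(self, reads, writes, fn):
        self.cells.append((reads, writes, fn))
        self._run((reads, writes, fn))

    def set(self, name, value):
        self.values[name] = value
        # Re-run dependents; updates cascade via _run -> set
        for c in self.cells:
            if name in c[0]:
                self._run(c)

    def _run(self, cell):
        reads, writes, fn = cell
        result = fn(*(self.values[r] for r in reads))
        self.set(writes[0], result)

nb = ToyNotebook()
nb.set("limit", 100)
nb.cell(["limit"], ["rows"], lambda limit: list(range(limit)))
nb.cell(["rows"], ["total"], lambda rows: sum(rows))
print(nb.values["total"])  # sum(range(100)) = 4950

nb.set("limit", 10)        # everything downstream recomputes
print(nb.values["total"])  # sum(range(10)) = 45
```

Change `limit` once and both downstream "cells" recompute, exactly like changing a spreadsheet input cell.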

Quick Start

Launch Marimo with DataSpoc Lens:

dataspoc-lens notebook --marimo

This starts a Marimo server pre-configured with a DuckDB connection to your data lake. All your buckets and tables are available immediately.

Alternatively, install and launch manually:

pip install marimo dataspoc-lens
marimo edit

Cell 1: Connect to Your Data Lake

import marimo as mo
from dataspoc_lens import LensClient
lens = LensClient()
tables = lens.tables()
mo.md(f"**Connected.** {len(tables)} tables available.")

Output: Connected. 12 tables available.

Cell 2: Interactive Table Selector

table_dropdown = mo.ui.dropdown(
    options=tables,
    label="Select a table",
)
table_dropdown

This renders a dropdown in the notebook. When you select a different table, every cell that references table_dropdown re-executes automatically.

Cell 3: Show Schema (Reactive)

# This cell re-runs whenever table_dropdown.value changes
selected = table_dropdown.value
schema_table = None  # avoids a NameError before a table is chosen
if selected:
    schema = lens.schema(selected)
    schema_table = mo.ui.table(
        [{"Column": c["name"], "Type": c["type"]} for c in schema],
        label=f"Schema: {selected}",
    )
schema_table

Select “raw.orders” from the dropdown — this cell instantly shows the columns. Select “raw.customers” — it updates. No re-run button needed.

Cell 4: Dynamic Query with Filters

# Build a query based on the selected table
if selected:
    limit_slider = mo.ui.slider(
        start=10, stop=1000, step=10, value=100,
        label="Row limit",
    )
    mo.hstack([limit_slider])

# This cell depends on both selected and limit_slider
if selected:
    sql = f"SELECT * FROM {selected} LIMIT {limit_slider.value}"
    result = lens.query(sql)
    mo.ui.table(result.to_pandas())

Move the slider from 100 to 500 — the query re-runs and the table updates. The reactive graph looks like:

table_dropdown → selected ─┐
                           ├→ sql → result → table display
limit_slider ──────────────┘

Cell 5: Aggregation with Date Range

if selected == "raw.orders":
    date_range = mo.ui.date_range(
        start="2026-01-01",
        stop="2026-04-26",
        label="Date range",
    )
    date_range

if selected == "raw.orders" and date_range.value:
    start, end = date_range.value
    agg_sql = f"""
        SELECT
            DATE_TRUNC('week', order_date) as week,
            COUNT(*) as orders,
            SUM(amount) as revenue,
            ROUND(AVG(amount), 2) as avg_order
        FROM raw.orders
        WHERE order_date BETWEEN DATE '{start}' AND DATE '{end}'
        GROUP BY 1
        ORDER BY 1
    """
    agg_result = lens.query(agg_sql)
    df = agg_result.to_pandas()

    # Reactive chart
    import altair as alt
    chart = alt.Chart(df).mark_bar().encode(
        x="week:T",
        y="revenue:Q",
        tooltip=["week", "orders", "revenue", "avg_order"],
    ).properties(width=600, height=300)

    mo.vstack([
        mo.ui.table(df),
        chart,
    ])

Change the date range picker — the aggregation query re-runs, the table updates, the chart redraws. All automatic.

The Programmatic connect() API

For scripts and automation, use connect() to get a DuckDB connection directly:

import duckdb
from dataspoc_lens import LensClient

lens = LensClient()
conn = lens.connect()  # Returns a duckdb.DuckDBPyConnection

# Use standard DuckDB API
result = conn.execute("""
    SELECT plan, COUNT(*) as users
    FROM raw.customers
    GROUP BY plan
    ORDER BY users DESC
""").fetchdf()
print(result)

This is useful in Marimo cells when you want raw DuckDB access:

# Marimo cell using connect()
conn = lens.connect()

# Complex analytical query
funnel = conn.execute("""
    WITH signups AS (
        SELECT user_id, MIN(created_at) as signup_date
        FROM raw.events
        WHERE event_type = 'signup'
        GROUP BY user_id
    ),
    activations AS (
        SELECT user_id, MIN(timestamp) as activation_date
        FROM raw.events
        WHERE event_type = 'first_purchase'
        GROUP BY user_id
    )
    SELECT
        DATE_TRUNC('week', s.signup_date) as cohort_week,
        COUNT(DISTINCT s.user_id) as signups,
        COUNT(DISTINCT a.user_id) as activated,
        ROUND(100.0 * COUNT(DISTINCT a.user_id) / COUNT(DISTINCT s.user_id), 1) as activation_rate
    FROM signups s
    LEFT JOIN activations a ON s.user_id = a.user_id
    GROUP BY 1
    ORDER BY 1
""").fetchdf()
mo.ui.table(funnel)

Marimo vs. Jupyter: A Real Comparison

Consider this workflow: explore revenue by region, then drill into the top region.

In Jupyter:

  1. Cell 1: Load data — Run
  2. Cell 2: Group by region — Run
  3. Cell 3: Plot results — Run
  4. Cell 4: Filter to top region — Run
  5. Oh wait, you want to change the date range in Cell 1
  6. Re-run Cell 1 — manually
  7. Re-run Cell 2 — manually
  8. Re-run Cell 3 — manually
  9. Re-run Cell 4 — manually
  10. Did you forget a cell? Is the state stale? Who knows.

In Marimo:

  1. Cell 1: Load data with date picker — it runs
  2. Cell 2: Group by region — auto-runs
  3. Cell 3: Plot results — auto-runs
  4. Cell 4: Filter to top region — auto-runs
  5. Change the date picker in Cell 1
  6. Cells 2, 3, 4 all update automatically
  7. State is always consistent

| Feature | Jupyter | Marimo |
|---|---|---|
| Execution order | Manual (any order) | Automatic (data flow) |
| Hidden state | Yes (constant problem) | No (impossible by design) |
| UI widgets | ipywidgets (manual wiring) | Built-in, reactive |
| File format | JSON (bad diffs) | Pure Python (clean diffs) |
| Reproducibility | Run All and hope | Guaranteed by design |
| DataSpoc integration | dataspoc-lens notebook | dataspoc-lens notebook --marimo |

Tips for Effective Reactive Notebooks

  1. One output per cell — Marimo cells return their last expression. Keep cells focused.
  2. Use UI elements for parameters — mo.ui.slider, mo.ui.dropdown, and mo.ui.text are all reactive. Use them instead of hardcoded values.
  3. Cache heavy queries — use lens.cache_refresh(table) before exploration sessions to avoid repeated cloud reads.
  4. Name variables clearly — since Marimo tracks dependencies by variable name, clear names make the reactive graph easier to follow.
  5. Use mo.stop() — halt a cell until a condition is met:
mo.stop(not table_dropdown.value, "Select a table above to continue.")
# Code below only runs when a table is selected

Marimo makes data exploration feel like using a dashboard, but with the full power of Python and SQL. Combined with DataSpoc Lens, you get a reactive interface to terabytes of Parquet data in the cloud — no infrastructure, no stale state, no manual re-runs.
