marimonotebookreactivedata-analysispython

Reactive Data Analysis with Marimo and DataSpoc Lens

Michael San Martim · 2026-04-26

Jupyter notebooks have a dirty secret: cells can run in any order. You can have df defined in cell 10, used in cell 3, and overwritten in cell 7. The state is invisible. Bugs are guaranteed.

Marimo fixes this. It is a reactive notebook where cells automatically re-execute when their dependencies change. Change a filter in cell 2, and every cell that depends on it updates instantly. Pair it with DataSpoc Lens and you get a live, reactive interface to your data lake.

What Is Marimo?

Marimo is a Python notebook where:

Cells auto-execute when you change an upstream variable
No hidden state — the execution order is determined by data flow, not cell position
Pure Python — notebooks are .py files, not JSON blobs
Built-in UI elements — sliders, dropdowns, tables that are reactive by default

Think of it as a spreadsheet for Python: change a cell, everything downstream updates.

Quick Start

Launch Marimo with DataSpoc Lens:

dataspoc-lens notebook --marimo

This starts a Marimo server pre-configured with a DuckDB connection to your data lake. All your buckets and tables are available immediately.

Alternatively, install and launch manually:

pip install marimo dataspoc-lens
marimo edit

Cell 1: Connect to Your Data Lake

import marimo as mo
from dataspoc_lens import LensClient

lens = LensClient()
tables = lens.tables()
mo.md(f"**Connected.** {len(tables)} tables available.")

Output: Connected. 12 tables available.

Cell 2: Interactive Table Selector

table_dropdown = mo.ui.dropdown(
    options=tables,
    label="Select a table",
)
table_dropdown

This renders a dropdown in the notebook. When you select a different table, every cell that references table_dropdown re-executes automatically.

Cell 3: Show Schema (Reactive)

# This cell re-runs whenever table_dropdown.value changes
selected = table_dropdown.value
if selected:
    schema = lens.schema(selected)
    schema_table = mo.ui.table(
        [{"Column": c["name"], "Type": c["type"]} for c in schema],
        label=f"Schema: {selected}",
    )
    schema_table

Select “raw.orders” from the dropdown — this cell instantly shows the columns. Select “raw.customers” — it updates. No re-run button needed.

Cell 4: Dynamic Query with Filters

# Build a query based on the selected table
if selected:
    limit_slider = mo.ui.slider(
        start=10, stop=1000, step=10, value=100,
        label="Row limit",
    )
    mo.hstack([limit_slider])

# This cell depends on both selected and limit_slider
if selected:
    sql = f"SELECT * FROM {selected} LIMIT {limit_slider.value}"
    result = lens.query(sql)
    mo.ui.table(result.to_pandas())

Move the slider from 100 to 500 — the query re-runs and the table updates. The reactive graph looks like:

table_dropdown → selected → sql → result → table display
                              ↑
               limit_slider ──┘

Cell 5: Aggregation with Date Range

if selected == "raw.orders":
    date_range = mo.ui.date_range(
        start="2026-01-01",
        stop="2026-04-26",
        label="Date range",
    )
    date_range

if selected == "raw.orders" and date_range.value:
    start, end = date_range.value
    agg_sql = f"""
        SELECT
            DATE_TRUNC('week', order_date) as week,
            COUNT(*) as orders,
            SUM(amount) as revenue,
            ROUND(AVG(amount), 2) as avg_order
        FROM raw.orders
        WHERE order_date BETWEEN DATE '{start}' AND DATE '{end}'
        GROUP BY 1
        ORDER BY 1
    """
    agg_result = lens.query(agg_sql)
    df = agg_result.to_pandas()

    # Reactive chart
    import altair as alt
    chart = alt.Chart(df).mark_bar().encode(
        x="week:T",
        y="revenue:Q",
        tooltip=["week", "orders", "revenue", "avg_order"],
    ).properties(width=600, height=300)

    mo.vstack([
        mo.ui.table(df),
        chart,
    ])

Change the date range picker — the aggregation query re-runs, the table updates, the chart redraws. All automatic.

The Programmatic connect() API

For scripts and automation, use connect() to get a DuckDB connection directly:

import duckdb
from dataspoc_lens import LensClient

lens = LensClient()
conn = lens.connect()  # Returns a duckdb.DuckDBPyConnection

# Use standard DuckDB API
result = conn.execute("""
    SELECT plan, COUNT(*) as users
    FROM raw.customers
    GROUP BY plan
    ORDER BY users DESC
""").fetchdf()

print(result)

This is useful in Marimo cells when you want raw DuckDB access:

# Marimo cell using connect()
conn = lens.connect()

# Complex analytical query
funnel = conn.execute("""
    WITH signups AS (
        SELECT user_id, MIN(created_at) as signup_date
        FROM raw.events
        WHERE event_type = 'signup'
        GROUP BY user_id
    ),
    activations AS (
        SELECT user_id, MIN(timestamp) as activation_date
        FROM raw.events
        WHERE event_type = 'first_purchase'
        GROUP BY user_id
    )
    SELECT
        DATE_TRUNC('week', s.signup_date) as cohort_week,
        COUNT(DISTINCT s.user_id) as signups,
        COUNT(DISTINCT a.user_id) as activated,
        ROUND(100.0 * COUNT(DISTINCT a.user_id) / COUNT(DISTINCT s.user_id), 1) as activation_rate
    FROM signups s
    LEFT JOIN activations a ON s.user_id = a.user_id
    GROUP BY 1
    ORDER BY 1
""").fetchdf()

mo.ui.table(funnel)

Marimo vs. Jupyter: A Real Comparison

Consider this workflow: explore revenue by region, then drill into the top region.

In Jupyter:

Cell 1: Load data — Run
Cell 2: Group by region — Run
Cell 3: Plot results — Run
Cell 4: Filter to top region — Run
Oh wait, you want to change the date range in Cell 1
Re-run Cell 1 — manually
Re-run Cell 2 — manually
Re-run Cell 3 — manually
Re-run Cell 4 — manually
Did you forget a cell? Is the state stale? Who knows.

In Marimo:

Cell 1: Load data with date picker — it runs
Cell 2: Group by region — auto-runs
Cell 3: Plot results — auto-runs
Cell 4: Filter to top region — auto-runs
Change the date picker in Cell 1
Cells 2, 3, 4 all update automatically
State is always consistent

Feature	Jupyter	Marimo
Execution order	Manual (any order)	Automatic (data flow)
Hidden state	Yes (constant problem)	No (impossible by design)
UI widgets	ipywidgets (manual wiring)	Built-in, reactive
File format	JSON (bad diffs)	Pure Python (clean diffs)
Reproducibility	Run All and hope	Guaranteed by design
DataSpoc integration	`dataspoc-lens notebook`	`dataspoc-lens notebook --marimo`

Tips for Effective Reactive Notebooks

One output per cell — Marimo cells return their last expression. Keep cells focused.
Use UI elements for parameters — mo.ui.slider, mo.ui.dropdown, mo.ui.text are all reactive. Use them instead of hardcoded values.
Cache heavy queries — use lens.cache_refresh(table) before exploration sessions to avoid repeated cloud reads.
Name variables clearly — since Marimo tracks dependencies by variable name, clear names make the reactive graph easier to follow.
Use mo.stop() — to prevent a cell from running until a condition is met:

mo.stop(not table_dropdown.value, "Select a table above to continue.")
# Code below only runs when a table is selected

Marimo makes data exploration feel like using a dashboard, but with the full power of Python and SQL. Combined with DataSpoc Lens, you get a reactive interface to terabytes of Parquet data in the cloud — no infrastructure, no stale state, no manual re-runs.

← Back to blog