medallion · data-lake · architecture · databricks · data-engineering · tutorial

Building a Medallion Architecture Data Lake with DataSpoc

Michael San Martim · 2026-04-29

The medallion architecture (Bronze → Silver → Gold) is one of the most widely used patterns for organizing a data lake. Databricks popularized it, but you don’t need Databricks to implement it.

With DataSpoc Pipe and Lens, you can build a full medallion lake on S3 using just pip install — no Spark, no cluster, no $50k/year license.

What is the Medallion Architecture?

Three layers, each with a clear purpose:

┌──────────────────────────────────────────────────────────────┐
│                                                              │
│  Sources ──→   Bronze   ──→    Silver   ──→     Gold         │
│   (raw)      (ingested)      (cleaned)    (business-ready)   │
│                                                              │
│              Pipe writes    Pipe writes /   Lens transforms  │
│                             Lens transforms                  │
│                                                              │
└──────────────────────────────────────────────────────────────┘
Layer   | Also called            | Who writes             | Who reads            | Quality
--------|------------------------|------------------------|----------------------|----------------------------------
Bronze  | Raw                    | Pipe (ingest)          | Data Engineers       | As-is from source
Silver  | Curated / Clean        | Pipe (transforms)      | Analysts, Engineers  | Cleaned, typed, deduplicated
Gold    | Aggregated / Business  | Lens (SQL transforms)  | Everyone, AI agents  | Business metrics, ready to query

The Bucket Structure

DataSpoc’s bucket convention maps directly to medallion:

s3://company-lake/
  .dataspoc/
    manifest.json                      # Catalog (auto-updated)
    state/<pipeline>/state.json        # Incremental bookmarks
    logs/<pipeline>/<timestamp>.json   # Execution logs
  raw/                                 # ← BRONZE
    postgres/
      orders/dt=2026-04-28/orders_0000.parquet
      customers/dt=2026-04-28/customers_0000.parquet
    stripe/
      payments/dt=2026-04-28/payments_0000.parquet
    hubspot/
      contacts/dt=2026-04-28/contacts_0000.parquet
  curated/                             # ← SILVER
    finance/
      clean_orders/dt=2026-04-28/clean_orders_0000.parquet
      clean_customers/dt=2026-04-28/clean_customers_0000.parquet
    marketing/
      clean_contacts/dt=2026-04-28/clean_contacts_0000.parquet
  gold/                                # ← GOLD
    finance/
      monthly_revenue/monthly_revenue_0000.parquet
      customer_360/customer_360_0000.parquet
    executive/
      kpi_dashboard/kpi_dashboard_0000.parquet
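
Because every layer is just Parquet under a predictable prefix, any Parquet-aware engine can read these paths directly, without going through Lens. Here's a minimal sketch using the DuckDB Python package (it assumes a recent DuckDB with the httpfs extension and AWS credentials available in your environment; the bucket and paths are the ones above):

# Read a Bronze table straight off S3 with DuckDB -- no Lens, no Spark.
# hive_partitioning=true turns the dt=2026-04-28 folders into a `dt` column.
import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs; LOAD httpfs;")
con.execute("CREATE OR REPLACE SECRET aws (TYPE s3, PROVIDER credential_chain);")

orders_by_day = con.execute("""
    SELECT dt, COUNT(*) AS row_count
    FROM read_parquet(
        's3://company-lake/raw/postgres/orders/*/*.parquet',
        hive_partitioning = true
    )
    GROUP BY dt
    ORDER BY dt DESC
""").fetchdf()
print(orders_by_day)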

Step 1: Bronze Layer — Ingest with Pipe

Bronze is raw data, as-is from the source. Pipe handles this with zero transformation.

Terminal window
pip install dataspoc-pipe[s3]
dataspoc-pipe init

Add your sources

Terminal window
# PostgreSQL production database
dataspoc-pipe add postgres-prod
# Stripe payments
dataspoc-pipe add stripe-payments
# HubSpot CRM
dataspoc-pipe add hubspot-crm

Pipeline configs

~/.dataspoc-pipe/pipelines/postgres-prod.yaml:

source:
  tap: tap-postgres
  config: ~/.dataspoc-pipe/sources/postgres-prod.json
  streams:
    - orders
    - customers
    - products
destination:
  bucket: s3://company-lake
  path: raw
  compression: zstd
incremental:
  enabled: true
schedule:
  cron: "0 */6 * * *"

~/.dataspoc-pipe/sources/postgres-prod.json:

{
  "host": "db.company.com",
  "port": 5432,
  "user": "dataspoc_reader",
  "dbname": "production",
  "filter_schemas": ["public"]
}

Run and schedule

Terminal window
# Run all pipelines
dataspoc-pipe run _ --all
# Install cron schedules
dataspoc-pipe schedule install
# Check status
dataspoc-pipe status

Result: Raw data lands in s3://company-lake/raw/<source>/<table>/ as Parquet. This is your Bronze layer.
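
If you want to double-check the drop outside of dataspoc-pipe status, listing the Bronze prefix with any S3 client works. A quick sketch with boto3 (bucket and prefix as above, credentials from your environment):

# List what Pipe has written under the Bronze prefix for one table.
import boto3

s3 = boto3.client("s3")
resp = s3.list_objects_v2(Bucket="company-lake", Prefix="raw/postgres/orders/")
for obj in resp.get("Contents", []):
    print(f"{obj['Key']}  ({obj['Size']:,} bytes)")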

Step 2: Silver Layer — Clean with Pipe Transforms

Silver is cleaned, typed, deduplicated data. Pipe’s convention-based transforms handle this during ingestion.

Create transform files

~/.dataspoc-pipe/transforms/postgres-prod.py:

"""Transform raw Postgres data during ingestion."""
def transform(df):
"""Called per batch during extraction. Receives a pandas DataFrame."""
# Standardize email to lowercase
if "email" in df.columns:
df["email"] = df["email"].str.lower().str.strip()
# Remove test/internal records
if "email" in df.columns:
df = df[~df["email"].str.endswith("@test.com")]
# Parse dates (some come as strings)
for col in ["created_at", "updated_at"]:
if col in df.columns:
df[col] = pd.to_datetime(df[col], errors="coerce")
# Drop duplicates by primary key
if "id" in df.columns:
df = df.drop_duplicates(subset=["id"], keep="last")
# Remove null IDs
if "id" in df.columns:
df = df.dropna(subset=["id"])
return df
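
Because transform is just a function over a DataFrame, you can sanity-check it locally before wiring it into a pipeline. A small, self-contained example (the toy rows are made up for illustration; run it with the function above in scope):

# Quick local check of the transform with a toy batch.
import pandas as pd

batch = pd.DataFrame({
    "id": [1, 1, 2, None],
    "email": ["  Alice@Example.COM ", "alice@example.com", "bob@test.com", "c@d.com"],
    "created_at": ["2026-04-01", "2026-04-02", "not a date", "2026-04-03"],
})

cleaned = transform(batch)
print(cleaned)
# Expect: the @test.com row dropped, the duplicate id 1 collapsed (last wins),
# the null id removed, emails lowercased, created_at parsed to datetime.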

Now point the destination at curated/ so the cleaned output lands in the Silver layer. A second pipeline config handles this:

~/.dataspoc-pipe/pipelines/postgres-prod-clean.yaml:

source:
  tap: tap-postgres
  config: ~/.dataspoc-pipe/sources/postgres-prod.json
  streams:
    - orders
    - customers
destination:
  bucket: s3://company-lake
  path: curated/finance
  compression: zstd
incremental:
  enabled: true
schedule:
  cron: "30 */6 * * *" # 30 min after bronze
Terminal window
dataspoc-pipe run postgres-prod-clean

Result: Clean data lands in s3://company-lake/curated/finance/<table>/. This is your Silver layer.
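
A quick sanity check on Silver is to compare row counts against Bronze; the difference should roughly match the test rows, duplicates, and null IDs the transform filtered out. A sketch with plain DuckDB over the two prefixes (same credential setup as earlier; adjust the table directory names to what actually landed in your bucket):

# Compare Bronze vs Silver row counts for one table.
import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs; LOAD httpfs;")
con.execute("CREATE OR REPLACE SECRET aws (TYPE s3, PROVIDER credential_chain);")

layers = {
    "bronze": "s3://company-lake/raw/postgres/orders/*/*.parquet",
    "silver": "s3://company-lake/curated/finance/orders/*/*.parquet",
}
for layer, path in layers.items():
    count = con.execute(f"SELECT COUNT(*) FROM read_parquet('{path}')").fetchone()[0]
    print(f"{layer}: {count:,} rows")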

Alternative: Silver via Lens SQL Transforms

If you prefer SQL over Python for cleaning:

~/.dataspoc-lens/transforms/001_clean_orders.sql:

CREATE OR REPLACE TABLE clean_orders AS
SELECT
    id,
    customer_id,
    CAST(total AS DOUBLE) AS total,
    LOWER(TRIM(status)) AS status,
    created_at,
    updated_at
FROM orders
WHERE id IS NOT NULL
  AND total > 0
  AND LOWER(TRIM(status)) IN ('pending', 'shipped', 'canceled');

~/.dataspoc-lens/transforms/002_clean_customers.sql:

CREATE OR REPLACE TABLE clean_customers AS
SELECT
    id,
    COALESCE(name, 'Unknown') AS name,
    LOWER(TRIM(email)) AS email,
    country,
    created_at
FROM customers
WHERE id IS NOT NULL
  AND email NOT LIKE '%@test.com';
Terminal window
dataspoc-lens transform run

Step 3: Gold Layer — Aggregate with Lens

Gold is business-ready: aggregations, joins, KPIs. Lens SQL transforms handle this.

~/.dataspoc-lens/transforms/003_customer_360.sql:

CREATE OR REPLACE TABLE customer_360 AS
SELECT
    c.id AS customer_id,
    c.name,
    c.email,
    c.country,
    COUNT(o.id) AS total_orders,
    COALESCE(SUM(o.total), 0) AS lifetime_value,
    MIN(o.created_at) AS first_order,
    MAX(o.created_at) AS last_order,
    DATEDIFF('day', MAX(o.created_at), CURRENT_DATE) AS days_since_last_order,
    CASE
        WHEN DATEDIFF('day', MAX(o.created_at), CURRENT_DATE) > 90 THEN 'at_risk'
        WHEN DATEDIFF('day', MAX(o.created_at), CURRENT_DATE) > 30 THEN 'cooling'
        ELSE 'active'
    END AS status
FROM clean_customers c
LEFT JOIN clean_orders o ON c.id = o.customer_id
GROUP BY c.id, c.name, c.email, c.country;

~/.dataspoc-lens/transforms/004_monthly_revenue.sql:

CREATE OR REPLACE TABLE monthly_revenue AS
SELECT
    DATE_TRUNC('month', created_at) AS month,
    COUNT(*) AS order_count,
    SUM(total) AS revenue,
    COUNT(DISTINCT customer_id) AS unique_customers,
    SUM(total) / COUNT(DISTINCT customer_id) AS revenue_per_customer
FROM clean_orders
WHERE status != 'canceled'
GROUP BY 1
ORDER BY 1;

~/.dataspoc-lens/transforms/005_kpi_dashboard.sql:

CREATE OR REPLACE TABLE kpi_dashboard AS
SELECT
    (SELECT COUNT(*) FROM clean_customers) AS total_customers,
    (SELECT COUNT(*) FROM customer_360 WHERE status = 'active') AS active_customers,
    (SELECT SUM(total) FROM clean_orders WHERE created_at >= DATE_TRUNC('month', CURRENT_DATE)) AS mtd_revenue,
    (SELECT COUNT(*) FROM clean_orders WHERE created_at >= DATE_TRUNC('month', CURRENT_DATE)) AS mtd_orders,
    (SELECT AVG(lifetime_value) FROM customer_360) AS avg_ltv,
    (SELECT COUNT(*) FROM customer_360 WHERE status = 'at_risk') AS at_risk_customers;
Terminal window
dataspoc-lens transform list
dataspoc-lens transform run

Result: Business-ready tables in Gold. Query them instantly:

Terminal window
dataspoc-lens query "SELECT * FROM kpi_dashboard"
dataspoc-lens query "SELECT * FROM monthly_revenue ORDER BY month DESC LIMIT 12"
dataspoc-lens ask "which customers are at risk of churning?"

The Full Pipeline: Bronze → Silver → Gold

Every 6 hours (cron):

1. dataspoc-pipe run postgres-prod        # Bronze: ingest raw data
2. dataspoc-pipe run postgres-prod-clean  # Silver: ingest with transforms
3. dataspoc-lens transform run            # Gold: SQL aggregations

Or automate the refresh with a simple script (refresh-lake.sh):

#!/bin/bash
set -e
dataspoc-pipe run postgres-prod
dataspoc-pipe run postgres-prod-clean
dataspoc-pipe run stripe-payments
dataspoc-lens transform run
echo "Medallion refresh complete at $(date)"

Schedule the script:

Terminal window
# Run every 6 hours
crontab -e
0 */6 * * * /path/to/refresh-lake.sh >> /var/log/lake-refresh.log 2>&1
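
If you outgrow the shell script (say you want per-step timing, or to stop on the first failure with a useful log line), a small Python wrapper around the same CLI commands does the job. A sketch; the commands are exactly the ones used above:

#!/usr/bin/env python3
"""Refresh the medallion lake layer by layer, stopping on the first failure."""
import subprocess
import sys
from datetime import datetime, timezone

STEPS = [
    ["dataspoc-pipe", "run", "postgres-prod"],        # Bronze: Postgres
    ["dataspoc-pipe", "run", "stripe-payments"],      # Bronze: Stripe
    ["dataspoc-pipe", "run", "postgres-prod-clean"],  # Silver
    ["dataspoc-lens", "transform", "run"],            # Gold
]

for cmd in STEPS:
    started = datetime.now(timezone.utc)
    print(f"[{started.isoformat()}] running: {' '.join(cmd)}", flush=True)
    result = subprocess.run(cmd)
    if result.returncode != 0:
        sys.exit(f"step failed ({result.returncode}): {' '.join(cmd)}")
    print(f"  done in {(datetime.now(timezone.utc) - started).total_seconds():.0f}s")

print("Medallion refresh complete")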

Query Every Layer

With Lens, all three layers are queryable:

Terminal window
dataspoc-lens add-bucket s3://company-lake
dataspoc-lens shell
-- Bronze: raw data (debug, audit)
lens> SELECT * FROM orders LIMIT 5;
-- Silver: clean data (analysis)
lens> SELECT * FROM clean_orders WHERE status = 'shipped' LIMIT 5;
-- Gold: business metrics (dashboards, reports)
lens> SELECT * FROM monthly_revenue ORDER BY month DESC LIMIT 12;
lens> SELECT * FROM customer_360 WHERE status = 'at_risk';
lens> SELECT * FROM kpi_dashboard;

Or ask in natural language:

Terminal window
dataspoc-lens ask "monthly revenue trend for the last year"
dataspoc-lens ask "top 10 customers by lifetime value"
dataspoc-lens ask "how many customers are at risk of churning?"

Let AI Agents Query the Gold Layer

Connect Claude, Cursor, or any MCP agent to the Gold layer:

Terminal window
dataspoc-lens mcp
User: "Give me a summary of this month's KPIs."
Agent: [MCP] query("SELECT * FROM kpi_dashboard")
Agent: "Here's this month's performance:
- 12,847 total customers (9,231 active)
- $487k MTD revenue from 3,241 orders
- Average LTV: $1,247
- 847 customers flagged as at-risk (no order in 90+ days)"

Medallion vs Raw/Clean/Curated Naming

Two common naming conventions — same concept:

Medallion | Alternative            | DataSpoc path              | Who writes
----------|------------------------|----------------------------|------------------------------------
Bronze    | Raw                    | raw/<source>/<table>/      | Pipe
Silver    | Clean / Curated        | curated/<domain>/<table>/  | Pipe transforms or Lens transforms
Gold      | Aggregated / Business  | gold/<domain>/<table>/     | Lens transforms

DataSpoc’s default convention uses raw/curated/gold which maps to both naming styles. Use whichever your team prefers.

Comparison: Databricks Medallion vs DataSpoc

          | Databricks                       | DataSpoc
----------|----------------------------------|----------------------------------------------------
Setup     | Cluster + workspace + notebooks  | pip install dataspoc-pipe dataspoc-lens
Bronze    | Auto Loader + Delta Live Tables  | dataspoc-pipe run
Silver    | Spark transformations            | Pipe transforms (Python) or Lens transforms (SQL)
Gold      | Spark SQL + materialized views   | Lens SQL transforms (CTAS)
Cost      | $3k-10k/month                    | $0 (+ S3 storage)
Format    | Delta Lake                       | Parquet (open, no lock-in)
AI agents | Not native                       | MCP + SDK built-in
Scale     | Petabytes                        | Up to ~100GB per query (DuckDB)
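
The "no lock-in" row is literal: a Gold table is just a folder of Parquet files, so for example plain pandas can read it without DataSpoc in the loop (a sketch assuming pyarrow and s3fs are installed and AWS credentials are configured):

# Read the Gold monthly_revenue table with pandas alone.
import pandas as pd

df = pd.read_parquet("s3://company-lake/gold/finance/monthly_revenue/")
print(df.sort_values("month", ascending=False).head(12))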

When to use Databricks instead

  • Petabyte-scale data
  • Real-time streaming (Structured Streaming)
  • Team already invested in Spark
  • Need for ACID transactions on the lake (Delta Lake)
  • Complex ML pipelines with MLflow

When DataSpoc is enough

  • Data under 100GB per table
  • Team of 1-20 people
  • Budget-conscious (startup, small company)
  • Want AI agent integration
  • Prefer CLI over notebooks
  • Don’t want vendor lock-in

Full Working Example

Here’s the complete setup from zero to medallion:

Terminal window
# Install
pip install dataspoc-pipe[s3] dataspoc-lens[s3,ai]
# Bronze: ingest
dataspoc-pipe init
dataspoc-pipe add postgres-prod
dataspoc-pipe run postgres-prod
# Silver: clean (via Lens SQL)
dataspoc-lens init
dataspoc-lens add-bucket s3://company-lake
cat > ~/.dataspoc-lens/transforms/001_clean_orders.sql << 'EOF'
CREATE OR REPLACE TABLE clean_orders AS
SELECT id, customer_id, CAST(total AS DOUBLE) AS total,
LOWER(TRIM(status)) AS status, created_at
FROM orders WHERE id IS NOT NULL AND total > 0;
EOF
cat > ~/.dataspoc-lens/transforms/002_clean_customers.sql << 'EOF'
CREATE OR REPLACE TABLE clean_customers AS
SELECT id, COALESCE(name, 'Unknown') AS name,
LOWER(TRIM(email)) AS email, created_at
FROM customers WHERE id IS NOT NULL;
EOF
# Gold: aggregate
cat > ~/.dataspoc-lens/transforms/003_customer_360.sql << 'EOF'
CREATE OR REPLACE TABLE customer_360 AS
SELECT c.id, c.name, c.email,
COUNT(o.id) AS orders, COALESCE(SUM(o.total), 0) AS ltv
FROM clean_customers c
LEFT JOIN clean_orders o ON c.id = o.customer_id
GROUP BY c.id, c.name, c.email;
EOF
# Run transforms
dataspoc-lens transform run
# Query Gold
dataspoc-lens ask "top customers by lifetime value"
# Connect AI agent
dataspoc-lens mcp

Total time: 30 minutes. Total cost: $0.


The medallion architecture isn’t about Databricks. It’s about organizing data in layers. DataSpoc gives you the same pattern — at a fraction of the cost and complexity.
