medalliondata-lakearchitecturedatabricksdata-engineeringtutorial

Construindo um Data Lake com Arquitetura Medallion Usando DataSpoc

Michael San Martim · 2026-04-29

A arquitetura medallion (Bronze → Silver → Gold) é o padrão mais popular para organizar data lakes. O Databricks popularizou, mas você não precisa do Databricks para implementá-la.

Com DataSpoc Pipe e Lens, você pode construir um data lake medallion completo no S3 usando apenas pip install — sem Spark, sem cluster, sem licença de $50k/ano.

O Que É a Arquitetura Medallion?

Três camadas, cada uma com um propósito claro:

┌─────────────────────────────────────────────────────────────┐
│ │
│ Sources ──→ Bronze ──→ Silver ──→ Gold │
│ (raw) (ingested) (cleaned) (business-ready) │
│ │
│ Pipe writes Pipe writes Lens transforms Lens transforms│
│ │
└─────────────────────────────────────────────────────────────┘
CamadaTambém chamadaQuem escreveQuem lêQualidade
BronzeRawPipe (ingestão)Engenheiros de DadosComo veio da fonte
SilverCurated / CleanPipe (transforms)Analistas, EngenheirosLimpo, tipado, deduplicado
GoldAggregated / BusinessLens (SQL transforms)Todos, agentes de IAMétricas de negócio, pronto para consulta

A Estrutura do Bucket

A convenção de bucket do DataSpoc mapeia diretamente para medallion:

s3://company-lake/
.dataspoc/
manifest.json # Catalog (auto-updated)
state/<pipeline>/state.json # Incremental bookmarks
logs/<pipeline>/<timestamp>.json # Execution logs
raw/ # ← BRONZE
postgres/
orders/dt=2026-04-28/orders_0000.parquet
customers/dt=2026-04-28/customers_0000.parquet
stripe/
payments/dt=2026-04-28/payments_0000.parquet
hubspot/
contacts/dt=2026-04-28/contacts_0000.parquet
curated/ # ← SILVER
finance/
clean_orders/dt=2026-04-28/clean_orders_0000.parquet
clean_customers/dt=2026-04-28/clean_customers_0000.parquet
marketing/
clean_contacts/dt=2026-04-28/clean_contacts_0000.parquet
gold/ # ← GOLD
finance/
monthly_revenue/monthly_revenue_0000.parquet
customer_360/customer_360_0000.parquet
executive/
kpi_dashboard/kpi_dashboard_0000.parquet

Passo 1: Camada Bronze — Ingestão com Pipe

Bronze são dados brutos, como vieram da fonte. O Pipe cuida disso sem nenhuma transformação.

Terminal window
pip install dataspoc-pipe[s3]
dataspoc-pipe init

Adicione suas fontes

Terminal window
# PostgreSQL production database
dataspoc-pipe add postgres-prod
# Stripe payments
dataspoc-pipe add stripe-payments
# HubSpot CRM
dataspoc-pipe add hubspot-crm

Configurações do pipeline

~/.dataspoc-pipe/pipelines/postgres-prod.yaml:

source:
tap: tap-postgres
config: ~/.dataspoc-pipe/sources/postgres-prod.json
streams:
- orders
- customers
- products
destination:
bucket: s3://company-lake
path: raw
compression: zstd
incremental:
enabled: true
schedule:
cron: "0 */6 * * *"

~/.dataspoc-pipe/sources/postgres-prod.json:

{
"host": "db.company.com",
"port": 5432,
"user": "dataspoc_reader",
"dbname": "production",
"filter_schemas": ["public"]
}

Executar e agendar

Terminal window
# Run all pipelines
dataspoc-pipe run _ --all
# Install cron schedules
dataspoc-pipe schedule install
# Check status
dataspoc-pipe status

Resultado: Dados brutos chegam em s3://company-lake/raw/<source>/<table>/ como Parquet. Esta é sua camada Bronze.

Passo 2: Camada Silver — Limpeza com Pipe Transforms

Silver são dados limpos, tipados e deduplicados. Os transforms baseados em convenção do Pipe cuidam disso durante a ingestão.

Criar arquivos de transformação

~/.dataspoc-pipe/transforms/postgres-prod.py:

"""Transform raw Postgres data during ingestion."""
def transform(df):
"""Called per batch during extraction. Receives a pandas DataFrame."""
# Standardize email to lowercase
if "email" in df.columns:
df["email"] = df["email"].str.lower().str.strip()
# Remove test/internal records
if "email" in df.columns:
df = df[~df["email"].str.endswith("@test.com")]
# Parse dates (some come as strings)
for col in ["created_at", "updated_at"]:
if col in df.columns:
df[col] = pd.to_datetime(df[col], errors="coerce")
# Drop duplicates by primary key
if "id" in df.columns:
df = df.drop_duplicates(subset=["id"], keep="last")
# Remove null IDs
if "id" in df.columns:
df = df.dropna(subset=["id"])
return df

Agora mude o destino para curated para dados limpos:

~/.dataspoc-pipe/pipelines/postgres-prod-clean.yaml:

source:
tap: tap-postgres
config: ~/.dataspoc-pipe/sources/postgres-prod.json
streams:
- orders
- customers
destination:
bucket: s3://company-lake
path: curated/finance
compression: zstd
incremental:
enabled: true
schedule:
cron: "30 */6 * * *" # 30 min after bronze
Terminal window
dataspoc-pipe run postgres-prod-clean

Resultado: Dados limpos chegam em s3://company-lake/curated/finance/<table>/. Esta é sua camada Silver.

Alternativa: Silver via Lens SQL Transforms

Se você prefere SQL em vez de Python para limpeza:

~/.dataspoc-lens/transforms/001_clean_orders.sql:

CREATE OR REPLACE TABLE clean_orders AS
SELECT
id,
customer_id,
CAST(total AS DOUBLE) AS total,
LOWER(TRIM(status)) AS status,
created_at,
updated_at
FROM orders
WHERE id IS NOT NULL
AND total > 0
AND status IN ('pending', 'shipped', 'canceled');

~/.dataspoc-lens/transforms/002_clean_customers.sql:

CREATE OR REPLACE TABLE clean_customers AS
SELECT
id,
COALESCE(name, 'Unknown') AS name,
LOWER(TRIM(email)) AS email,
country,
created_at
FROM customers
WHERE id IS NOT NULL
AND email NOT LIKE '%@test.com';
Terminal window
dataspoc-lens transform run

Passo 3: Camada Gold — Agregação com Lens

Gold é o que está pronto para o negócio: agregações, joins, KPIs. Lens SQL transforms cuidam disso.

~/.dataspoc-lens/transforms/003_customer_360.sql:

CREATE OR REPLACE TABLE customer_360 AS
SELECT
c.id AS customer_id,
c.name,
c.email,
c.country,
COUNT(o.id) AS total_orders,
COALESCE(SUM(o.total), 0) AS lifetime_value,
MIN(o.created_at) AS first_order,
MAX(o.created_at) AS last_order,
DATEDIFF('day', MAX(o.created_at), CURRENT_DATE) AS days_since_last_order,
CASE
WHEN DATEDIFF('day', MAX(o.created_at), CURRENT_DATE) > 90 THEN 'at_risk'
WHEN DATEDIFF('day', MAX(o.created_at), CURRENT_DATE) > 30 THEN 'cooling'
ELSE 'active'
END AS status
FROM clean_customers c
LEFT JOIN clean_orders o ON c.id = o.customer_id
GROUP BY c.id, c.name, c.email, c.country;

~/.dataspoc-lens/transforms/004_monthly_revenue.sql:

CREATE OR REPLACE TABLE monthly_revenue AS
SELECT
DATE_TRUNC('month', created_at) AS month,
COUNT(*) AS order_count,
SUM(total) AS revenue,
COUNT(DISTINCT customer_id) AS unique_customers,
SUM(total) / COUNT(DISTINCT customer_id) AS revenue_per_customer
FROM clean_orders
WHERE status != 'canceled'
GROUP BY 1
ORDER BY 1;

~/.dataspoc-lens/transforms/005_kpi_dashboard.sql:

CREATE OR REPLACE TABLE kpi_dashboard AS
SELECT
(SELECT COUNT(*) FROM clean_customers) AS total_customers,
(SELECT COUNT(*) FROM clean_customers WHERE status = 'active') AS active_customers,
(SELECT SUM(total) FROM clean_orders WHERE created_at >= DATE_TRUNC('month', CURRENT_DATE)) AS mtd_revenue,
(SELECT COUNT(*) FROM clean_orders WHERE created_at >= DATE_TRUNC('month', CURRENT_DATE)) AS mtd_orders,
(SELECT AVG(lifetime_value) FROM customer_360) AS avg_ltv,
(SELECT COUNT(*) FROM customer_360 WHERE status = 'at_risk') AS at_risk_customers;
Terminal window
dataspoc-lens transform list
dataspoc-lens transform run

Resultado: Tabelas prontas para o negócio no Gold. Consulte instantaneamente:

Terminal window
dataspoc-lens query "SELECT * FROM kpi_dashboard"
dataspoc-lens query "SELECT * FROM monthly_revenue ORDER BY month DESC LIMIT 12"
dataspoc-lens ask "which customers are at risk of churning?"

O Pipeline Completo: Bronze → Silver → Gold

Every 6 hours (cron):
1. dataspoc-pipe run _ --all # Bronze: ingest raw data
2. dataspoc-pipe run _ --all # Silver: ingest with transforms
3. dataspoc-lens transform run # Gold: SQL aggregations
Or automate with a simple script:
#!/bin/bash
dataspoc-pipe run postgres-prod
dataspoc-pipe run postgres-prod-clean
dataspoc-pipe run stripe-payments
dataspoc-lens transform run
echo "Medallion refresh complete at $(date)"

Agende o script:

Terminal window
# Run every 6 hours
crontab -e
0 */6 * * * /path/to/refresh-lake.sh >> /var/log/lake-refresh.log 2>&1

Consulte Cada Camada

Com o Lens, todas as três camadas são consultáveis:

Terminal window
dataspoc-lens add-bucket s3://company-lake
dataspoc-lens shell
-- Bronze: raw data (debug, audit)
lens> SELECT * FROM orders LIMIT 5;
-- Silver: clean data (analysis)
lens> SELECT * FROM clean_orders WHERE status = 'shipped' LIMIT 5;
-- Gold: business metrics (dashboards, reports)
lens> SELECT * FROM monthly_revenue ORDER BY month DESC LIMIT 12;
lens> SELECT * FROM customer_360 WHERE status = 'at_risk';
lens> SELECT * FROM kpi_dashboard;

Ou pergunte em linguagem natural:

Terminal window
dataspoc-lens ask "monthly revenue trend for the last year"
dataspoc-lens ask "top 10 customers by lifetime value"
dataspoc-lens ask "how many customers are at risk of churning?"

Deixe Agentes de IA Consultarem a Camada Gold

Conecte Claude, Cursor ou qualquer agente MCP à camada Gold:

Terminal window
dataspoc-lens mcp
User: "Give me a summary of this month's KPIs."
Agent: [MCP] query("SELECT * FROM kpi_dashboard")
Agent: "Here's this month's performance:
- 12,847 total customers (9,231 active)
- $487k MTD revenue from 3,241 orders
- Average LTV: $1,247
- 847 customers flagged as at-risk (no order in 90+ days)"

Medallion vs Nomenclatura Raw/Clean/Curated

Duas convenções de nomenclatura comuns — mesmo conceito:

MedallionAlternativaCaminho DataSpocQuem escreve
BronzeRawraw/<source>/<table>/Pipe
SilverClean / Curatedcurated/<domain>/<table>/Pipe transforms ou Lens transforms
GoldAggregated / Businessgold/<domain>/<table>/Lens transforms

A convenção padrão do DataSpoc usa raw/curated/gold que mapeia para ambos os estilos de nomenclatura. Use o que sua equipe preferir.

Comparação: Databricks Medallion vs DataSpoc

DatabricksDataSpoc
SetupCluster + workspace + notebookspip install dataspoc-pipe dataspoc-lens
BronzeAuto Loader + Delta Live Tablesdataspoc-pipe run
SilverTransformações SparkPipe transforms (Python) ou Lens transforms (SQL)
GoldSpark SQL + materialized viewsLens SQL transforms (CTAS)
Custo$3k-10k/mês$0 (+ armazenamento S3)
FormatoDelta LakeParquet (aberto, sem lock-in)
Agentes de IANão nativoMCP + SDK integrado
EscalaPetabytesAté ~100GB por consulta (DuckDB)

Quando usar Databricks em vez disso

  • Dados em escala de petabytes
  • Streaming em tempo real (Structured Streaming)
  • Equipe já investida em Spark
  • Necessidade de transações ACID no lake (Delta Lake)
  • Pipelines de ML complexos com MLflow

Quando DataSpoc é suficiente

  • Dados abaixo de 100GB por tabela
  • Equipe de 1-20 pessoas
  • Consciente de custos (startup, empresa pequena)
  • Quer integração com agentes de IA
  • Prefere CLI a notebooks
  • Não quer lock-in de fornecedor

Exemplo Completo Funcional

Aqui está a configuração completa do zero ao medallion:

Terminal window
# Install
pip install dataspoc-pipe[s3] dataspoc-lens[s3,ai]
# Bronze: ingest
dataspoc-pipe init
dataspoc-pipe add postgres-prod
dataspoc-pipe run postgres-prod
# Silver: clean (via Lens SQL)
dataspoc-lens init
dataspoc-lens add-bucket s3://company-lake
cat > ~/.dataspoc-lens/transforms/001_clean_orders.sql << 'EOF'
CREATE OR REPLACE TABLE clean_orders AS
SELECT id, customer_id, CAST(total AS DOUBLE) AS total,
LOWER(TRIM(status)) AS status, created_at
FROM orders WHERE id IS NOT NULL AND total > 0;
EOF
cat > ~/.dataspoc-lens/transforms/002_clean_customers.sql << 'EOF'
CREATE OR REPLACE TABLE clean_customers AS
SELECT id, COALESCE(name, 'Unknown') AS name,
LOWER(TRIM(email)) AS email, created_at
FROM customers WHERE id IS NOT NULL;
EOF
# Gold: aggregate
cat > ~/.dataspoc-lens/transforms/003_customer_360.sql << 'EOF'
CREATE OR REPLACE TABLE customer_360 AS
SELECT c.id, c.name, c.email,
COUNT(o.id) AS orders, COALESCE(SUM(o.total), 0) AS ltv
FROM clean_customers c
LEFT JOIN clean_orders o ON c.id = o.customer_id
GROUP BY c.id, c.name, c.email;
EOF
# Run transforms
dataspoc-lens transform run
# Query Gold
dataspoc-lens ask "top customers by lifetime value"
# Connect AI agent
dataspoc-lens mcp

Tempo total: 30 minutos. Custo total: $0.


A arquitetura medallion não é sobre Databricks. É sobre organizar dados em camadas. O DataSpoc te dá o mesmo padrão — por uma fração do custo e complexidade.

Recomendados