# Building a Data Lake with the Medallion Architecture Using DataSpoc
The medallion architecture (Bronze → Silver → Gold) is the most popular pattern for organizing data lakes. Databricks popularized it, but you don't need Databricks to implement it.
With DataSpoc Pipe and Lens, you can build a complete medallion data lake on S3 using nothing but pip install: no Spark, no cluster, no $50k/year license.
## What Is the Medallion Architecture?

Three layers, each with a clear purpose:
```
┌──────────────────────────────────────────────────────────────┐
│                                                              │
│   Sources ──→ Bronze ──→ Silver ──→ Gold                     │
│    (raw)    (ingested)  (cleaned)  (business-ready)          │
│                                                              │
│   Pipe writes  Pipe writes  Lens transforms  Lens transforms │
│                                                              │
└──────────────────────────────────────────────────────────────┘
```

| Layer | Also called | Written by | Read by | Quality |
|---|---|---|---|---|
| Bronze | Raw | Pipe (ingestion) | Data engineers | As it arrived from the source |
| Silver | Curated / Clean | Pipe (transforms) | Analysts, engineers | Clean, typed, deduplicated |
| Gold | Aggregated / Business | Lens (SQL transforms) | Everyone, AI agents | Business metrics, query-ready |
## The Bucket Structure

DataSpoc's bucket convention maps directly onto medallion:
```
s3://company-lake/
  .dataspoc/
    manifest.json                      # Catalog (auto-updated)
    state/<pipeline>/state.json        # Incremental bookmarks
    logs/<pipeline>/<timestamp>.json   # Execution logs

  raw/                                 # ← BRONZE
    postgres/
      orders/dt=2026-04-28/orders_0000.parquet
      customers/dt=2026-04-28/customers_0000.parquet
    stripe/
      payments/dt=2026-04-28/payments_0000.parquet
    hubspot/
      contacts/dt=2026-04-28/contacts_0000.parquet

  curated/                             # ← SILVER
    finance/
      clean_orders/dt=2026-04-28/clean_orders_0000.parquet
      clean_customers/dt=2026-04-28/clean_customers_0000.parquet
    marketing/
      clean_contacts/dt=2026-04-28/clean_contacts_0000.parquet

  gold/                                # ← GOLD
    finance/
      monthly_revenue/monthly_revenue_0000.parquet
      customer_360/customer_360_0000.parquet
    executive/
      kpi_dashboard/kpi_dashboard_0000.parquet
```

## Step 1: The Bronze Layer (Ingestion with Pipe)
Bronze is raw data, exactly as it arrived from the source. Pipe handles this with no transformation at all.
```bash
pip install dataspoc-pipe[s3]
dataspoc-pipe init
```

### Add your sources
```bash
# PostgreSQL production database
dataspoc-pipe add postgres-prod

# Stripe payments
dataspoc-pipe add stripe-payments

# HubSpot CRM
dataspoc-pipe add hubspot-crm
```

### Pipeline configuration
`~/.dataspoc-pipe/pipelines/postgres-prod.yaml`:
```yaml
source:
  tap: tap-postgres
  config: ~/.dataspoc-pipe/sources/postgres-prod.json
  streams:
    - orders
    - customers
    - products

destination:
  bucket: s3://company-lake
  path: raw
  compression: zstd

incremental:
  enabled: true

schedule:
  cron: "0 */6 * * *"
```

`~/.dataspoc-pipe/sources/postgres-prod.json`:

```json
{
  "host": "db.company.com",
  "port": 5432,
  "user": "dataspoc_reader",
  "dbname": "production",
  "filter_schemas": ["public"]
}
```

### Run and schedule
```bash
# Run all pipelines
dataspoc-pipe run _ --all

# Install cron schedules
dataspoc-pipe schedule install

# Check status
dataspoc-pipe status
```

Result: raw data lands in `s3://company-lake/raw/<source>/<table>/` as Parquet. This is your Bronze layer.
## Step 2: The Silver Layer (Cleaning with Pipe Transforms)

Silver is clean, typed, deduplicated data. Pipe's convention-based transforms handle this during ingestion.

### Create transform files
`~/.dataspoc-pipe/transforms/postgres-prod.py`:
```python
"""Transform raw Postgres data during ingestion."""
import pandas as pd


def transform(df):
    """Called per batch during extraction. Receives a pandas DataFrame."""
    # Standardize email to lowercase
    if "email" in df.columns:
        df["email"] = df["email"].str.lower().str.strip()

    # Remove test/internal records
    if "email" in df.columns:
        df = df[~df["email"].str.endswith("@test.com")]

    # Parse dates (some come as strings)
    for col in ["created_at", "updated_at"]:
        if col in df.columns:
            df[col] = pd.to_datetime(df[col], errors="coerce")

    # Drop duplicates by primary key
    if "id" in df.columns:
        df = df.drop_duplicates(subset=["id"], keep="last")

    # Remove null IDs
    if "id" in df.columns:
        df = df.dropna(subset=["id"])

    return df
```

Now point the destination at `curated` to get clean data:
`~/.dataspoc-pipe/pipelines/postgres-prod-clean.yaml`:
```yaml
source:
  tap: tap-postgres
  config: ~/.dataspoc-pipe/sources/postgres-prod.json
  streams:
    - orders
    - customers

destination:
  bucket: s3://company-lake
  path: curated/finance
  compression: zstd

incremental:
  enabled: true

schedule:
  cron: "30 */6 * * *"  # 30 min after bronze
```

```bash
dataspoc-pipe run postgres-prod-clean
```

Result: clean data lands in `s3://company-lake/curated/finance/<table>/`. This is your Silver layer.
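To see what this cleaning buys you, you can run the same steps against a toy DataFrame. A standalone sketch mirroring the transform above (the sample rows are invented for illustration):

```python
import pandas as pd


def transform(df):
    """Same cleaning steps as the Pipe transform above."""
    if "email" in df.columns:
        # Standardize, then drop test addresses
        df["email"] = df["email"].str.lower().str.strip()
        df = df[~df["email"].str.endswith("@test.com")]
    if "id" in df.columns:
        # Deduplicate by primary key, then drop null IDs
        df = df.drop_duplicates(subset=["id"], keep="last")
        df = df.dropna(subset=["id"])
    return df


raw = pd.DataFrame({
    "id": [1, 1, 2, None],
    "email": [" A@X.COM ", "a@x.com", "bot@test.com", "c@y.com"],
})
clean = transform(raw)
```

Of the four input rows, only `id=1` with `a@x.com` survives: the test address and the null-ID row are dropped, and the duplicate ID is collapsed.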
## Alternative: Silver via Lens SQL Transforms

If you prefer SQL over Python for cleaning:

`~/.dataspoc-lens/transforms/001_clean_orders.sql`:
```sql
CREATE OR REPLACE TABLE clean_orders AS
SELECT
    id,
    customer_id,
    CAST(total AS DOUBLE) AS total,
    LOWER(TRIM(status)) AS status,
    created_at,
    updated_at
FROM orders
WHERE id IS NOT NULL
  AND total > 0
  AND status IN ('pending', 'shipped', 'canceled');
```

`~/.dataspoc-lens/transforms/002_clean_customers.sql`:

```sql
CREATE OR REPLACE TABLE clean_customers AS
SELECT
    id,
    COALESCE(name, 'Unknown') AS name,
    LOWER(TRIM(email)) AS email,
    country,
    created_at
FROM customers
WHERE id IS NOT NULL
  AND email NOT LIKE '%@test.com';
```

```bash
dataspoc-lens transform run
```

## Step 3: The Gold Layer (Aggregation with Lens)
Gold is the business-ready layer: aggregations, joins, KPIs. Lens SQL transforms handle it.
`~/.dataspoc-lens/transforms/003_customer_360.sql`:
```sql
CREATE OR REPLACE TABLE customer_360 AS
SELECT
    c.id AS customer_id,
    c.name,
    c.email,
    c.country,
    COUNT(o.id) AS total_orders,
    COALESCE(SUM(o.total), 0) AS lifetime_value,
    MIN(o.created_at) AS first_order,
    MAX(o.created_at) AS last_order,
    DATEDIFF('day', MAX(o.created_at), CURRENT_DATE) AS days_since_last_order,
    CASE
        WHEN DATEDIFF('day', MAX(o.created_at), CURRENT_DATE) > 90 THEN 'at_risk'
        WHEN DATEDIFF('day', MAX(o.created_at), CURRENT_DATE) > 30 THEN 'cooling'
        ELSE 'active'
    END AS status
FROM clean_customers c
LEFT JOIN clean_orders o ON c.id = o.customer_id
GROUP BY c.id, c.name, c.email, c.country;
```

`~/.dataspoc-lens/transforms/004_monthly_revenue.sql`:

```sql
CREATE OR REPLACE TABLE monthly_revenue AS
SELECT
    DATE_TRUNC('month', created_at) AS month,
    COUNT(*) AS order_count,
    SUM(total) AS revenue,
    COUNT(DISTINCT customer_id) AS unique_customers,
    SUM(total) / COUNT(DISTINCT customer_id) AS revenue_per_customer
FROM clean_orders
WHERE status != 'canceled'
GROUP BY 1
ORDER BY 1;
```

`~/.dataspoc-lens/transforms/005_kpi_dashboard.sql`:

```sql
CREATE OR REPLACE TABLE kpi_dashboard AS
SELECT
    (SELECT COUNT(*) FROM clean_customers) AS total_customers,
    (SELECT COUNT(*) FROM customer_360 WHERE status = 'active') AS active_customers,
    (SELECT SUM(total) FROM clean_orders
     WHERE created_at >= DATE_TRUNC('month', CURRENT_DATE)) AS mtd_revenue,
    (SELECT COUNT(*) FROM clean_orders
     WHERE created_at >= DATE_TRUNC('month', CURRENT_DATE)) AS mtd_orders,
    (SELECT AVG(lifetime_value) FROM customer_360) AS avg_ltv,
    (SELECT COUNT(*) FROM customer_360 WHERE status = 'at_risk') AS at_risk_customers;
```

```bash
dataspoc-lens transform list
dataspoc-lens transform run
```

Result: business-ready tables in the Gold layer. Query them instantly:
```bash
dataspoc-lens query "SELECT * FROM kpi_dashboard"
dataspoc-lens query "SELECT * FROM monthly_revenue ORDER BY month DESC LIMIT 12"
dataspoc-lens ask "which customers are at risk of churning?"
```

## The Full Pipeline: Bronze → Silver → Gold
Every 6 hours (cron):
```bash
dataspoc-pipe run _ --all    # 1. Bronze: ingest raw data
dataspoc-pipe run _ --all    # 2. Silver: ingest with transforms
dataspoc-lens transform run  # 3. Gold: SQL aggregations
```
Or automate with a simple script:
```bash
#!/bin/bash
dataspoc-pipe run postgres-prod
dataspoc-pipe run postgres-prod-clean
dataspoc-pipe run stripe-payments
dataspoc-lens transform run
echo "Medallion refresh complete at $(date)"
```

Schedule the script:
```bash
# Run every 6 hours
crontab -e
0 */6 * * * /path/to/refresh-lake.sh >> /var/log/lake-refresh.log 2>&1
```

## Query Every Layer
With Lens, all three layers are queryable:
```bash
dataspoc-lens add-bucket s3://company-lake
dataspoc-lens shell
```

```sql
-- Bronze: raw data (debug, audit)
lens> SELECT * FROM orders LIMIT 5;

-- Silver: clean data (analysis)
lens> SELECT * FROM clean_orders WHERE status = 'shipped' LIMIT 5;

-- Gold: business metrics (dashboards, reports)
lens> SELECT * FROM monthly_revenue ORDER BY month DESC LIMIT 12;
lens> SELECT * FROM customer_360 WHERE status = 'at_risk';
lens> SELECT * FROM kpi_dashboard;
```

Or ask in natural language:
```bash
dataspoc-lens ask "monthly revenue trend for the last year"
dataspoc-lens ask "top 10 customers by lifetime value"
dataspoc-lens ask "how many customers are at risk of churning?"
```

## Let AI Agents Query the Gold Layer
Connect Claude, Cursor, or any MCP agent to the Gold layer:
```bash
dataspoc-lens mcp
```

User: "Give me a summary of this month's KPIs."

Agent: [MCP] `query("SELECT * FROM kpi_dashboard")`

Agent: "Here's this month's performance:
- 12,847 total customers (9,231 active)
- $487k MTD revenue from 3,241 orders
- Average LTV: $1,247
- 847 customers flagged as at-risk (no order in 90+ days)"

## Medallion vs. Raw/Clean/Curated Naming
Two common naming conventions, one concept:
| Medallion | Alternative | DataSpoc path | Written by |
|---|---|---|---|
| Bronze | Raw | `raw/<source>/<table>/` | Pipe |
| Silver | Clean / Curated | `curated/<domain>/<table>/` | Pipe transforms or Lens transforms |
| Gold | Aggregated / Business | `gold/<domain>/<table>/` | Lens transforms |
DataSpoc's default convention uses `raw`/`curated`/`gold`, which maps onto both naming styles. Use whichever your team prefers.
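Whichever names you pick, partition paths follow one mechanical rule. A hypothetical helper (not part of DataSpoc; it just encodes the convention above):

```python
from datetime import date

# Layer → bucket prefix, per the DataSpoc convention above.
# Gold tables are written unpartitioned, so only bronze/silver appear here.
LAYER_PREFIX = {"bronze": "raw", "silver": "curated"}


def partition_path(layer: str, namespace: str, table: str, dt: date) -> str:
    """Build a key prefix like raw/postgres/orders/dt=2026-04-28/."""
    return f"{LAYER_PREFIX[layer]}/{namespace}/{table}/dt={dt.isoformat()}/"


print(partition_path("bronze", "postgres", "orders", date(2026, 4, 28)))
# raw/postgres/orders/dt=2026-04-28/
```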
## Comparison: Databricks Medallion vs. DataSpoc
| | Databricks | DataSpoc |
|---|---|---|
| Setup | Cluster + workspace + notebooks | `pip install dataspoc-pipe dataspoc-lens` |
| Bronze | Auto Loader + Delta Live Tables | `dataspoc-pipe run` |
| Silver | Spark transformations | Pipe transforms (Python) or Lens transforms (SQL) |
| Gold | Spark SQL + materialized views | Lens SQL transforms (CTAS) |
| Cost | $3k-10k/month | $0 (+ S3 storage) |
| Format | Delta Lake | Parquet (open, no lock-in) |
| AI agents | Not native | Built-in MCP + SDK |
| Scale | Petabytes | Up to ~100GB per query (DuckDB) |
### When to use Databricks instead
- Petabyte-scale data
- Real-time streaming (Structured Streaming)
- Team already invested in Spark
- Need for ACID transactions on the lake (Delta Lake)
- Complex ML pipelines with MLflow
### When DataSpoc is enough
- Data under 100GB per table
- Team of 1-20 people
- Cost-conscious (startup, small company)
- Want AI-agent integration
- Prefer a CLI over notebooks
- Don't want vendor lock-in
## Complete Working Example

Here's the full setup, from zero to medallion:
```bash
# Install
pip install dataspoc-pipe[s3] dataspoc-lens[s3,ai]

# Bronze: ingest
dataspoc-pipe init
dataspoc-pipe add postgres-prod
dataspoc-pipe run postgres-prod

# Silver: clean (via Lens SQL)
dataspoc-lens init
dataspoc-lens add-bucket s3://company-lake

cat > ~/.dataspoc-lens/transforms/001_clean_orders.sql << 'EOF'
CREATE OR REPLACE TABLE clean_orders AS
SELECT id, customer_id, CAST(total AS DOUBLE) AS total,
       LOWER(TRIM(status)) AS status, created_at
FROM orders WHERE id IS NOT NULL AND total > 0;
EOF

cat > ~/.dataspoc-lens/transforms/002_clean_customers.sql << 'EOF'
CREATE OR REPLACE TABLE clean_customers AS
SELECT id, COALESCE(name, 'Unknown') AS name,
       LOWER(TRIM(email)) AS email, created_at
FROM customers WHERE id IS NOT NULL;
EOF

# Gold: aggregate
cat > ~/.dataspoc-lens/transforms/003_customer_360.sql << 'EOF'
CREATE OR REPLACE TABLE customer_360 AS
SELECT c.id, c.name, c.email,
       COUNT(o.id) AS orders,
       COALESCE(SUM(o.total), 0) AS ltv
FROM clean_customers c
LEFT JOIN clean_orders o ON c.id = o.customer_id
GROUP BY c.id, c.name, c.email;
EOF

# Run transforms
dataspoc-lens transform run

# Query Gold
dataspoc-lens ask "top customers by lifetime value"

# Connect AI agent
dataspoc-lens mcp
```

Total time: 30 minutes. Total cost: $0.
The medallion architecture isn't about Databricks. It's about organizing data in layers. DataSpoc gives you the same pattern at a fraction of the cost and complexity.