
Using Claude Code as Your Data Engineer: MCP + DataSpoc

Michael San Martim · 2026-04-18

Claude Code is a terminal-based AI coding agent. DataSpoc exposes both Pipe (ingestion) and Lens (querying) as MCP servers. Connect the two and Claude Code becomes a data engineer that can ingest data, query your lake, debug pipelines, and fix errors, all from your terminal.

Install the Tools

Terminal window
pip install dataspoc-pipe[mcp] dataspoc-lens[mcp]

This installs both CLIs with MCP server support. Verify:

Terminal window
dataspoc-pipe mcp --help
dataspoc-lens mcp --help

Configure Claude Code

Add the MCP servers to your Claude Code configuration. Create or edit .claude.json in your project root:

{
"mcpServers": {
"dataspoc-lens": {
"command": "dataspoc-lens",
"args": ["mcp"],
"env": {
"DATASPOC_BUCKET": "s3://my-company-data"
}
},
"dataspoc-pipe": {
"command": "dataspoc-pipe",
"args": ["mcp"],
"env": {
"DATASPOC_BUCKET": "s3://my-company-data"
}
}
}
}

For Azure or GCS:

{
"mcpServers": {
"dataspoc-lens": {
"command": "dataspoc-lens",
"args": ["mcp"],
"env": {
"DATASPOC_BUCKET": "gs://my-company-data"
}
}
}
}
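The snippet above shows a gs:// URI for GCS. Assuming DataSpoc follows the same pattern for Azure Blob Storage, only the bucket URI would change; the az:// scheme below is an assumption on my part, so check DataSpoc's docs for the exact form:

```json
{
  "mcpServers": {
    "dataspoc-lens": {
      "command": "dataspoc-lens",
      "args": ["mcp"],
      "env": {
        "DATASPOC_BUCKET": "az://my-company-data"
      }
    }
  }
}
```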

Restart Claude Code. It will discover the MCP tools automatically.

What Claude Code Can Do Now

Discover Tables

Ask Claude Code:

> What tables are in my data lake?

Claude Code calls lens.tables() via MCP and responds:

Your data lake has 12 tables:
- raw_postgres_orders (245K rows)
- raw_postgres_customers (18K rows)
- raw_stripe_payments (890K rows)
- curated_sales (245K rows)
- curated_customer_360 (18K rows)
- gold_revenue_by_region (48 rows)
...
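For the curious: under the hood, an MCP tool call is a JSON-RPC 2.0 request sent over stdio to the server process. A sketch of what the request behind this answer might look like (the exact tool name Lens registers is an assumption here):

```json
{
  "jsonrpc": "2.0",
  "id": 1,
  "method": "tools/call",
  "params": {
    "name": "tables",
    "arguments": {}
  }
}
```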

Explore Schemas

> What columns does curated_sales have?

Claude Code calls lens.schema("curated_sales"):

curated_sales schema:
- order_id: VARCHAR
- customer_id: VARCHAR
- product: VARCHAR
- amount: DOUBLE
- region: VARCHAR
- sale_date: DATE
- payment_method: VARCHAR

Run Queries

> What were the top 5 regions by revenue last month?

Claude Code generates and runs the SQL:

SELECT region, SUM(amount) as revenue
FROM curated_sales
WHERE sale_date >= '2026-03-01' AND sale_date < '2026-04-01'
GROUP BY region
ORDER BY revenue DESC
LIMIT 5
Results:
| region | revenue |
|---------------|-------------|
| North America | $1,240,000 |
| Europe | $890,000 |
| Asia Pacific | $670,000 |
| Latin America | $340,000 |
| Middle East | $120,000 |

Trigger Data Ingestion

> Pull the latest data from Postgres into the lake

Claude Code calls Pipe's MCP tools:

Running pipeline: postgres-to-lake
Source: tap-postgres (host: db.company.com, schema: public)
Tables: orders, customers, products
Mode: incremental (bookmark: 2026-04-17T23:00:00Z)
✓ orders: 1,247 new rows → s3://my-company-data/raw/postgres/orders/
✓ customers: 34 new rows → s3://my-company-data/raw/postgres/customers/
✓ products: 0 new rows (no changes)
Pipeline completed in 23 seconds.
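Incremental mode works the Singer way: the pipeline persists a bookmark (the high-water-mark timestamp of the last run) and extracts only rows newer than it. A minimal sketch of that logic; the file name and function names are illustrative, not DataSpoc's actual API:

```python
import json
from pathlib import Path

STATE_FILE = Path("state.json")  # illustrative state location

def load_bookmark(default="1970-01-01T00:00:00Z"):
    """Return the last saved high-water mark, or a default for the first run."""
    if STATE_FILE.exists():
        return json.loads(STATE_FILE.read_text())["bookmark"]
    return default

def extract_incremental(rows, bookmark):
    """Keep only rows strictly newer than the bookmark.

    ISO-8601 timestamps compare correctly as strings, so plain > works.
    """
    new_rows = [r for r in rows if r["updated_at"] > bookmark]
    if new_rows:
        # Advance the bookmark to the newest row seen and persist it.
        newest = max(r["updated_at"] for r in new_rows)
        STATE_FILE.write_text(json.dumps({"bookmark": newest}))
    return new_rows
```

Each run then picks up where the previous one left off: the next call to load_bookmark() returns the newest timestamp the last extraction saw.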

Read and Debug Logs

> Show me the last pipeline run. Did anything fail?

Claude Code reads the pipeline logs via Pipe's MCP server:

Last run: 2026-04-15T14:30:00Z — postgres-to-lake
Status: PARTIAL FAILURE
✓ orders: 1,247 rows extracted, 1,247 rows loaded
✗ customers: FAILED — connection timeout after 30s
✓ products: 0 rows (no changes)
Error details for customers:
psycopg2.OperationalError: could not connect to server:
Connection timed out. Is the server running on host "db.company.com"
and accepting TCP/IP connections on port 5432?

Fix Errors

> The customers extraction failed due to a timeout. Can you increase the timeout and retry?

Claude Code updates the pipeline config and re-runs:

Updated postgres-to-lake pipeline config:
connection_timeout: 30s → 60s
Retrying customers extraction...
✓ customers: 34 new rows → s3://my-company-data/raw/postgres/customers/
Pipeline completed successfully.
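The fix pattern here, bump the timeout and retry, generalizes to a retry loop with escalating timeouts. An illustrative sketch, not DataSpoc code:

```python
import time

def run_with_retry(extract, timeouts=(30, 60, 120)):
    """Try an extraction with progressively longer timeouts.

    `extract` is any callable taking a timeout in seconds; each failure
    escalates to the next timeout instead of failing the whole pipeline.
    """
    last_error = None
    for timeout in timeouts:
        try:
            return extract(timeout)
        except TimeoutError as exc:
            last_error = exc
            time.sleep(1)  # brief pause before escalating
    raise last_error
```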

The AGENT.md Pattern

Create an AGENT.md file in your project to give Claude Code persistent context about your data infrastructure:

# Data Engineering Agent Context
## Infrastructure
- Data lake: s3://my-company-data
- Source database: PostgreSQL at db.company.com:5432/analytics
- Pipelines: dataspoc-pipe (Singer-based)
- Query engine: dataspoc-lens (DuckDB over Parquet)
## Pipelines
| Pipeline | Source | Schedule | Mode |
|----------|--------|----------|------|
| postgres-to-lake | tap-postgres | Every 6h | Incremental |
| stripe-to-lake | tap-stripe | Every 1h | Incremental |
| sheets-to-lake | tap-google-sheets | Daily | Full |
## Common Tasks
- "Refresh the lake" → run all pipelines
- "Check data freshness" → compare latest timestamps in each table
- "Debug pipeline X" → read logs from .dataspoc/logs/X/
## Data Quality Rules
- orders.amount must be > 0
- customers.email must not be null
- All dates must be <= today (no future dates)
## Permissions
- This agent has READ-ONLY access to the data lake
- Pipeline runs require confirmation before execution

Claude Code reads this file automatically and uses it as context for every interaction.
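The quality rules in AGENT.md are simple enough to express as row-level predicates. A hedged sketch of how an agent (or a CI job) might enforce them; the helper itself is illustrative, only the rules come from the file above:

```python
from datetime import date

def check_row(table, row, today=None):
    """Return a list of rule violations for one row, per the AGENT.md rules."""
    today = today or date.today().isoformat()
    violations = []
    if table == "orders" and not row.get("amount", 0) > 0:
        violations.append("orders.amount must be > 0")
    if table == "customers" and row.get("email") is None:
        violations.append("customers.email must not be null")
    sale_date = row.get("sale_date")  # ISO dates compare correctly as strings
    if sale_date is not None and sale_date > today:
        violations.append("no future dates allowed")
    return violations
```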

A Real Session

Here's a complete terminal session showing Claude Code working as a data engineer:

$ claude
> I need to understand why our revenue dashboard shows a dip last Tuesday.
I'll investigate. Let me check the data.
[Calling lens.query: SELECT sale_date, SUM(amount) as daily_revenue
FROM curated_sales
WHERE sale_date BETWEEN '2026-04-07' AND '2026-04-14'
GROUP BY sale_date ORDER BY sale_date]
Here's daily revenue for the past week:
| date | revenue |
|-----------|------------|
| 2026-04-07 | $142,000 |
| 2026-04-08 | $138,000 |
| 2026-04-09 | $23,000 | ← significant drop
| 2026-04-10 | $145,000 |
| 2026-04-11 | $151,000 |
April 9th shows $23K vs a ~$140K average. Let me check if data was
actually ingested that day.
[Calling pipe.logs: postgres-to-lake, date=2026-04-09]
Found the issue. The pipeline on April 9th failed:
Status: FAILED
Error: "disk full on /tmp during Parquet conversion"
Rows extracted: 12,450 (expected ~45,000)
Only partial data was loaded. The remaining ~32,550 orders from April 9th
are missing from the lake.
Recommendation:
1. Clear disk space on the pipeline runner
2. Re-run the pipeline for April 9th with a full extraction
3. Verify the revenue number after re-ingestion
Want me to re-run the pipeline?
> Yes, re-run it.
[Calling pipe.run: postgres-to-lake, start_date=2026-04-09, end_date=2026-04-09, mode=full]
✓ Pipeline completed: 44,892 rows extracted and loaded.
Let me verify the revenue now:
[Calling lens.cache_refresh: curated_sales]
[Calling lens.query: SELECT SUM(amount) FROM curated_sales WHERE sale_date = '2026-04-09']
April 9th revenue is now $141,200 — consistent with other days.
The dashboard dip was caused by incomplete data ingestion, not an
actual revenue drop.
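The agent's first move, comparing each day against the surrounding days, is a standard anomaly check you can also automate. A minimal sketch of that heuristic (the 0.5 threshold is an arbitrary illustration):

```python
def flag_revenue_dips(daily, threshold=0.5):
    """Flag days whose revenue falls below `threshold` x the mean of the other days.

    `daily` maps ISO dates to revenue. Excluding the candidate day from the
    mean keeps one bad day from dragging the baseline down with it.
    """
    flagged = []
    for day, revenue in daily.items():
        others = [v for d, v in daily.items() if d != day]
        if others and revenue < threshold * (sum(others) / len(others)):
            flagged.append(day)
    return flagged
```

Run against the daily numbers from the session above, this flags exactly April 9th.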

Tips for Effective Prompts

Be specific about tables:

# Good
> What's the average order value in curated_sales for Q1?
# Bad
> What's our AOV?

Ask for verification:

> Query the top 10 customers by spend, and cross-check the count
> against raw_orders to make sure no orders are missing.

Chain operations:

> Pull fresh data from Stripe, refresh the cache, then tell me
> how many new subscriptions we got this week.

Ask for explanations:

> Run a query to find customers who haven't ordered in 90 days.
> Explain the SQL you used.

Security Model

Claude Code with DataSpoc MCP is secure by design:

  • Lens MCP is read-only. It rejects any SQL that attempts to write, modify, or delete data.
  • Pipe MCP requires confirmation. Pipeline runs are gated behind user approval.
  • Cloud IAM controls access. The MCP server inherits your cloud credentials. If you can't access a bucket, neither can Claude Code.
  • Every query is SQL. You can review exactly what was executed. No opaque retrieval, no hidden data access.
  • No credentials in the config. The MCP servers use your existing cloud authentication (AWS SSO, gcloud auth, Azure CLI).

Claude Code becomes a powerful data engineering assistant without introducing new security risks. It can only do what you could already do yourself, just faster.
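To make the first point concrete: a read-only guard can be as simple as rejecting any statement whose leading keyword isn't a read-only verb. This is a naive sketch of the idea, not Lens's actual implementation; a production guard would parse the SQL properly rather than inspect a prefix:

```python
READ_ONLY_PREFIXES = ("select", "with", "describe", "show", "explain")

def assert_read_only(sql):
    """Raise if the statement doesn't start with a read-only verb.

    Naive by design: in some dialects a WITH clause can still wrap a
    DML statement, so a real guard needs an actual SQL parser.
    """
    first_word = sql.lstrip().split(None, 1)[0].lower().rstrip(";")
    if first_word not in READ_ONLY_PREFIXES:
        raise PermissionError(f"write operation rejected: {first_word.upper()}")
    return sql
```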
