Using Claude Code as Your Data Engineer: MCP + DataSpoc
Claude Code is a terminal-based AI coding agent. DataSpoc exposes both Pipe (ingestion) and Lens (querying) as MCP servers. Connect them, and Claude Code becomes a data engineer that can ingest data, query your lake, debug pipelines, and fix errors, all from your terminal.
Install the Tools
```shell
pip install dataspoc-pipe[mcp] dataspoc-lens[mcp]
```

This installs both CLIs with MCP server support. Verify:
```shell
dataspoc-pipe mcp --help
dataspoc-lens mcp --help
```

Configure Claude Code
Add the MCP servers to your Claude Code configuration. Create or edit `.claude.json` at the root of your project:
{ "mcpServers": { "dataspoc-lens": { "command": "dataspoc-lens", "args": ["mcp"], "env": { "DATASPOC_BUCKET": "s3://my-company-data" } }, "dataspoc-pipe": { "command": "dataspoc-pipe", "args": ["mcp"], "env": { "DATASPOC_BUCKET": "s3://my-company-data" } } }}Para Azure ou GCS:
{ "mcpServers": { "dataspoc-lens": { "command": "dataspoc-lens", "args": ["mcp"], "env": { "DATASPOC_BUCKET": "gs://my-company-data" } } }}Reinicie o Claude Code. Ele descobrirá as ferramentas MCP automaticamente.
What Claude Code Can Do Now
Discover Tables
Ask Claude Code:
> What tables are in my data lake?

Claude Code calls lens.tables() via MCP and responds:
```
Your data lake has 12 tables:
- raw_postgres_orders (245K rows)
- raw_postgres_customers (18K rows)
- raw_stripe_payments (890K rows)
- curated_sales (245K rows)
- curated_customer_360 (18K rows)
- gold_revenue_by_region (48 rows)
...
```

Explore Schemas
> What columns does curated_sales have?

Claude Code calls lens.schema("curated_sales"):
```
curated_sales schema:
  - order_id: VARCHAR
  - customer_id: VARCHAR
  - product: VARCHAR
  - amount: DOUBLE
  - region: VARCHAR
  - sale_date: DATE
  - payment_method: VARCHAR
```

Run Queries
> What were the top 5 regions by revenue last month?

Claude Code generates and runs SQL:
```sql
SELECT region, SUM(amount) AS revenue
FROM curated_sales
WHERE sale_date >= '2026-03-01' AND sale_date < '2026-04-01'
GROUP BY region
ORDER BY revenue DESC
LIMIT 5
```

Results:

| region        | revenue    |
|---------------|------------|
| North America | $1,240,000 |
| Europe        | $890,000   |
| Asia Pacific  | $670,000   |
| Latin America | $340,000   |
| Middle East   | $120,000   |

Trigger Data Ingestion
> Pull the latest data from Postgres into the lake

Claude Code calls Pipe's MCP tools:
```
Running pipeline: postgres-to-lake
Source: tap-postgres (host: db.company.com, schema: public)
Tables: orders, customers, products
Mode: incremental (bookmark: 2026-04-17T23:00:00Z)

✓ orders: 1,247 new rows → s3://my-company-data/raw/postgres/orders/
✓ customers: 34 new rows → s3://my-company-data/raw/postgres/customers/
✓ products: 0 new rows (no changes)

Pipeline completed in 23 seconds.
```

Read and Debug Logs
> Show me the last pipeline run. Did anything fail?

Claude Code reads the pipeline logs via Pipe's MCP:
```
Last run: 2026-04-15T14:30:00Z — postgres-to-lake
Status: PARTIAL FAILURE

✓ orders: 1,247 rows extracted, 1,247 rows loaded
✗ customers: FAILED — connection timeout after 30s
✓ products: 0 rows (no changes)

Error details for customers:
  psycopg2.OperationalError: could not connect to server:
  Connection timed out. Is the server running on host
  "db.company.com" and accepting TCP/IP connections on port 5432?
```

Fix Errors
> The customers extraction failed due to a timeout. Can you increase the timeout and retry?

Claude Code updates the pipeline config and re-runs:
```
Updated postgres-to-lake pipeline config:
  connection_timeout: 30s → 60s

Retrying customers extraction...
✓ customers: 34 new rows → s3://my-company-data/raw/postgres/customers/

Pipeline completed successfully.
```

The AGENT.md Pattern
Create an AGENT.md file in your project to give Claude Code persistent context about your data infrastructure:
```markdown
# Data Engineering Agent Context

## Infrastructure
- Data lake: s3://my-company-data
- Source database: PostgreSQL at db.company.com:5432/analytics
- Pipelines: dataspoc-pipe (Singer-based)
- Query engine: dataspoc-lens (DuckDB over Parquet)

## Pipelines
| Pipeline | Source | Schedule | Mode |
|----------|--------|----------|------|
| postgres-to-lake | tap-postgres | Every 6h | Incremental |
| stripe-to-lake | tap-stripe | Every 1h | Incremental |
| sheets-to-lake | tap-google-sheets | Daily | Full |

## Common Tasks
- "Refresh the lake" → run all pipelines
- "Check data freshness" → compare latest timestamps in each table
- "Debug pipeline X" → read logs from .dataspoc/logs/X/

## Data Quality Rules
- orders.amount must be > 0
- customers.email must not be null
- All dates must be <= today (no future dates)

## Permissions
- This agent has READ-ONLY access to the data lake
- Pipeline runs require confirmation before execution
```

Claude Code reads this file automatically and uses it as context for every interaction.
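Data quality rules like the three above are simple enough to express as standalone checks that Claude Code (or a CI job) could run against extracted rows. A sketch in pure Python, with hypothetical sample rows and a fixed "today" so the example is deterministic:

```python
from datetime import date

# Hypothetical sample rows; field names follow the AGENT.md rules above.
orders = [
    {"order_id": "o1", "amount": 42.5, "sale_date": date(2026, 4, 9)},
    {"order_id": "o2", "amount": 10.0, "sale_date": date(2026, 4, 10)},
]
customers = [{"customer_id": "c1", "email": "a@example.com"}]

def quality_violations(orders, customers, today=date(2026, 4, 18)):
    """Apply the three AGENT.md data-quality rules and report violations."""
    violations = []
    for o in orders:
        if not o["amount"] > 0:
            violations.append(f"orders.amount <= 0: {o['order_id']}")
        if o["sale_date"] > today:
            violations.append(f"future sale_date: {o['order_id']}")
    for c in customers:
        if c["email"] is None:
            violations.append(f"null email: {c['customer_id']}")
    return violations

print(quality_violations(orders, customers))  # [] → all rules pass
```

In practice the same rules could be pushed down as SQL (e.g. `WHERE amount <= 0`) so the lake is checked without pulling rows locally.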
A Real Session Example
Here is a complete terminal session showing Claude Code working as a data engineer:
$ claude
> I need to understand why our revenue dashboard shows a dip last Tuesday.
I'll investigate. Let me check the data.
[Calling lens.query: SELECT sale_date, SUM(amount) as daily_revenue FROM curated_sales WHERE sale_date BETWEEN '2026-04-07' AND '2026-04-14' GROUP BY sale_date ORDER BY sale_date]
Here's daily revenue for the past week:
| date       | revenue  |
|------------|----------|
| 2026-04-07 | $142,000 |
| 2026-04-08 | $138,000 |
| 2026-04-09 | $23,000  | ← significant drop
| 2026-04-10 | $145,000 |
| 2026-04-11 | $151,000 |
April 9th shows $23K vs a ~$140K average. Let me check if data was actually ingested that day.
[Calling pipe.logs: postgres-to-lake, date=2026-04-09]
Found the issue. The pipeline on April 9th failed:
  Status: FAILED
  Error: "disk full on /tmp during Parquet conversion"
  Rows extracted: 12,450 (expected ~45,000)
Only partial data was loaded. The remaining ~32,550 orders from April 9th are missing from the lake.
Recommendation:
1. Clear disk space on the pipeline runner
2. Re-run the pipeline for April 9th with a full extraction
3. Verify the revenue number after re-ingestion
Want me to re-run the pipeline?
> Yes, re-run it.
[Calling pipe.run: postgres-to-lake, start_date=2026-04-09, end_date=2026-04-09, mode=full]
✓ Pipeline completed: 44,892 rows extracted and loaded.
Let me verify the revenue now:
[Calling lens.cache_refresh: curated_sales]
[Calling lens.query: SELECT SUM(amount) FROM curated_sales WHERE sale_date = '2026-04-09']
April 9th revenue is now $141,200 — consistent with other days. The dashboard dip was caused by incomplete data ingestion, not an actual revenue drop.

Tips for Effective Prompts
Be specific about tables:
```
# Good
> What's the average order value in curated_sales for Q1?
```
```
# Bad
> What's our AOV?
```

Ask for verification:
> Query the top 10 customers by spend, and cross-check the count
> against raw_orders to make sure no orders are missing.

Chain operations:
> Pull fresh data from Stripe, refresh the cache, then tell me
> how many new subscriptions we got this week.

Ask for explanations:
> Run a query to find customers who haven't ordered in 90 days.
> Explain the SQL you used.

Security Model
Claude Code with DataSpoc MCP is secure by design:
- Lens MCP is read-only. It rejects any SQL that tries to write, modify, or delete data.
- Pipe MCP requires confirmation. Pipeline runs are gated on user approval.
- Cloud IAM controls access. The MCP server inherits your cloud credentials. If you can't reach a bucket, neither can Claude Code.
- Every query is SQL. You can review exactly what was executed. No opaque retrieval or hidden data access.
- No credentials in the config. The MCP servers use your existing cloud auth (AWS SSO, gcloud auth, Azure CLI).
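The read-only guarantee can be pictured as a statement filter in front of the query engine. The sketch below is an illustrative approximation, not DataSpoc's actual implementation (a real guard would parse the SQL rather than pattern-match keywords):

```python
import re

# Keywords that begin a mutating or DDL statement. Pattern-matching is a
# simplification; a production guard would use a real SQL parser.
FORBIDDEN = re.compile(
    r"^\s*(INSERT|UPDATE|DELETE|DROP|ALTER|CREATE|TRUNCATE|COPY|ATTACH)\b",
    re.IGNORECASE,
)

def is_read_only(sql: str) -> bool:
    """Reject any statement in the input that starts with a mutating keyword."""
    statements = [s for s in sql.split(";") if s.strip()]
    return not any(FORBIDDEN.match(s) for s in statements)

print(is_read_only("SELECT * FROM curated_sales"))   # True
print(is_read_only("DROP TABLE curated_sales"))      # False
print(is_read_only("SELECT 1; DELETE FROM orders"))  # False
```

Note the multi-statement case: splitting on `;` ensures a write cannot hide behind an innocent-looking `SELECT`.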
Claude Code becomes a powerful data engineering assistant without introducing new security risks. It can only do what you could already do yourself, just faster.