
Using Claude Code as Your Data Engineer: MCP + DataSpoc

Michael San Martim · 2026-04-18

Claude Code is a terminal-based AI coding agent. DataSpoc exposes both Pipe (ingestion) and Lens (querying) as MCP servers. Connect the two and Claude Code becomes a data engineer that can ingest data, query your lake, debug pipelines, and fix errors, all from your terminal.

Install the Tools

Terminal window
pip install dataspoc-pipe[mcp] dataspoc-lens[mcp]

This installs both CLIs with MCP server support. Verify:

Terminal window
dataspoc-pipe mcp --help
dataspoc-lens mcp --help

Configure Claude Code

Add the MCP servers to your Claude Code configuration. Create or edit .claude.json in your project root:

{
"mcpServers": {
"dataspoc-lens": {
"command": "dataspoc-lens",
"args": ["mcp"],
"env": {
"DATASPOC_BUCKET": "s3://my-company-data"
}
},
"dataspoc-pipe": {
"command": "dataspoc-pipe",
"args": ["mcp"],
"env": {
"DATASPOC_BUCKET": "s3://my-company-data"
}
}
}
}

For Azure or GCS:

{
"mcpServers": {
"dataspoc-lens": {
"command": "dataspoc-lens",
"args": ["mcp"],
"env": {
"DATASPOC_BUCKET": "gs://my-company-data"
}
}
}
}
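The snippet above shows a gs:// URI for GCS. Assuming DataSpoc follows the same pattern for Azure Blob Storage, only the bucket URI would change; the az:// scheme below is an assumption on my part, so check DataSpoc's docs for the exact form:

```json
{
  "mcpServers": {
    "dataspoc-lens": {
      "command": "dataspoc-lens",
      "args": ["mcp"],
      "env": {
        "DATASPOC_BUCKET": "az://my-company-data"
      }
    }
  }
}
```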

Restart Claude Code. It will discover the MCP tools automatically.

What Claude Code Can Do Now

Discover Tables

Ask Claude Code:

> What tables are in my data lake?

Claude Code calls lens.tables() via MCP and responds:

Your data lake has 12 tables:
- raw_postgres_orders (245K rows)
- raw_postgres_customers (18K rows)
- raw_stripe_payments (890K rows)
- curated_sales (245K rows)
- curated_customer_360 (18K rows)
- gold_revenue_by_region (48 rows)
...
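For the curious: under the hood, an MCP tool call is a JSON-RPC 2.0 request sent over stdio to the server process. A sketch of what the request behind this answer might look like (the exact tool name Lens registers is an assumption here):

```json
{
  "jsonrpc": "2.0",
  "id": 1,
  "method": "tools/call",
  "params": {
    "name": "tables",
    "arguments": {}
  }
}
```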

Explore Schemas

> What columns does curated_sales have?

Claude Code calls lens.schema("curated_sales"):

curated_sales schema:
- order_id: VARCHAR
- customer_id: VARCHAR
- product: VARCHAR
- amount: DOUBLE
- region: VARCHAR
- sale_date: DATE
- payment_method: VARCHAR

Run Queries

> What were the top 5 regions by revenue last month?

Claude Code generates and runs the SQL:

SELECT region, SUM(amount) as revenue
FROM curated_sales
WHERE sale_date >= '2026-03-01' AND sale_date < '2026-04-01'
GROUP BY region
ORDER BY revenue DESC
LIMIT 5
Results:
| region | revenue |
|---------------|-------------|
| North America | $1,240,000 |
| Europe | $890,000 |
| Asia Pacific | $670,000 |
| Latin America | $340,000 |
| Middle East | $120,000 |

Trigger Data Ingestion

> Pull the latest data from Postgres into the lake

Claude Code calls Pipe's MCP tools:

Running pipeline: postgres-to-lake
Source: tap-postgres (host: db.company.com, schema: public)
Tables: orders, customers, products
Mode: incremental (bookmark: 2026-04-17T23:00:00Z)
✓ orders: 1,247 new rows → s3://my-company-data/raw/postgres/orders/
✓ customers: 34 new rows → s3://my-company-data/raw/postgres/customers/
✓ products: 0 new rows (no changes)
Pipeline completed in 23 seconds.
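Incremental mode works the Singer way: the pipeline persists a bookmark (the high-water-mark timestamp of the last run) and extracts only rows newer than it. A minimal sketch of that logic; the file name and function names are illustrative, not DataSpoc's actual API:

```python
import json
from pathlib import Path

STATE_FILE = Path("state.json")  # illustrative state location

def load_bookmark(default="1970-01-01T00:00:00Z"):
    """Return the last saved high-water mark, or a default for the first run."""
    if STATE_FILE.exists():
        return json.loads(STATE_FILE.read_text())["bookmark"]
    return default

def extract_incremental(rows, bookmark):
    """Keep only rows strictly newer than the bookmark.

    ISO-8601 timestamps compare correctly as strings, so plain > works.
    """
    new_rows = [r for r in rows if r["updated_at"] > bookmark]
    if new_rows:
        # Advance the bookmark to the newest row seen and persist it.
        newest = max(r["updated_at"] for r in new_rows)
        STATE_FILE.write_text(json.dumps({"bookmark": newest}))
    return new_rows
```

Each run then picks up where the previous one left off: the next call to load_bookmark() returns the newest timestamp the last extraction saw.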

Read and Debug Logs

> Show me the last pipeline run. Did anything fail?

Claude Code reads the pipeline logs via Pipe's MCP server:

Last run: 2026-04-15T14:30:00Z — postgres-to-lake
Status: PARTIAL FAILURE
✓ orders: 1,247 rows extracted, 1,247 rows loaded
✗ customers: FAILED — connection timeout after 30s
✓ products: 0 rows (no changes)
Error details for customers:
psycopg2.OperationalError: could not connect to server:
Connection timed out. Is the server running on host "db.company.com"
and accepting TCP/IP connections on port 5432?

Fix Errors

> The customers extraction failed due to a timeout. Can you increase the timeout and retry?

Claude Code updates the pipeline config and re-runs:

Updated postgres-to-lake pipeline config:
connection_timeout: 30s → 60s
Retrying customers extraction...
✓ customers: 34 new rows → s3://my-company-data/raw/postgres/customers/
Pipeline completed successfully.
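The fix pattern here, bump the timeout and retry, generalizes to a retry loop with escalating timeouts. An illustrative sketch, not DataSpoc code:

```python
import time

def run_with_retry(extract, timeouts=(30, 60, 120)):
    """Try an extraction with progressively longer timeouts.

    `extract` is any callable taking a timeout in seconds; each failure
    escalates to the next timeout instead of failing the whole pipeline.
    """
    last_error = None
    for timeout in timeouts:
        try:
            return extract(timeout)
        except TimeoutError as exc:
            last_error = exc
            time.sleep(1)  # brief pause before escalating
    raise last_error
```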

The AGENT.md Pattern

Create an AGENT.md file in your project to give Claude Code persistent context about your data infrastructure:

# Data Engineering Agent Context
## Infrastructure
- Data lake: s3://my-company-data
- Source database: PostgreSQL at db.company.com:5432/analytics
- Pipelines: dataspoc-pipe (Singer-based)
- Query engine: dataspoc-lens (DuckDB over Parquet)
## Pipelines
| Pipeline | Source | Schedule | Mode |
|----------|--------|----------|------|
| postgres-to-lake | tap-postgres | Every 6h | Incremental |
| stripe-to-lake | tap-stripe | Every 1h | Incremental |
| sheets-to-lake | tap-google-sheets | Daily | Full |
## Common Tasks
- "Refresh the lake" → run all pipelines
- "Check data freshness" → compare latest timestamps in each table
- "Debug pipeline X" → read logs from .dataspoc/logs/X/
## Data Quality Rules
- orders.amount must be > 0
- customers.email must not be null
- All dates must be <= today (no future dates)
## Permissions
- This agent has READ-ONLY access to the data lake
- Pipeline runs require confirmation before execution

Claude Code reads this file automatically and uses it as context for every interaction.
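The quality rules in AGENT.md are simple enough to express as row-level predicates. A hedged sketch of how an agent (or a CI job) might enforce them; the helper itself is illustrative, only the rules come from the file above:

```python
from datetime import date

def check_row(table, row, today=None):
    """Return a list of rule violations for one row, per the AGENT.md rules."""
    today = today or date.today().isoformat()
    violations = []
    if table == "orders" and not row.get("amount", 0) > 0:
        violations.append("orders.amount must be > 0")
    if table == "customers" and row.get("email") is None:
        violations.append("customers.email must not be null")
    sale_date = row.get("sale_date")  # ISO dates compare correctly as strings
    if sale_date is not None and sale_date > today:
        violations.append("no future dates allowed")
    return violations
```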

A Real Session

Here's a complete terminal session showing Claude Code working as a data engineer:

$ claude
> I need to understand why our revenue dashboard shows a dip last Tuesday.
I'll investigate. Let me check the data.
[Calling lens.query: SELECT sale_date, SUM(amount) as daily_revenue
FROM curated_sales
WHERE sale_date BETWEEN '2026-04-07' AND '2026-04-14'
GROUP BY sale_date ORDER BY sale_date]
Here's daily revenue for the past week:
| date | revenue |
|-----------|------------|
| 2026-04-07 | $142,000 |
| 2026-04-08 | $138,000 |
| 2026-04-09 | $23,000 | ← significant drop
| 2026-04-10 | $145,000 |
| 2026-04-11 | $151,000 |
April 9th shows $23K vs a ~$140K average. Let me check if data was
actually ingested that day.
[Calling pipe.logs: postgres-to-lake, date=2026-04-09]
Found the issue. The pipeline on April 9th failed:
Status: FAILED
Error: "disk full on /tmp during Parquet conversion"
Rows extracted: 12,450 (expected ~45,000)
Only partial data was loaded. The remaining ~32,550 orders from April 9th
are missing from the lake.
Recommendation:
1. Clear disk space on the pipeline runner
2. Re-run the pipeline for April 9th with a full extraction
3. Verify the revenue number after re-ingestion
Want me to re-run the pipeline?
> Yes, re-run it.
[Calling pipe.run: postgres-to-lake, start_date=2026-04-09, end_date=2026-04-09, mode=full]
✓ Pipeline completed: 44,892 rows extracted and loaded.
Let me verify the revenue now:
[Calling lens.cache_refresh: curated_sales]
[Calling lens.query: SELECT SUM(amount) FROM curated_sales WHERE sale_date = '2026-04-09']
April 9th revenue is now $141,200 — consistent with other days.
The dashboard dip was caused by incomplete data ingestion, not an
actual revenue drop.
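The agent's first move, comparing each day against the surrounding days, is a standard anomaly check you can also automate. A minimal sketch of that heuristic (the 0.5 threshold is an arbitrary illustration):

```python
def flag_revenue_dips(daily, threshold=0.5):
    """Flag days whose revenue falls below `threshold` x the mean of the other days.

    `daily` maps ISO dates to revenue. Excluding the candidate day from the
    mean keeps one bad day from dragging the baseline down with it.
    """
    flagged = []
    for day, revenue in daily.items():
        others = [v for d, v in daily.items() if d != day]
        if others and revenue < threshold * (sum(others) / len(others)):
            flagged.append(day)
    return flagged
```

Run against the daily numbers from the session above, this flags exactly April 9th.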

Tips for Effective Prompts

Be specific about tables:

# Good
> What's the average order value in curated_sales for Q1?
# Bad
> What's our AOV?

Ask for verification:

> Query the top 10 customers by spend, and cross-check the count
> against raw_orders to make sure no orders are missing.

Chain operations:

> Pull fresh data from Stripe, refresh the cache, then tell me
> how many new subscriptions we got this week.

Ask for explanations:

> Run a query to find customers who haven't ordered in 90 days.
> Explain the SQL you used.

Security Model

Claude Code with DataSpoc MCP is secure by design:

  • Lens MCP is read-only. It rejects any SQL that attempts to write, modify, or delete data.
  • Pipe MCP requires confirmation. Pipeline runs are gated behind user approval.
  • Cloud IAM controls access. The MCP server inherits your cloud credentials. If you can't access a bucket, neither can Claude Code.
  • Every query is SQL. You can review exactly what was executed. No opaque retrieval, no hidden data access.
  • No credentials in the config. The MCP servers use your existing cloud authentication (AWS SSO, gcloud auth, Azure CLI).

Claude Code becomes a powerful data engineering assistant without introducing new security risks. It can only do what you could already do yourself, just faster.
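To make the first point concrete: a read-only guard can be as simple as rejecting any statement whose leading keyword isn't a read-only verb. This is a naive sketch of the idea, not Lens's actual implementation; a production guard would parse the SQL properly rather than inspect a prefix:

```python
READ_ONLY_PREFIXES = ("select", "with", "describe", "show", "explain")

def assert_read_only(sql):
    """Raise if the statement doesn't start with a read-only verb.

    Naive by design: in some dialects a WITH clause can still wrap a
    DML statement, so a real guard needs an actual SQL parser.
    """
    first_word = sql.lstrip().split(None, 1)[0].lower().rstrip(";")
    if first_word not in READ_ONLY_PREFIXES:
        raise PermissionError(f"write operation rejected: {first_word.upper()}")
    return sql
```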
