
Using Claude Code as Your Data Engineer: MCP + DataSpoc

Michael San Martim · 2026-04-18

Claude Code is a terminal-based coding agent. DataSpoc exposes both Pipe (ingestion) and Lens (queries) as MCP servers. Connect them and Claude Code becomes a data engineer that can ingest data, query your lake, debug pipelines, and fix errors, all from your terminal.

Install the Tools

pip install dataspoc-pipe[mcp] dataspoc-lens[mcp]

This installs both CLIs with MCP server support. Verify:

dataspoc-pipe mcp --help
dataspoc-lens mcp --help

Configure Claude Code

Add the MCP servers to your Claude Code configuration. Create or edit .claude.json in your project root:

{
  "mcpServers": {
    "dataspoc-lens": {
      "command": "dataspoc-lens",
      "args": ["mcp"],
      "env": {
        "DATASPOC_BUCKET": "s3://my-company-data"
      }
    },
    "dataspoc-pipe": {
      "command": "dataspoc-pipe",
      "args": ["mcp"],
      "env": {
        "DATASPOC_BUCKET": "s3://my-company-data"
      }
    }
  }
}

For Azure or GCS, point DATASPOC_BUCKET at the matching URI (GCS shown here):

{
  "mcpServers": {
    "dataspoc-lens": {
      "command": "dataspoc-lens",
      "args": ["mcp"],
      "env": {
        "DATASPOC_BUCKET": "gs://my-company-data"
      }
    }
  }
}

Restart Claude Code. It will discover the MCP tools automatically.

What Claude Code Can Do Now

Discover Tables

Ask Claude Code:

> What tables are in my data lake?

Claude Code calls lens.tables() via MCP and responds:

Your data lake has 12 tables:
- raw_postgres_orders (245K rows)
- raw_postgres_customers (18K rows)
- raw_stripe_payments (890K rows)
- curated_sales (245K rows)
- curated_customer_360 (18K rows)
- gold_revenue_by_region (48 rows)
...
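Under the hood this is a JSON-RPC exchange: Claude Code asks each server what tools it offers, then calls them by name. A hypothetical shape of a `tools/list` response from the dataspoc-lens server (the tool names are illustrative, inferred from the calls shown in this post):

```python
import json

# Hypothetical MCP tools/list response from dataspoc-lens.
# The method name follows the MCP JSON-RPC convention; the tool
# names and descriptions here are assumptions for illustration.
raw = """
{
  "jsonrpc": "2.0",
  "id": 1,
  "result": {
    "tools": [
      {"name": "tables", "description": "List tables in the lake"},
      {"name": "schema", "description": "Describe a table's columns"},
      {"name": "query",  "description": "Run read-only SQL"}
    ]
  }
}
"""
response = json.loads(raw)
tool_names = [t["name"] for t in response["result"]["tools"]]
print(tool_names)  # ['tables', 'schema', 'query']
```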

Explore Schemas

> What columns does curated_sales have?

Claude Code calls lens.schema("curated_sales"):

curated_sales schema:
- order_id: VARCHAR
- customer_id: VARCHAR
- product: VARCHAR
- amount: DOUBLE
- region: VARCHAR
- sale_date: DATE
- payment_method: VARCHAR

Run Queries

> What were the top 5 regions by revenue last month?

Claude Code generates and runs SQL:

SELECT region, SUM(amount) as revenue
FROM curated_sales
WHERE sale_date >= '2026-03-01' AND sale_date < '2026-04-01'
GROUP BY region
ORDER BY revenue DESC
LIMIT 5
Results:
| region | revenue |
|---------------|-------------|
| North America | $1,240,000 |
| Europe | $890,000 |
| Asia Pacific | $670,000 |
| Latin America | $340,000 |
| Middle East | $120,000 |
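Because Lens is just DuckDB over Parquet, every answer like this is plain SQL you can reproduce yourself. A minimal sketch of the same aggregation using stdlib sqlite3 on made-up sample rows (the real engine, schema, and data differ):

```python
import sqlite3

# Toy stand-in for curated_sales; amounts are invented sample data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE curated_sales (region TEXT, amount REAL, sale_date TEXT)")
conn.executemany(
    "INSERT INTO curated_sales VALUES (?, ?, ?)",
    [
        ("North America", 1000.0, "2026-03-05"),
        ("Europe",         800.0, "2026-03-10"),
        ("North America",  240.0, "2026-03-20"),
        ("Asia Pacific",   670.0, "2026-03-15"),
        ("Europe",          90.0, "2026-04-02"),  # outside the window, excluded
    ],
)

# Same query shape Claude Code generated above.
top = conn.execute("""
    SELECT region, SUM(amount) AS revenue
    FROM curated_sales
    WHERE sale_date >= '2026-03-01' AND sale_date < '2026-04-01'
    GROUP BY region
    ORDER BY revenue DESC
    LIMIT 5
""").fetchall()
print(top)  # [('North America', 1240.0), ('Europe', 800.0), ('Asia Pacific', 670.0)]
```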

Trigger Data Ingestion

> Pull the latest data from Postgres into the lake

Claude Code calls Pipe's MCP tools:

Running pipeline: postgres-to-lake
Source: tap-postgres (host: db.company.com, schema: public)
Tables: orders, customers, products
Mode: incremental (bookmark: 2026-04-17T23:00:00Z)
✓ orders: 1,247 new rows → s3://my-company-data/raw/postgres/orders/
✓ customers: 34 new rows → s3://my-company-data/raw/postgres/customers/
✓ products: 0 new rows (no changes)
Pipeline completed in 23 seconds.
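The `incremental (bookmark: ...)` mode above means Pipe only extracts rows changed since the last run's high-water mark, then advances that mark. A rough sketch of the idea (not Pipe's actual implementation):

```python
def incremental_extract(rows, bookmark):
    """Return rows updated after the bookmark, plus the advanced bookmark.

    ISO-8601 UTC timestamps compare correctly as strings, so no parsing
    is needed for this toy version.
    """
    fresh = [r for r in rows if r["updated_at"] > bookmark]
    new_bookmark = max((r["updated_at"] for r in fresh), default=bookmark)
    return fresh, new_bookmark

# Invented sample rows around the bookmark shown in the run output.
rows = [
    {"id": 1, "updated_at": "2026-04-17T22:00:00Z"},  # before bookmark, skipped
    {"id": 2, "updated_at": "2026-04-17T23:30:00Z"},
    {"id": 3, "updated_at": "2026-04-18T01:00:00Z"},
]
fresh, bookmark = incremental_extract(rows, "2026-04-17T23:00:00Z")
print(len(fresh), bookmark)  # 2 2026-04-18T01:00:00Z
```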

Read and Debug Logs

> Show me the last pipeline run. Did anything fail?

Claude Code reads the pipeline logs via Pipe's MCP server:

Last run: 2026-04-15T14:30:00Z — postgres-to-lake
Status: PARTIAL FAILURE
✓ orders: 1,247 rows extracted, 1,247 rows loaded
✗ customers: FAILED — connection timeout after 30s
✓ products: 0 rows (no changes)
Error details for customers:
psycopg2.OperationalError: could not connect to server:
Connection timed out. Is the server running on host "db.company.com"
and accepting TCP/IP connections on port 5432?
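Triage like this is easy to script too. A toy example that scans a run log for failed steps (the log format here is assumed from the output above, not Pipe's real format):

```python
# Assumed per-step log format: "<step>: <STATUS> <details>"
LOG = """\
orders: OK 1247 rows
customers: FAILED connection timeout after 30s
products: OK 0 rows
"""

def failed_steps(log_text):
    """Collect step names whose status line reports FAILED."""
    failures = []
    for line in log_text.splitlines():
        name, _, rest = line.partition(":")
        if "FAILED" in rest:
            failures.append(name.strip())
    return failures

print(failed_steps(LOG))  # ['customers']
```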

Fix Errors

> The customers extraction failed due to a timeout. Can you increase the timeout and retry?

Claude Code updates the pipeline config and re-runs it:

Updated postgres-to-lake pipeline config:
connection_timeout: 30s → 60s
Retrying customers extraction...
✓ customers: 34 new rows → s3://my-company-data/raw/postgres/customers/
Pipeline completed successfully.
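In a hypothetical pipeline config file, a change like that might look as follows (the field names and layout here are illustrative assumptions, not dataspoc-pipe's real schema):

```yaml
# pipelines/postgres-to-lake.yaml (hypothetical layout)
name: postgres-to-lake
source:
  tap: tap-postgres
  host: db.company.com
  connection_timeout: 60  # was 30; raised after repeated timeouts
tables:
  - orders
  - customers
  - products
mode: incremental
```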

The AGENT.md Pattern

Create an AGENT.md file in your project to give Claude Code persistent context about your data infrastructure:

# Data Engineering Agent Context
## Infrastructure
- Data lake: s3://my-company-data
- Source database: PostgreSQL at db.company.com:5432/analytics
- Pipelines: dataspoc-pipe (Singer-based)
- Query engine: dataspoc-lens (DuckDB over Parquet)
## Pipelines
| Pipeline | Source | Schedule | Mode |
|----------|--------|----------|------|
| postgres-to-lake | tap-postgres | Every 6h | Incremental |
| stripe-to-lake | tap-stripe | Every 1h | Incremental |
| sheets-to-lake | tap-google-sheets | Daily | Full |
## Common Tasks
- "Refresh the lake" → run all pipelines
- "Check data freshness" → compare latest timestamps in each table
- "Debug pipeline X" → read logs from .dataspoc/logs/X/
## Data Quality Rules
- orders.amount must be > 0
- customers.email must not be null
- All dates must be <= today (no future dates)
## Permissions
- This agent has READ-ONLY access to the data lake
- Pipeline runs require confirmation before execution

Claude Code reads this file automatically and uses it as context for every interaction.
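The Data Quality Rules section is concrete enough for the agent to turn into checks on demand. A sketch of what those three rules could look like as a script (column names taken from the AGENT.md above; the row dicts are invented):

```python
from datetime import date

def quality_violations(orders, customers, today=date(2026, 4, 18)):
    """Check the three AGENT.md rules; return human-readable violations."""
    violations = []
    for o in orders:
        if o["amount"] <= 0:
            violations.append(f"orders/{o['order_id']}: amount must be > 0")
        if o["sale_date"] > today:
            violations.append(f"orders/{o['order_id']}: sale_date is in the future")
    for c in customers:
        if not c["email"]:
            violations.append(f"customers/{c['customer_id']}: email must not be null")
    return violations

# Invented sample rows: o2 breaks two rules, c1 breaks one.
orders = [
    {"order_id": "o1", "amount": 49.0, "sale_date": date(2026, 4, 10)},
    {"order_id": "o2", "amount": -5.0, "sale_date": date(2026, 5, 1)},
]
customers = [{"customer_id": "c1", "email": None}]
print(quality_violations(orders, customers))
```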

Real Session Example

Here is a complete terminal session showing Claude Code working as a data engineer:

$ claude
> I need to understand why our revenue dashboard shows a dip last Tuesday.
I'll investigate. Let me check the data.
[Calling lens.query: SELECT sale_date, SUM(amount) as daily_revenue
FROM curated_sales
WHERE sale_date BETWEEN '2026-04-07' AND '2026-04-14'
GROUP BY sale_date ORDER BY sale_date]
Here's daily revenue for the past week:
| date | revenue |
|-----------|------------|
| 2026-04-07 | $142,000 |
| 2026-04-08 | $138,000 |
| 2026-04-09 | $23,000 | ← significant drop
| 2026-04-10 | $145,000 |
| 2026-04-11 | $151,000 |
April 9th shows $23K vs a ~$140K average. Let me check if data was
actually ingested that day.
[Calling pipe.logs: postgres-to-lake, date=2026-04-09]
Found the issue. The pipeline on April 9th failed:
Status: FAILED
Error: "disk full on /tmp during Parquet conversion"
Rows extracted: 12,450 (expected ~45,000)
Only partial data was loaded. The remaining ~32,550 orders from April 9th
are missing from the lake.
Recommendation:
1. Clear disk space on the pipeline runner
2. Re-run the pipeline for April 9th with a full extraction
3. Verify the revenue number after re-ingestion
Want me to re-run the pipeline?
> Yes, re-run it.
[Calling pipe.run: postgres-to-lake, start_date=2026-04-09, end_date=2026-04-09, mode=full]
✓ Pipeline completed: 44,892 rows extracted and loaded.
Let me verify the revenue now:
[Calling lens.cache_refresh: curated_sales]
[Calling lens.query: SELECT SUM(amount) FROM curated_sales WHERE sale_date = '2026-04-09']
April 9th revenue is now $141,200 — consistent with other days.
The dashboard dip was caused by incomplete data ingestion, not an
actual revenue drop.

Tips for Effective Prompts

Be specific about tables:

# Good
> What's the average order value in curated_sales for Q1?
# Bad
> What's our AOV?

Ask for verification:

> Query the top 10 customers by spend, and cross-check the count
> against raw_orders to make sure no orders are missing.

Chain operations:

> Pull fresh data from Stripe, refresh the cache, then tell me
> how many new subscriptions we got this week.

Ask for explanations:

> Run a query to find customers who haven't ordered in 90 days.
> Explain the SQL you used.

Security Model

Claude Code with DataSpoc MCP is secure by design:

  • Lens MCP is read-only. It rejects any SQL that tries to write, modify, or delete data.
  • Pipe MCP requires confirmation. Pipeline runs are gated on user approval.
  • Cloud IAM controls access. The MCP server inherits your cloud credentials. If you can't reach a bucket, neither can Claude Code.
  • Every query is SQL. You can review exactly what ran. No opaque retrieval, no hidden data access.
  • No credentials in the config. The MCP servers use your existing cloud authentication (AWS SSO, gcloud auth, Azure CLI).
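The read-only guarantee doesn't have to be taken on faith; a guard of that kind is simple to build. A naive sketch of the idea (Lens's real validation is not documented here; this is only an illustration, and a keyword filter like this is deliberately over-strict):

```python
import re

# Reject anything containing a write/DDL keyword, even inside strings.
WRITE_KEYWORDS = re.compile(
    r"\b(INSERT|UPDATE|DELETE|DROP|ALTER|CREATE|TRUNCATE|MERGE|COPY|ATTACH)\b",
    re.IGNORECASE,
)

def is_read_only(sql: str) -> bool:
    """Allow only statements that start with SELECT or WITH and
    contain no write keywords anywhere."""
    stripped = sql.strip().rstrip(";")
    if not re.match(r"^(SELECT|WITH)\b", stripped, re.IGNORECASE):
        return False
    return not WRITE_KEYWORDS.search(stripped)

print(is_read_only("SELECT * FROM curated_sales"))  # True
print(is_read_only("DROP TABLE curated_sales"))     # False
```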

Claude Code becomes a powerful data engineering assistant without introducing new security risks. It can only do what you could already do yourself, just faster.
