# Using Claude Code as Your Data Engineer: MCP + DataSpoc
Claude Code is a terminal-based coding agent. DataSpoc exposes both Pipe (ingestion) and Lens (queries) as MCP servers. Connect them, and Claude Code becomes a data engineer that can ingest data, query your lake, debug pipelines, and fix errors, all from your terminal.
## Install the Tools
```bash
pip install dataspoc-pipe[mcp] dataspoc-lens[mcp]
```

This installs both CLIs with MCP server support. Verify:
```bash
dataspoc-pipe mcp --help
dataspoc-lens mcp --help
```

## Configure Claude Code
Add the MCP servers to your Claude Code configuration. Create or edit `.claude.json` at the root of your project:
```json
{
  "mcpServers": {
    "dataspoc-lens": {
      "command": "dataspoc-lens",
      "args": ["mcp"],
      "env": { "DATASPOC_BUCKET": "s3://my-company-data" }
    },
    "dataspoc-pipe": {
      "command": "dataspoc-pipe",
      "args": ["mcp"],
      "env": { "DATASPOC_BUCKET": "s3://my-company-data" }
    }
  }
}
```

For Azure or GCS:
```json
{
  "mcpServers": {
    "dataspoc-lens": {
      "command": "dataspoc-lens",
      "args": ["mcp"],
      "env": { "DATASPOC_BUCKET": "gs://my-company-data" }
    }
  }
}
```

Restart Claude Code. It will discover the MCP tools automatically.
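A quick sanity check on the config file can save a confusing debugging session later. Here is a minimal sketch that validates the `mcpServers` layout shown above; the required keys and accepted bucket schemes are assumptions based on this article's examples, not a documented contract:

```python
import json

# Minimal sanity check for a .claude.json MCP config.
# The "mcpServers" layout and the s3://gs:// bucket schemes follow
# the examples in this article; adjust if your setup differs.
REQUIRED_KEYS = {"command", "args"}

def validate_mcp_config(raw: str) -> list[str]:
    """Return a list of problems found in the config (empty means OK)."""
    problems = []
    servers = json.loads(raw).get("mcpServers", {})
    if not servers:
        problems.append("no mcpServers defined")
    for name, spec in servers.items():
        missing = REQUIRED_KEYS - spec.keys()
        if missing:
            problems.append(f"{name}: missing {sorted(missing)}")
        bucket = spec.get("env", {}).get("DATASPOC_BUCKET", "")
        if not bucket.startswith(("s3://", "gs://")):
            problems.append(f"{name}: DATASPOC_BUCKET is not an s3:// or gs:// URI")
    return problems

sample = """{"mcpServers": {"dataspoc-lens": {
    "command": "dataspoc-lens", "args": ["mcp"],
    "env": {"DATASPOC_BUCKET": "s3://my-company-data"}}}}"""
print(validate_mcp_config(sample))  # → []
```

Run it against your `.claude.json` before restarting Claude Code; an empty list means the structure matches the examples above.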
## What Claude Code Can Do Now
### Discover Tables
Ask Claude Code:
```
> What tables are in my data lake?
```

Claude Code calls `lens.tables()` via MCP and responds:
```
Your data lake has 12 tables:
- raw_postgres_orders (245K rows)
- raw_postgres_customers (18K rows)
- raw_stripe_payments (890K rows)
- curated_sales (245K rows)
- curated_customer_360 (18K rows)
- gold_revenue_by_region (48 rows)
...
```

### Explore Schemas
```
> What columns does curated_sales have?
```

Claude Code calls `lens.schema("curated_sales")`:
```
curated_sales schema:
  - order_id: VARCHAR
  - customer_id: VARCHAR
  - product: VARCHAR
  - amount: DOUBLE
  - region: VARCHAR
  - sale_date: DATE
  - payment_method: VARCHAR
```

### Run Queries
```
> What were the top 5 regions by revenue last month?
```

Claude Code generates and runs SQL:
```sql
SELECT region, SUM(amount) AS revenue
FROM curated_sales
WHERE sale_date >= '2026-03-01' AND sale_date < '2026-04-01'
GROUP BY region
ORDER BY revenue DESC
LIMIT 5
```

Results:

| region        | revenue    |
|---------------|------------|
| North America | $1,240,000 |
| Europe        | $890,000   |
| Asia Pacific  | $670,000   |
| Latin America | $340,000   |
| Middle East   | $120,000   |

### Trigger Data Ingestion
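Incremental pipelines track a bookmark, the timestamp of the last extracted row, so each run pulls only new data. The bookkeeping is simple; here is a stdlib sketch of the idea, with hypothetical row dicts standing in for whatever a real tap emits:

```python
from datetime import datetime, timezone

# Illustrative sketch of bookmark-based incremental extraction.
# The row shape and "updated_at" field are hypothetical, not
# Pipe's actual record format.
def extract_incremental(rows, bookmark):
    """Return rows newer than the bookmark, plus the advanced bookmark."""
    new_rows = [r for r in rows if r["updated_at"] > bookmark]
    if new_rows:
        bookmark = max(r["updated_at"] for r in new_rows)
    return new_rows, bookmark

rows = [
    {"id": 1, "updated_at": datetime(2026, 4, 17, 22, 0, tzinfo=timezone.utc)},
    {"id": 2, "updated_at": datetime(2026, 4, 17, 23, 30, tzinfo=timezone.utc)},
]
bookmark = datetime(2026, 4, 17, 23, 0, tzinfo=timezone.utc)
new_rows, bookmark = extract_incremental(rows, bookmark)
# Only row 2 is newer than the bookmark; the bookmark advances to its timestamp.
```

This is why the pipeline output below reports a `bookmark:` value in incremental mode: it is the high-water mark the next run will filter against.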
```
> Pull the latest data from Postgres into the lake
```

Claude Code calls Pipe's MCP tools:
```
Running pipeline: postgres-to-lake
Source: tap-postgres (host: db.company.com, schema: public)
Tables: orders, customers, products
Mode: incremental (bookmark: 2026-04-17T23:00:00Z)

✓ orders: 1,247 new rows → s3://my-company-data/raw/postgres/orders/
✓ customers: 34 new rows → s3://my-company-data/raw/postgres/customers/
✓ products: 0 new rows (no changes)

Pipeline completed in 23 seconds.
```

### Read and Debug Logs
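Status lines like the ones above are easy to scan mechanically, which is what makes log questions a good fit for an agent. A stdlib sketch that flags failed streams in a run summary; the ✓/✗ line format is an assumption based on this article's sample logs, not a documented contract:

```python
import re

# Parse ✓/✗ status lines like the pipeline output shown in this
# article and collect the streams that failed.
STATUS = re.compile(r"^([✓✗])\s+(\w+):\s+(.*)$")

def failed_streams(log_text):
    """Map each failed stream name to its status message."""
    failures = {}
    for line in log_text.splitlines():
        m = STATUS.match(line.strip())
        if m and m.group(1) == "✗":
            failures[m.group(2)] = m.group(3)
    return failures

log = """\
✓ orders: 1,247 rows extracted, 1,247 rows loaded
✗ customers: FAILED — connection timeout after 30s
✓ products: 0 rows (no changes)
"""
print(failed_streams(log))
# → {'customers': 'FAILED — connection timeout after 30s'}
```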
```
> Show me the last pipeline run. Did anything fail?
```

Claude Code reads the pipeline logs via Pipe's MCP:
```
Last run: 2026-04-15T14:30:00Z — postgres-to-lake
Status: PARTIAL FAILURE

✓ orders: 1,247 rows extracted, 1,247 rows loaded
✗ customers: FAILED — connection timeout after 30s
✓ products: 0 rows (no changes)

Error details for customers:
psycopg2.OperationalError: could not connect to server:
Connection timed out. Is the server running on host "db.company.com"
and accepting TCP/IP connections on port 5432?
```

### Fix Errors
```
> The customers extraction failed due to a timeout. Can you increase the timeout and retry?
```

Claude Code updates the pipeline configuration and re-runs it:
```
Updated postgres-to-lake pipeline config:
  connection_timeout: 30s → 60s

Retrying customers extraction...
✓ customers: 34 new rows → s3://my-company-data/raw/postgres/customers/

Pipeline completed successfully.
```

## The AGENT.md Pattern
Create an AGENT.md file in your project to give Claude Code persistent context about your data infrastructure:
```markdown
# Data Engineering Agent Context

## Infrastructure
- Data lake: s3://my-company-data
- Source database: PostgreSQL at db.company.com:5432/analytics
- Pipelines: dataspoc-pipe (Singer-based)
- Query engine: dataspoc-lens (DuckDB over Parquet)

## Pipelines
| Pipeline | Source | Schedule | Mode |
|----------|--------|----------|------|
| postgres-to-lake | tap-postgres | Every 6h | Incremental |
| stripe-to-lake | tap-stripe | Every 1h | Incremental |
| sheets-to-lake | tap-google-sheets | Daily | Full |

## Common Tasks
- "Refresh the lake" → run all pipelines
- "Check data freshness" → compare latest timestamps in each table
- "Debug pipeline X" → read logs from .dataspoc/logs/X/

## Data Quality Rules
- orders.amount must be > 0
- customers.email must not be null
- All dates must be <= today (no future dates)

## Permissions
- This agent has READ-ONLY access to the data lake
- Pipeline runs require confirmation before execution
```

Claude Code reads this file automatically and uses it as context for every interaction.
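Quality rules written down in AGENT.md are also simple enough to encode directly, which lets you verify what the agent reports. A stdlib sketch of the three rules above; the row field names are assumed from the schemas shown earlier in this article:

```python
from datetime import date

# Encode the three data-quality rules from AGENT.md as row checks.
# Field names (amount, sale_date, email) follow this article's
# example schemas and are illustrative.
def check_order(row, today=None):
    """Return the rules a single orders row violates."""
    today = today or date.today()
    errors = []
    if not row.get("amount", 0) > 0:
        errors.append("orders.amount must be > 0")
    if row.get("sale_date") and row["sale_date"] > today:
        errors.append("dates must be <= today")
    return errors

def check_customer(row):
    return [] if row.get("email") else ["customers.email must not be null"]

bad_order = {"amount": -5, "sale_date": date(2099, 1, 1)}
print(check_order(bad_order, today=date(2026, 4, 18)))  # both rules violated
print(check_customer({"email": None}))
```

Checks like these can run in CI or as a post-ingestion step, independent of whatever the agent does.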
## A Real Session Example
Here is a complete terminal session showing Claude Code working as a data engineer:
```
$ claude

> I need to understand why our revenue dashboard shows a dip last Tuesday.

I'll investigate. Let me check the data.

[Calling lens.query: SELECT sale_date, SUM(amount) AS daily_revenue
 FROM curated_sales WHERE sale_date BETWEEN '2026-04-07' AND '2026-04-14'
 GROUP BY sale_date ORDER BY sale_date]

Here's daily revenue for the past week:

| date       | revenue  |
|------------|----------|
| 2026-04-07 | $142,000 |
| 2026-04-08 | $138,000 |
| 2026-04-09 | $23,000  | ← significant drop
| 2026-04-10 | $145,000 |
| 2026-04-11 | $151,000 |

April 9th shows $23K vs a ~$140K average. Let me check if data was
actually ingested that day.

[Calling pipe.logs: postgres-to-lake, date=2026-04-09]

Found the issue. The pipeline on April 9th failed:
  Status: FAILED
  Error: "disk full on /tmp during Parquet conversion"
  Rows extracted: 12,450 (expected ~45,000)

Only partial data was loaded. The remaining ~32,550 orders from April 9th
are missing from the lake.

Recommendation:
1. Clear disk space on the pipeline runner
2. Re-run the pipeline for April 9th with a full extraction
3. Verify the revenue number after re-ingestion

Want me to re-run the pipeline?

> Yes, re-run it.

[Calling pipe.run: postgres-to-lake, start_date=2026-04-09,
 end_date=2026-04-09, mode=full]

✓ Pipeline completed: 44,892 rows extracted and loaded.

Let me verify the revenue now:

[Calling lens.cache_refresh: curated_sales]
[Calling lens.query: SELECT SUM(amount) FROM curated_sales
 WHERE sale_date = '2026-04-09']

April 9th revenue is now $141,200 — consistent with other days.
The dashboard dip was caused by incomplete data ingestion, not an
actual revenue drop.
```

## Tips for Effective Prompts
**Be specific about tables:**

```
# Good
> What's the average order value in curated_sales for Q1?

# Bad
> What's our AOV?
```

**Ask for verification:**

```
> Query the top 10 customers by spend, and cross-check the count
> against raw_orders to make sure no orders are missing.
```

**Chain operations:**

```
> Pull fresh data from Stripe, refresh the cache, then tell me
> how many new subscriptions we got this week.
```

**Ask for explanations:**

```
> Run a query to find customers who haven't ordered in 90 days.
> Explain the SQL you used.
```

## Security Model
Claude Code with DataSpoc MCP is safe by design:
- **Lens MCP is read-only.** It rejects any SQL that attempts to write, modify, or delete data.
- **Pipe MCP requires confirmation.** Pipeline runs are gated on user approval.
- **Cloud IAM controls access.** The MCP server inherits your cloud credentials. If you can't access a bucket, neither can Claude Code.
- **Every query is SQL.** You can review exactly what was executed. No opaque retrieval, no hidden data access.
- **No credentials in the config.** The MCP servers use your existing cloud authentication (AWS SSO, gcloud auth, Azure CLI).
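A read-only gate like the one Lens enforces can be approximated with a statement check. This is an illustrative sketch only, not Lens's actual implementation; a production guard should parse the SQL rather than pattern-match it:

```python
import re

# Illustrative read-only SQL gate: allow only statements that begin
# with SELECT (or WITH ... SELECT) and contain no write keywords.
# Not Lens's real implementation.
READ_ONLY = re.compile(r"^\s*(SELECT|WITH)\b", re.IGNORECASE)
FORBIDDEN = re.compile(r"\b(INSERT|UPDATE|DELETE|DROP|ALTER|CREATE|COPY)\b",
                       re.IGNORECASE)

def is_read_only(sql: str) -> bool:
    return bool(READ_ONLY.match(sql)) and not FORBIDDEN.search(sql)

print(is_read_only("SELECT * FROM curated_sales"))           # True
print(is_read_only("DELETE FROM curated_sales"))             # False
print(is_read_only("WITH t AS (SELECT 1) SELECT * FROM t"))  # True
```

Keyword blocklists are conservative: they may reject a legitimate query that merely mentions a forbidden word in a string literal, which is the right failure mode for a guard.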
Claude Code becomes a powerful data engineering assistant without introducing new security risks. It can only do what you could already do, just faster.